An intuitive understanding of the sampling distribution and the central limit theorem using a practical example

English-built ‘boneshaker’ bicycle
Image Source: https://www.jstor.org/stable/community.26290945

The basic idea of inferential statistics is to use a statistic (mean, standard deviation, etc.) calculated on a sample in order to estimate a parameter of a population (mean, standard deviation, etc.). However, we have to accept an unavoidable fact: the value we get for any statistic will vary from sample to sample, even when all samples are selected from the same parent population. We can use the sampling distribution of a sample statistic to draw inferences about the corresponding population parameter.

In this tutorial, we will walk through the process of building a sampling distribution and its important characteristics, which lead to one of the most significant theorems in statistics: the central limit theorem.

A sampling distribution can be described as follows. We randomly draw samples of size n from the parent population, and for each sample we record the value of the statistic we are interested in (for example, the mean of that sample). The samples are selected with replacement, i.e. each sample of size n is returned to the population before the next sample of size n is drawn. If we take all the values we have obtained for this statistic (for example, all the sample means) and construct their frequency distribution, that distribution is called a sampling distribution. If you are getting lost in this meandering definition, here is a graphic which sums up the process:

Sampling Distribution
  • From the population of size N, randomly draw different samples of size n (n < N).
  • Let the samples be (s1, n), (s2, n), (s3, n), …, (sk, n).
  • With this sample data, evaluate the sample statistic (sample mean / sample SD) for each sample.
  • Calculate the frequency distribution of the sample statistic. This frequency distribution is called the sampling distribution.

The mean of this distribution is called the mean of the sample means. The standard deviation of this sampling distribution of the mean is known as the standard error of the mean.
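Before turning to any package, here is a minimal base-R sketch of this process (illustrative only; the population values and the choices of B and n below are arbitrary): draw B samples of size n with replacement, record each sample mean, and then summarize the resulting distribution.

# Minimal sketch of the sampling-distribution process in base R (illustrative)
population <- c(2, 4, 4, 6, 7, 8, 8, 8, 9, 10)  # example population
B <- 1000                                        # number of samples to draw
n <- 2                                           # size of each sample
sample_means <- replicate(B, mean(sample(population, size = n, replace = TRUE)))
mean(sample_means)  # mean of the sample means (close to the population mean)
sd(sample_means)    # standard error of the mean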

The following data show the time, in minutes, it takes to fill orders at a fast-food chain's drive-through lane.

service_time <- c(2,4,4,6,7,8,8,8,9,10)

Using the above population data, we will calculate the sampling distribution of means for samples of size n = 2. The mean value of the population is 6.6 minutes.

In the following table, we list all possible random samples of size n = 2 that could possibly be drawn from this population, and for each one we calculate its sample mean, X_bar.

Table: Sample mean calculation for all samples of size n = 2

We then calculate the frequency distribution of these sample means and create the corresponding frequency histogram. The histogram shows the distribution of the mean values, called the sampling distribution of the mean.
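As a sketch of how that table could be reproduced in R (assuming we enumerate every ordered sample of size 2 drawn with replacement, giving 10 × 10 = 100 samples), one could write:

# Enumerate all ordered samples of size 2 drawn with replacement (100 samples)
all_pairs <- expand.grid(first = service_time, second = service_time)
all_means <- rowMeans(all_pairs)            # X_bar for each sample
table(all_means)                            # frequency distribution of the means
hist(all_means, xlab = "means", main = "")  # exact sampling distribution for n = 2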

We can also create the sampling distribution in R. We will use the sur package, which is a companion to the book Statistics Using R: An Integrated Approach.

# install.packages("sur") # install sur package
library(sur)

set.seed(123) # set random seed for reproducibility

b.mean <-  boot.mean(service_time,B=100,n=2)$bootstrap.samples

table(b.mean) # frequency table

The frequency table is as below:

Frequency Table

The ‘boot.mean()’ function generates a sampling distribution of the ‘service_time’ variable with B=100 samples, each of size n=2.

For each of the 100 repetitions, ‘boot.mean()’ randomly samples, with replacement, a sample of size 2 from our data set of size 10 and computes the mean of these 2 randomly selected values; the 100 resulting means are saved to the variable b.mean.

We can construct the histogram to check the sampling distribution.

hist(b.mean,xlab="means",main="")
Sampling distribution – Sample Size=2

According to the frequency distribution, the mean value of the population is most likely to fall in the range of 4 to 8 minutes.

Let’s change the sample size to n=8. How will the sampling distribution change? We can use the boot.mean() function again and create a histogram.

b.mean1 <-  boot.mean(service_time,B=100,n=8)$bootstrap.samples

table(b.mean1) # frequency table

The frequency table is as below:

Frequency Table, sample size n=8

According to the frequency distribution, the mean value of the population is most likely to fall in the range of 5.5 to 7.5 minutes.
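To visualize this distribution as well, the histogram can be drawn exactly as in the n=2 case:

hist(b.mean1, xlab = "means", main = "")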

Based on the comparison of the two sample sizes, we find that the range of likely values for the population mean decreases as the sample size increases. Thus, as we increase the size of the sample, we expect to obtain a more accurate estimate of the population mean. As a general rule, when we are using sample statistics to estimate population parameters, the larger the sample we use, the more accurate we can expect the estimate to be. We can also observe that the larger the sample size, the more closely the shape of the sampling distribution resembles the normal distribution.
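A quick numerical check of this narrowing, using the two sets of bootstrap means already computed above:

# Spread of the sample means shrinks as the sample size grows
range(b.mean)   # sample size n = 2
range(b.mean1)  # sample size n = 8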

In this small example, although the sample means vary from sample to sample, depending on which two order fill times are selected, the sample means do not vary as much as the individual values in the population. A population contains a wide range of values, from extremely small to extremely large. However, if a sample contains an extreme value, its effect is reduced because that value is averaged with all the other values in the sample.

The standard deviation of all possible sample means is called the standard error of the mean. This value indicates how much the sample means vary from sample to sample. The standard error equals the population standard deviation divided by the square root of the sample size, so as the sample size increases, the standard error of the mean decreases by a factor equal to the square root of the sample size.

# The standard error with sample size n=2
boot.mean(service_time,B=100,n=2)$se
[1] 1.718078

# The standard error with sample size n=8
boot.mean(service_time,B=100,n=8)$se
[1] 0.9016644

The standard error with sample size n=2 is 1.718.
The standard error with sample size n=8 is 0.902.
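As a rough check (assuming the ten order fill times are treated as the entire population), the theoretical standard error sigma/sqrt(n) can be computed directly and compared with the bootstrap estimates:

# Theoretical standard error sigma/sqrt(n), treating service_time as the population
pop_sd <- sqrt(mean((service_time - mean(service_time))^2))  # population SD (divide by N)
pop_sd / sqrt(2)  # roughly 1.71, in line with the bootstrap estimate for n = 2
pop_sd / sqrt(8)  # roughly 0.85, in line with the bootstrap estimate for n = 8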

The population we used in our example has one particular distribution. It can be shown that, if we had begun with a different distribution, we would have obtained a similar result. That is, regardless of the population distribution of order fill times, as the sample size increases, the estimate of the population mean becomes more accurate and the shape of the sampling distribution of the mean becomes more and more like that of a normal distribution.

This is what is called the central limit theorem.

“As the sample size (the number of values in each sample) gets large enough, the sampling distribution of the mean is approximately normally distributed. This is true regardless of the shape of the distribution of the individual values in the population.”

Page 303, Statistics for Managers, 2022

Let’s look at an example with a uniform distribution, in which the values are evenly distributed between the smallest and largest values.

As the sample size is increased, the standard error of the sample means is reduced, and the shape of the histogram becomes more like a normal distribution. This effect happens slowly at first, but once the sample size is increased beyond about 30, the shape of the sampling distribution converges to the normal curve.

# Sampling distributions for one uniformly distributed population at increasing sample sizes
unif_pop <- runif(1000, 10, 100)
par(mfrow = c(2, 2))
hist(boot.mean(unif_pop, B = 100, n = 8)$bootstrap.samples, xlab = "mean", main = "sample size 8")
hist(boot.mean(unif_pop, B = 100, n = 30)$bootstrap.samples, xlab = "mean", main = "sample size 30")
hist(boot.mean(unif_pop, B = 100, n = 60)$bootstrap.samples, xlab = "mean", main = "sample size 60")
hist(boot.mean(unif_pop, B = 100, n = 200)$bootstrap.samples, xlab = "mean", main = "sample size 200")

Sampling distributions for uniformly distributed data

Here are some empirical observations on the central limit theorem:

  • If the distribution of the population is fairly symmetrical, the sampling distribution of the mean is approximately normal for samples as small as size 5.
  • For most distributions, regardless of the shape of the population, the sampling distribution of the mean is approximately normally distributed if samples of at least size 30 are selected, as illustrated by the sketch below.
  • If the population is normally distributed, the sampling distribution of the mean is normally distributed, regardless of the sample size.
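As a quick illustration of the second observation, one can repeat the same boot.mean() experiment with a strongly skewed population; the exponential distribution below is chosen purely as an example and is not part of the original data.

# Illustrative sketch: even a strongly right-skewed (exponential) population
# gives a roughly bell-shaped sampling distribution of the mean at n = 30.
skewed_pop <- rexp(1000, rate = 1)  # example right-skewed population
par(mfrow = c(1, 2))
hist(skewed_pop, xlab = "value", main = "skewed population")
hist(boot.mean(skewed_pop, B = 100, n = 30)$bootstrap.samples,
     xlab = "mean", main = "sample size 30")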

Summary:

The sampling distribution plays a key role in inferential statistics. With the help of the sampling distribution, we can estimate the range in which a population parameter is likely to lie. As the sample size increases, the variability of the sampling distribution, called the standard error, decreases.

As the sample size gets large enough, the sampling distribution of the mean is approximately normally distributed. This is called the central limit theorem.

The central limit theorem is of crucial importance when using statistical inference to reach conclusions about a population.
