In my last article, we learned how to estimate a population mean from a sample mean when the population standard deviation is known. In most instances, however, the population standard deviation is unknown, so a business analyst cannot use that approach to estimate the population mean.

When the population standard deviation is unknown, the sample standard deviation must be used in the estimation process. In this article, we will learn a statistical technique to estimate a population mean using sample mean when the population standard deviation is unknown.

Suppose a production manager at a compact fluorescent light bulb (CFL) factory needs to estimate the mean life of CFL bulbs. The company manufactures millions of light bulbs in a month. In this case, we know neither the average life of the entire population of light bulbs nor the population standard deviation. In such a situation, the production manager can compute the sample mean and sample standard deviation, from which the estimate can be constructed.

The z-formulas presented in the last article are not applicable when the population standard deviation is unknown. Instead, a method for handling such cases was developed by the British statistician William S. Gosset.

**The t-distribution**

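Gosset developed the *t-distribution*, which is used instead of the z-distribution for performing inferential statistics on the population mean when the population standard deviation is unknown and the population is normally distributed. The formula for the *t-statistic* is:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

where $\bar{x}$ is the sample mean, $\mu$ is the population mean, $s$ is the sample standard deviation, and $n$ is the sample size.

The assumption underlying the use of the t-statistic is that the population is normally distributed.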

**Characteristics of t-distribution**

The figure displays two t-distributions superimposed on the standard normal distribution. Like the standard normal curve, t-distributions are symmetric and unimodal. They are flatter in the middle and have more area in their tails than the standard normal distribution. The mean of a t-distribution is zero, but its standard deviation is greater than 1.

When Gosset studied the t-distribution, he found that it is really a family of distributions that vary with their degrees of freedom, symbolized as *df*.

The term degrees of freedom refers to *the number of independent observations for a source of variation minus the number of independent parameters estimated in computing the variation*. For example, suppose a sample of five values has a mean of 20. How many values do you need to know before the remaining values are determined? Since the sample mean is 20 and n=5, the sum of the values must be 100. Thus, you can freely choose four values, but the fifth value is constrained so that the sum is 100.
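The constraint described above can be checked with a few lines of R. This is only an illustrative sketch: the four "free" values are hypothetical numbers chosen for the example, and any other four values would work equally well.

```r
# Hypothetical illustration: four freely chosen values and a fixed sample mean
free_values <- c(18, 25, 14, 22)   # arbitrary choices, not from the article
n <- 5
target_mean <- 20

# The fifth value is forced by the constraint sum(x) = n * mean
fifth_value <- n * target_mean - sum(free_values)
x <- c(free_values, fifth_value)

fifth_value      # the only value that keeps the mean at 20
mean(x)          # confirms the sample mean
length(x) - 1    # degrees of freedom left after fixing the mean
```

Only four of the five values could vary independently, which is why the sample has four degrees of freedom once its mean is fixed.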

In the case of the t-statistic, computing the sample standard deviation requires first estimating the mean with $\bar{x}$, which uses up one degree of freedom; hence *df = n - 1*.

For a t-distribution with more than two degrees of freedom (*df > 2*), the variance is:

$$\mathrm{Var}(t) = \frac{df}{df - 2}$$

The smaller the value of *df*, the larger the variance; as *df* becomes large, the variance of the t-distribution approaches 1.
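To see how quickly the variance settles toward 1, here is a small sketch that evaluates the standard result Var(t) = df / (df - 2) for a handful of degrees of freedom. The particular *df* values are arbitrary choices for illustration.

```r
# Illustrative degrees of freedom (all greater than 2, where the variance is finite)
df_values <- c(3, 5, 10, 30, 100)

# Variance of the t-distribution: df / (df - 2)
t_variance <- df_values / (df_values - 2)

# Tabulate the result; the variance shrinks toward 1 as df grows
data.frame(df = df_values, variance = round(t_variance, 3))
```

With only 3 degrees of freedom the variance is 3, while at 100 degrees of freedom it is already close to 1, which matches the way the t-curves approach the standard normal curve.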

```
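# Grid of x values over which to evaluate the densities
xvalues <- seq(-4, 4, 0.01)
# Standard normal density as the reference curve
plot(xvalues, dnorm(xvalues), type = "l", ylab = "probability density", xlab = "standard deviation")
# t-distributions with 5 and 10 degrees of freedom
lines(xvalues, dt(xvalues, df = 5), col = "red")
lines(xvalues, dt(xvalues, df = 10), col = "blue")
legend("topright", c("normal distribution", "t-distribution (df=5)",
"t-distribution (df=10)"), fill = c("black", "red", "blue"))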
```

**The Example**

A fast-food chain offers a low-priced combination meal to attract budget-conscious customers. Suppose the chain wants to estimate the average amount its customers spend on a meal at its restaurants while the low-priced combination meal offer is in effect. Here are the data from 28 randomly selected customers.

```
spend_data <- c(3.21,5.40,5.5,4.39,5.6,8.65,5.02,4.20,1.25,7.64,3.28,
5.57,3.26,3.80,5.46,9.87,4.67,5.86,3.73,4.08,5.47,4.49,
5.19,5.82,7.62,4.83,8.42,9.10)
```

Since the population mean and standard deviation are not known, we will use the t-distribution to estimate the mean amount customers spend on a meal.

We first have to choose a confidence level for the interval. We will use a 95% confidence level, which corresponds to a significance level of $\alpha = 0.05$.

Since the t-distribution assumes that the data are normally distributed, it is recommended to check normality before applying it. We will use the Shapiro-Wilk normality test to do so.

```
shapiro.test(spend_data)
Shapiro-Wilk normality test
data: spend_data
W = 0.94291, p-value = 0.1311
```

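Since the p-value (0.1311) is greater than $\alpha = 0.05$, we fail to reject the null hypothesis of normality. The test gives no evidence against normality, so we proceed on the assumption that the data are normally distributed.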

A confidence interval based on the t-distribution uses the area in the tails of the distribution: each tail contains $\alpha/2$ of the area under the curve. If a 95% confidence interval is being constructed, the total area in the two tails is 5% ($\alpha = 0.05$), so each tail contains $\alpha/2 = 0.025$.
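The tail area translates into a critical value through the quantile function. As a minimal sketch, assuming the sample size of 28 from the example, `qt()` gives the t critical value, and `qnorm()` gives the corresponding z value for comparison.

```r
conf_level <- 0.95
alpha <- 1 - conf_level          # total area in the two tails
n <- 28                          # sample size from the example

# Critical value that leaves alpha/2 in the upper tail, with df = n - 1
t_crit <- qt(1 - alpha / 2, df = n - 1)

# Corresponding critical value from the standard normal, for comparison
z_crit <- qnorm(1 - alpha / 2)

c(t = t_crit, z = z_crit)
```

The t critical value is larger than the z value, reflecting the heavier tails of the t-distribution and the extra uncertainty from estimating the standard deviation.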

In R, the confidence interval based on the t-distribution is calculated using the `t.test()` function. We pass it the vector of observations and the confidence level.

```
t.test(spend_data,conf.level = 0.95)$conf.int
[1] 4.637065 6.175792
attr(,"conf.level")
[1] 0.95
```

With 95% confidence, the average amount a customer spends on a meal at the restaurant is between 4.64 dollars and 6.18 dollars.
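To make the calculation behind `t.test()` explicit, the same interval can be sketched by hand as the sample mean plus or minus the t critical value times the standard error. The variable names below (`x_bar`, `s`, `margin`, `manual_ci`) are introduced only for this illustration.

```r
spend_data <- c(3.21,5.40,5.5,4.39,5.6,8.65,5.02,4.20,1.25,7.64,3.28,
5.57,3.26,3.80,5.46,9.87,4.67,5.86,3.73,4.08,5.47,4.49,
5.19,5.82,7.62,4.83,8.42,9.10)

n <- length(spend_data)          # sample size
x_bar <- mean(spend_data)        # sample mean
s <- sd(spend_data)              # sample standard deviation (n - 1 denominator)

# Critical value for a 95% interval with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

# Margin of error: critical value times the standard error of the mean
margin <- t_crit * s / sqrt(n)

manual_ci <- c(lower = x_bar - margin, upper = x_bar + margin)
manual_ci
```

The result should agree with the interval reported by `t.test(spend_data, conf.level = 0.95)$conf.int`.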

**Summary**

When the population standard deviation is unknown, the sample standard deviation must be used in the estimation process. The t-distribution is used to estimate the population mean in this case: it combines the sample mean and the sample standard deviation to construct a confidence interval for the population mean.