Confidence Intervals for Means

As we previously noted, a point estimate is almost certain to be incorrect, but an interval estimate can have a high probability of containing the correct value of the population parameter. Let us consider the situation of obtaining a confidence interval for a mean.

Identifying the Correct Formula

When the population standard deviation $\sigma$ is known, the sample mean distribution will follow a normal distribution. However, that is rarely the case in practice. When the population standard deviation $\sigma$ is unknown, and the sample standard deviation $s$ is used in its place, the sample mean distribution will follow the Student's t-distribution with $df = n-1$ degrees of freedom.

To identify the correct formula for the confidence interval of the population mean, we need to determine whether the population standard deviation is known or not, whether the sample size is large or not, and whether the original population was normally distributed or not. In each case, we assume a sample of size $n$ had a sample mean of $\bar{x}$.

If the population standard deviation $\sigma$ is known, ...

... and the sample size is large ($n \ge 30$), then use the normal distribution and the formula $\bar{x} \pm z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}}$.
... and the sample size is small ($n < 30$), ...

... and the original population is normal, then use the normal distribution and the formula $\bar{x} \pm z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}}$.
... and the original population is not normal, then a nonparametric method is needed. A confidence interval could be constructed for the median instead.

If the population standard deviation $\sigma$ is unknown, so that the sample standard deviation $s$ must be used instead, ...
- ... and the sample size is large ($n \ge 30$), then use the t-distribution with $df = n - 1$ degrees of freedom and the formula $\bar{x} \pm t_{\alpha/2} \dfrac{s}{\sqrt{n}}$.
- ... and the sample size is small ($n < 30$), ...

A Sample with a Known Population Standard Deviation

A sample of 150 county residents finds the mean educational level to be 12.2 years. Suppose it is known that $\sigma = 3.1$ years. Find a 95% confidence interval for the population mean.

Since $\sigma$ is known, and the sample size $n$ is large, we will use the standard normal distribution.
The standard error is $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{3.1}{\sqrt{150}} \approx 0.253$ years.
For a 95% confidence interval, we have $\alpha = 0.05$, so $z_{\alpha/2} = z_{0.025} = 1.96$.
Then $\bar{x} \pm z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}} = 12.2 \pm (1.96)(0.253) = 12.2 \pm 0.5$, so the interval is $[11.7, 12.7]$.

In other words, we can be 95% confident that the mean educational level of the population falls between 11.7 years and 12.7 years.

This situation is really quite unrealistic. If you need to sample in order to obtain an estimate for the mean, how is it that you already know the standard deviation? Yet this type of problem is still valuable pedagogically, for it provides us with a better understanding of the relationship between the quantities we observe and the confidence obtained in our interval estimates.

A Large Sample

The manager of a tree nursery measures the heights of 39 seedlings and obtains a mean and standard deviation of 58.5 cm and 4.3 cm, respectively. Find the 90% confidence interval for the population mean.

Since $\sigma$ is unknown, and the sample size $n$ is large, we will use the Student's t-distribution.
The standard error is $\dfrac{s}{\sqrt{n}} = \dfrac{4.3}{\sqrt{39}} \approx 0.689$ cm.
With 38 degrees of freedom and a 90% confidence interval, we have $\alpha = 0.10$, so $t_{\alpha/2} = t_{0.05} = 1.686$.
Then $\bar{x} \pm t_{\alpha/2} \dfrac{s}{\sqrt{n}} = 58.5 \pm (1.686)(0.689) = 58.5 \pm 1.2$, so the interval is $[57.3, 59.7]$.

We can be 90% confident that the mean height of the seedlings is between 57.3 cm and 59.7 cm.

Before the advent of powerful calculation tools, statisticians frequently used the normal distribution as an estimate of the t-distribution when the sample size was large. If we had done this, we would have obtained a confidence interval of heights between 57.4 and 59.8 cm. The result would have been close, but the difference gives a false impression of slightly greater precision in our estimate.

A Sample from a Normal Population

A researcher studies the lengths (in meters) of skids on a road, and obtains the following data.

438, 303, 408, 155, 237, 366, 383, 521, 381, 492
329, 247, 144, 203, 226, 351, 265, 264, 103, 372

Find the 95% confidence interval for the mean length of a skid.

Since $\sigma$ is unknown, and the sample size is small ($n = 20$), we need to determine whether the original population is normal. This we do by examining a normal probability plot of the data.

Since the normal probability plot is roughly linear, we can assume that the original population data is approximately normal, and therefore we can use the Student's t-distribution.
From the data, we compute the mean and standard deviation to be $\bar{x} = 309.4$ meters and $s = 114.6$ meters.
The standard error is $\dfrac{s}{\sqrt{n}} = \dfrac{114.6}{\sqrt{20}} \approx 25.625$ meters.
With 19 degrees of freedom and a 95% confidence interval, we have $\alpha = 0.05$, so $t_{\alpha/2} = t_{0.025} = 2.093$.
Then $\bar{x} \pm t_{\alpha/2} \dfrac{s}{\sqrt{n}} = 309.4 \pm (2.093)(25.625) = 309.4 \pm 53.6$, so the interval is $[255.8, 363.0]$.

We can be 95% confident that the mean skid length is between 255.8 meters and 363.0 meters.

Sample Size

Given a particular sample size, we should notice that to increase our confidence in our estimate, we had to enlarge the interval, thus decreasing the precision with which we know the result. And similarly, increasing the precision will decrease the confidence. If we want both the confidence in being correct, and a fairly precise answer, then we need to increase the sample size.

Sample size estimates are usually done with the normal distribution, even though the population standard deviation may not be known. This is because the t-distribution does approach the normal distribution as the sample size increases, and is approximately normal when the sample is large ($n \ge 30$).

The confidence interval formula was $\bar{x} \pm z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}}$. The margin of error is defined to be the distance from the center to the end of the interval, and is denoted by the variable $E$. Therefore, we have the equation $E = z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}}$. This equation can be solved for $n$ to produce the following formula.

$n = \left( \dfrac{ z_{\alpha/2} \sigma}{E} \right)^2$

It is common practice to round the result up to the nearest integer. If the population standard deviation $\sigma$ is unknown, then the sample standard deviation $s$ should be substituted for $\sigma$. If no standard deviation is known because the data has not yet been collected, a small pilot sample (with sample size at least 30) should be collected to estimate the standard deviation.

As an example, suppose the learning time for a particular task has a standard deviation (estimated from at least 30 observations) of 4.3 minutes. What sample size would be needed to estimate the mean learning time to within 0.5 minutes, with a confidence of 99%?

For a 99% confidence interval, we have $\alpha = 0.01$, so $z_{\alpha/2} = z_{0.005} = 2.576$.
Then $n = \left( \dfrac{ z_{\alpha/2} \sigma}{E} \right)^2 = \left( \dfrac{(2.576)(4.3)}{0.5} \right)^2 = 22.15^2 = 490.8$.

Rounding up, we find that we should have 491 observations in our sample in order to achieve an estimate with a margin of error at most 0.5 minutes, and have 99% confidence in the result.