Sample Variance Distributions

Variance is the second moment of a distribution about its mean. Since we have seen that squared standard scores have a chi-square distribution, we would expect a suitably scaled sample variance to have one as well.

The Theory

We begin by letting $X$ be a random variable having a normal distribution. We recall the definitions of population variance and sample variance.

\begin{align} \sigma^2 &= \dfrac{ \sum\limits_{i=1}^N (x_i - \mu)^2}{N} \\ s^2 &= \dfrac{\sum\limits_{i=1}^n (x_i - \bar{x})^2}{n-1} \end{align}

The differences in these two formulas involve both the mean used ($\mu$ vs. $\bar{x}$), and the quantity in the denominator ($N$ vs. $n - 1$). We must keep both of these in mind when analyzing the distribution of variances. Also, we recognize that the value of $s^2$ depends on the sample chosen, and is therefore a random variable that we designate $S^2$.

We shall use the population standard score formula $z_i = \dfrac{x_i - \mu}{\sigma}$ to develop a relationship between population and sample variances. In particular, we have

\begin{align} \sum\limits_{i=1}^n \left( \dfrac{x_i - \mu}{\sigma} \right)^2 &= \sum\limits_{i=1}^n \left( \dfrac{x_i - \bar{x}}{\sigma} + \dfrac{\bar{x} - \mu}{\sigma} \right)^2 \\ &= \sum\limits_{i=1}^n \left( \dfrac{x_i - \bar{x}}{\sigma} \right)^2 + \dfrac{2}{\sigma^2} (\bar{x} - \mu) \sum\limits_{i=1}^n (x_i - \bar{x}) + \sum\limits_{i=1}^n \left( \dfrac{\bar{x} - \mu}{\sigma} \right)^2 \end{align}

In the last line of the expression above, the second sum is the sum of the deviations of a set of data, and that sum is always zero. The third sum involves a constant being added $n$ times. Therefore, we obtain the following result.

\begin{align} \sum\limits_{i=1}^n \left( \dfrac{x_i - \mu}{\sigma} \right)^2 &= \dfrac{1}{\sigma^2} \sum\limits_{i=1}^n (x_i - \bar{x})^2 + n \left( \dfrac{\bar{x} - \mu}{\sigma} \right)^2 \\ &= \dfrac{1}{\sigma^2} \sum\limits_{i=1}^n (x_i - \bar{x})^2 + \left( \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}} \right)^2 \end{align}

The expression on the left hand side of the equation is the sum of the squared standard scores of $n$ independent observations from a normal distribution, and therefore has a chi-square distribution with $n$ degrees of freedom. The last term on the right hand side of the equation is the squared standard score of the distribution of sample means whose population was normally distributed, and therefore this term also has a chi-square distribution, but with one degree of freedom.
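The algebraic identity itself can be checked numerically before we go on. The following is a minimal sketch using only the Python standard library; the function name `decomposition_terms`, the seed, and the population parameters are illustrative assumptions, not part of the original development.

```python
import math
import random

def decomposition_terms(xs, mu, sigma):
    """Return (lhs, first_term, second_term) of the identity above."""
    n = len(xs)
    xbar = sum(xs) / n
    # Left side: sum of squared population standard scores.
    lhs = sum(((x - mu) / sigma) ** 2 for x in xs)
    # Right side, first term: squared deviations from the sample mean.
    first_term = sum((x - xbar) ** 2 for x in xs) / sigma ** 2
    # Right side, second term: squared standard score of the sample mean.
    second_term = ((xbar - mu) / (sigma / math.sqrt(n))) ** 2
    return lhs, first_term, second_term

rng = random.Random(0)      # fixed seed so the run is repeatable
mu, sigma = 10.0, 2.0       # illustrative population parameters
sample = [rng.gauss(mu, sigma) for _ in range(8)]

xbar = sum(sample) / len(sample)
print(sum(x - xbar for x in sample))     # deviations sum to zero, up to rounding

lhs, first_term, second_term = decomposition_terms(sample, mu, sigma)
print(lhs, first_term + second_term)     # the two numbers agree, up to rounding
```

The identity is purely algebraic, so it holds for any data set and any choice of $\mu$ and $\sigma > 0$; normality matters only for the distributional claims.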

Recall that the moment generating function of a chi-square distribution with $r$ degrees of freedom is $(1-2t)^{-r/2}$. Furthermore, when independent random variables are added, their moment generating functions are multiplied. For samples from a normal population, the sample mean and the sample variance are independent, so the two terms on the right side of the equation above are independent. Thus, writing $M(t)$ for the moment generating function of the first term on the right side, we can find it from the following relation.

\begin{equation} (1-2t)^{-n/2} = M(t) (1-2t)^{-1/2} \end{equation}

Therefore, the moment generating function of the expression $\dfrac{1}{\sigma^2} \sum\limits_{i=1}^n (x_i - \bar{x})^2$ is $M(t) = (1-2t)^{-(n-1)/2}$. We recognize this moment generating function as belonging to a chi-square distribution with $n-1$ degrees of freedom. But since

\begin{equation} \dfrac{1}{\sigma^2} \sum\limits_{i=1}^n (x_i - \bar{x})^2 = \dfrac{n-1}{\sigma^2} \sum\limits_{i=1}^n \dfrac{(x_i - \bar{x})^2}{n-1} = \dfrac{(n-1)s^2}{\sigma^2} \end{equation}

we find that the random variable $\dfrac{(n-1)S^2}{\sigma^2}$ has a chi-square distribution with $n-1$ degrees of freedom.
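A simulation gives a rough check of this result. A chi-square distribution with $n-1$ degrees of freedom has mean $n-1$ and variance $2(n-1)$, so the simulated values of $\dfrac{(n-1)S^2}{\sigma^2}$ should have approximately those moments. The sketch below uses only the Python standard library; the function name `scaled_sample_variances`, the sample size, the number of trials, and the seed are all illustrative assumptions.

```python
import random
import statistics

def scaled_sample_variances(n, sigma, trials, seed=0):
    """Simulate (n-1)S^2/sigma^2 for `trials` normal samples of size n."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        # Sample variance with the n - 1 denominator, as defined above.
        s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
        values.append((n - 1) * s2 / sigma ** 2)
    return values

n = 5
values = scaled_sample_variances(n, sigma=1.2, trials=20000)
print(statistics.mean(values))       # should be near n - 1 = 4
print(statistics.variance(values))   # should be near 2(n - 1) = 8
```

Matching two moments does not prove that the distribution is chi-square, but a clear mismatch would signal an error in the derivation.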

Degrees of Freedom

We have been treating degrees of freedom as simply a parameter in the chi-square distribution. In fact, it has a very important interpretation: it counts the number of independent quantities in the expression under consideration. A more rigorous mathematical discussion would equate the independent variables with vectors in a vector space, but in this exposition we shall simply relate the important characteristics that result.

In the quantity $(x_i - \bar{x})^2$, the sample mean $\bar{x}$ was itself computed from the $n$ data values $x_i$, as their sum divided by $n$. If we know the value of $\bar{x}$ and begin examining our sample, once we have measured $n-1$ of the data values $x_i$, we will automatically know the last value. Specifically, we have $x_n = n\bar{x} - \sum\limits_{i=1}^{n-1} x_i$. The last value, $x_n$, is not free to vary. Therefore the quantity $\sum\limits_{i=1}^n (x_i - \bar{x})^2$ has $n-1$ degrees of freedom.
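The loss of one degree of freedom can be seen in a small numerical sketch. Because $\bar{x}$ is the sum of the data divided by $n$, the last value equals $n\bar{x}$ minus the sum of the other $n-1$ values. The function name `last_value` and the data are illustrative assumptions.

```python
def last_value(first_values, xbar):
    """Recover x_n from the other n - 1 values and the sample mean."""
    n = len(first_values) + 1
    return n * xbar - sum(first_values)

data = [4.0, 7.0, 1.0, 8.0]          # illustrative sample, n = 4
xbar = sum(data) / len(data)         # 5.0

# Withhold the last value and reconstruct it from the mean and the rest.
print(last_value(data[:-1], xbar))   # prints 8.0
```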

The quantity $(x_i - \mu)^2$ behaves differently. The data values $x_i$ are from a sample, but $\mu$ describes the population mean. Knowing $\mu$ and $n-1$ values of $x_i$ does not force a particular value of $x_n$. The last value $x_n$ is still free to vary. Therefore, the quantity $\sum\limits_{i=1}^n (x_i - \mu)^2$ has $n$ degrees of freedom.

And in the quantity $(\bar{x} - \mu)^2$, knowing the value of $\mu$ will not tell us anything about the value of $\bar{x}$. Therefore, the quantity $\bar{x}$ is free to vary. This mathematical expression does not explicitly depend on knowledge of particular data values, so there is only one unknown quantity, $\bar{x}$, that needs to be determined. This expression has one degree of freedom.

An Example

Suppose the weights of bags of flour are normally distributed with a population standard deviation of $\sigma = 1.2$ ounces. Find the probability that a random sample of 200 bags would have a sample standard deviation between 1.1 and 1.3 ounces.

We evaluate the random variable $\dfrac{(n-1)S^2}{\sigma^2}$ at the endpoints of the interval in question.

\begin{align} \dfrac{(n-1)s_1^2}{\sigma^2} &= \dfrac{(200-1)1.1^2}{1.2^2} \approx 167.22 \\ \dfrac{(n-1)s_2^2}{\sigma^2} &= \dfrac{(200-1)1.3^2}{1.2^2} \approx 233.55 \end{align}

The probability will be the area under the chi-square distribution between these values. We find

\begin{equation} \chi^2\operatorname{cdf} (167.22,233.55,199) \approx 0.9037 \end{equation}

There is approximately a 90.37% probability that the standard deviation of the weights of the sample of 200 bags of flour will fall between 1.1 and 1.3 ounces.

It should be noted that although the mean of this chi-square distribution is its number of degrees of freedom, 199, the interval obtained is not symmetric about that value, either in length or in probability. In fact, 46.41% of the total area lies in the part of the interval to the left of 199, and 43.96% lies in the part to the right.
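The figures in this example can be reproduced without a calculator's $\chi^2\operatorname{cdf}$ function. The sketch below is one possible approach, not the method used above: it evaluates the chi-square cumulative distribution function through the standard power series for the regularized lower incomplete gamma function, using only the Python standard library. The helper name `chi2_cdf` is an assumption, and the series is adequate for arguments of this size but is not a general-purpose routine. In practice a library function such as `scipy.stats.chi2.cdf` would normally be used instead.

```python
import math

def chi2_cdf(x, df):
    """Chi-square CDF via the power series for the regularized lower
    incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, y = df / 2.0, x / 2.0
    term = 1.0 / a               # k = 0 term of the series
    total = term
    k = 0
    while term > total * 1e-15:  # stop once terms no longer change the sum
        k += 1
        term *= y / (a + k)
        total += term
    return total * math.exp(-y + a * math.log(y) - math.lgamma(a))

n, sigma = 200, 1.2
df = n - 1
lower = df * 1.1 ** 2 / sigma ** 2   # about 167.22
upper = df * 1.3 ** 2 / sigma ** 2   # about 233.55

total = chi2_cdf(upper, df) - chi2_cdf(lower, df)
below_mean = chi2_cdf(df, df) - chi2_cdf(lower, df)   # area between lower and 199
above_mean = chi2_cdf(upper, df) - chi2_cdf(df, df)   # area between 199 and upper
print(round(total, 4), round(below_mean, 4), round(above_mean, 4))
```

The three printed values can be compared with the 0.9037, 0.4641, and 0.4396 quoted in the example; small differences in the last digit may arise because the endpoints here are not rounded to two decimal places.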