Measures of Dispersion

Averages cannot tell the entire story of a data set. Consider, for example, the following three data sets, giving the heights of the starters of three high school basketball teams.

Bulls	$\qquad 70", 71", 72", 73", 74"$
Eagles	$\qquad 66", 71", 72", 73", 78"$
Pythons	$\qquad 66", 67", 72", 77", 78"$

Each of these data sets has a mean of 72 inches, a median of 72 inches, and no mode. But the heights of the players on these teams are not identical. The players on the Bulls are quite alike in height, but the heights of the other teams are more spread out. We would like to measure the spread, or dispersion, of the heights.

From Range to Standard Deviation

The most obvious way to measure spread would be to calculate the difference between the extreme values of the set. This is called the range, and it is found by subtracting the minimum from the maximum. For the Bulls, the range is 4 inches, and for the Eagles and Pythons the range is 12 inches. So we see the range did distinguish one of the three teams, but not the other two.
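As a quick check of these figures, the range of each team can be computed directly. This is only a minimal sketch; the dictionary `teams` and the helper `data_range` are names chosen here for illustration.

```python
# Starter heights in inches, copied from the three team lists above.
teams = {
    "Bulls": [70, 71, 72, 73, 74],
    "Eagles": [66, 71, 72, 73, 78],
    "Pythons": [66, 67, 72, 77, 78],
}

def data_range(values):
    """Statistical range: the maximum minus the minimum (a single number)."""
    return max(values) - min(values)

ranges = {name: data_range(heights) for name, heights in teams.items()}
print(ranges)  # {'Bulls': 4, 'Eagles': 12, 'Pythons': 12}
```

The Eagles and Pythons receive the same value even though their middle heights differ, which is the weakness of the range noted above.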

Caution: The range used in statistics is not the same as the range of a function from algebra and calculus. The range of a function is a set of $y$-values, often an interval, and a possible answer might be $[66,78]$. But the range in statistics is a single value equal to the difference of the endpoints, that is, 12 inches. In statistics, we always want the single-valued range.

The range did not distinguish between all three teams because it only took into account two of the values on each team, and ignored all of the rest of the values. If we want to consider all of the values, we need a quantity that will use all of them in the computation. To do that, we first consider the deviation, or difference from the mean. Symbolically, the deviation is the quantity $x_i - \bar{x}$ , if sample data is being used, or $x_i-\mu$, if population data is being used. Here are the deviations for the three teams.

Bulls	$\qquad -2", -1", 0", 1", 2"$
Eagles	$\qquad -6", -1", 0", 1", 6"$
Pythons	$\qquad -6", -5", 0", 5", 6"$

To obtain a single number, we might decide to find an average of the deviations. But the mean of the deviations is always zero, for any data set, because the positive and negative deviations cancel. For these three teams the median of the deviations is also zero, and just as with the original data, the deviations have no mode. We undoubtedly recognize that the presence of the negative signs is what has caused this result.
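The cancellation can be seen numerically. The sketch below recomputes the deviations for each team and averages them; the list `skewed` is a hypothetical data set, not taken from the text, included only to show that the mean of the deviations is zero in general while the median of the deviations is zero only when the data happen to be balanced about the mean, as the three teams are.

```python
from statistics import mean, median

teams = {
    "Bulls": [70, 71, 72, 73, 74],
    "Eagles": [66, 71, 72, 73, 78],
    "Pythons": [66, 67, 72, 77, 78],
}

def deviations(values):
    """Difference of each value from the mean of the data."""
    center = mean(values)
    return [x - center for x in values]

team_deviations = {name: deviations(h) for name, h in teams.items()}
for name, devs in team_deviations.items():
    print(name, devs, "mean:", mean(devs), "median:", median(devs))

# Hypothetical lopsided data set (an assumption for illustration only).
skewed = [1, 2, 9]
skewed_devs = deviations(skewed)
print(skewed_devs, "mean:", mean(skewed_devs), "median:", median(skewed_devs))
```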

What if we ignore the negative signs? Or more mathematically, take absolute values? We can then obtain a quantity called the mean absolute deviation. The formula for this quantity is $\dfrac{\sum\limits_{i=1}^n |x_i-\bar{x}|}{n}$, if sample data is used, with a similar formula for population data. Computing the mean absolute deviation of the starter heights for each team, we get 1.2 inches for the Bulls, 2.8 inches for the Eagles, and 4.4 inches for the Pythons. The mean absolute deviation does distinguish between the three data sets.
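The three values quoted above can be reproduced with a few lines of code. This is a sketch of the formula as written, with the function name `mean_absolute_deviation` chosen here for illustration.

```python
from statistics import mean

teams = {
    "Bulls": [70, 71, 72, 73, 74],
    "Eagles": [66, 71, 72, 73, 78],
    "Pythons": [66, 67, 72, 77, 78],
}

def mean_absolute_deviation(values):
    """Average of the absolute deviations |x_i - mean|."""
    center = mean(values)
    return sum(abs(x - center) for x in values) / len(values)

mads = {name: mean_absolute_deviation(h) for name, h in teams.items()}
for name, value in mads.items():
    print(f"{name}: {value:.1f} inches")
```

Unlike the range, this quantity uses every height, so the Eagles and the Pythons no longer receive the same value.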

However, in the formula of the mean absolute deviation there lurks a difficulty. It uses absolute values. When dealing with absolute values in former math classes, what did you usually have to do? Probably you had to break the problem into cases. That is not a direction we really want to go, so we will set aside this formula.

Another approach to removing the negative signs is to square each of the deviations. This approach yields two measures of dispersion: the variance, and its square root, called the standard deviation. The formulas are as follows:

	Population Formula	Sample Formula
Variance	$\sigma^2=\dfrac{\sum\limits_{i=1}^N (x_i-\mu)^2}{N}$	$s^2=\dfrac{\sum\limits_{i=1}^n (x_i-\bar{x})^2}{n-1}$
Standard Deviation	$\sigma=\sqrt{\dfrac{\sum\limits_{i=1}^N (x_i-\mu)^2}{N}}$	$s=\sqrt{\dfrac{\sum\limits_{i=1}^n (x_i-\bar{x})^2}{n-1}}$

Besides using different symbols for the mean and the size of the data set, the two columns differ in another very important way: the sample formulas have a different denominator. That is because the sample variance is intended to be used as an estimator for the population variance, and to obtain an unbiased estimator (one that will not be too small on average) it is necessary to use the smaller denominator, $n-1$.
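One way to see the effect of the denominator is to list every possible sample from a small population and average the resulting estimates. The sketch below makes several assumptions that are not in the text: it treats the five Eagles heights as the whole population, uses samples of size 2, and draws them with replacement, so that all 25 ordered samples are equally likely.

```python
from itertools import product

population = [66, 71, 72, 73, 78]  # Eagles heights, treated as a population
N = len(population)
mu = sum(population) / N
sigma_sq = sum((x - mu) ** 2 for x in population) / N  # population variance

n = 2  # assumed sample size, kept small so every sample can be listed
samples = list(product(population, repeat=n))  # all ordered samples, with replacement

def sum_sq_dev(sample):
    """Sum of squared deviations of a sample from its own mean."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample)

# Average the estimate over every possible sample, for each denominator.
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print("population variance:     ", sigma_sq)
print("average when dividing by n:    ", avg_div_n)
print("average when dividing by n - 1:", avg_div_n_minus_1)
```

Under these assumptions, dividing by $n$ gives estimates that average to only half of the population variance, while dividing by $n-1$ gives estimates that average to the population variance itself.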

Let us compute the variance and standard deviation of the heights for the population of the starters on the Eagles. The computation is:

$\sigma^2=\dfrac{\sum\limits_{i=1}^N (x_i-\mu)^2}{N} =\dfrac{(66"-72")^2+(71"-72")^2+(72"-72")^2+(73"-72")^2+(78"-72")^2}{5}$

$\qquad =\dfrac{((-6)^2+(-1)^2+0^2+1^2+6^2) \text{ square inches}}{5} =\dfrac{(36+1+0+1+36) \text{ square inches}}{5}$

$\qquad =\dfrac{74}{5}\text{ square inches}=14.8 \text{ square inches}$

$\sigma = \sqrt{\dfrac{74}{5}\text{ square inches}} = \dfrac{\sqrt{370}}{5}\text{ inches} \approx 3.85 \text{ inches}$

So the population variance for the Eagles is 14.8 square inches, and the population standard deviation for the Eagles is approximately 3.85 inches. If we had used the sample formulas, we would have obtained a variance of 18.5 square inches, and a standard deviation of approximately 4.30 inches.
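The same four quantities can be checked with Python's standard `statistics` module, whose `pvariance` and `pstdev` functions divide by $N$, while `variance` and `stdev` divide by $n-1$. This is only a verification sketch; the variable names are chosen here for illustration.

```python
from statistics import pstdev, pvariance, stdev, variance

eagles = [66, 71, 72, 73, 78]  # heights in inches

pop_var = pvariance(eagles)   # population variance, denominator N
pop_sd = pstdev(eagles)       # population standard deviation
samp_var = variance(eagles)   # sample variance, denominator n - 1
samp_sd = stdev(eagles)       # sample standard deviation

print(f"population: variance {pop_var:.1f} sq in, std dev {pop_sd:.2f} in")
print(f"sample:     variance {samp_var:.1f} sq in, std dev {samp_sd:.2f} in")
```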

Having kept track of the units, we can see why the square root proved necessary. We were attempting to find a way to measure average distance, so we needed a result whose units would be distance. Variance produced a unit of area, not of distance. Even though the units are different, variance does occur often enough in a study of statistics to warrant its own name.

The Standard Deviation of a Frequency Distribution

When computing the standard deviation from a frequency distribution, a weighted formula is required. The formulas are as follows:

	Population Formula	Sample Formula
Variance	$\sigma^2=\dfrac{\sum\limits_{i=1}^N w_i (x_i-\mu)^2}{\sum\limits_{i=1}^N w_i}$	$s^2=\dfrac{\sum\limits_{i=1}^n w_i(x_i-\bar{x})^2}{\left(\sum\limits_{i=1}^n w_i\right)-1}$
Standard Deviation	$\sigma=\sqrt{\dfrac{\sum\limits_{i=1}^N w_i(x_i-\mu)^2}{\sum\limits_{i=1}^N w_i}}$	$s=\sqrt{\dfrac{\sum\limits_{i=1}^n w_i(x_i-\bar{x})^2} {\left(\sum\limits_{i=1}^n w_i\right)-1}}$

Suppose the heights of 169 freshmen at Western High School were found, and the results provided in the following table.

Height	Number of Students
135-149 cm	23
150-164 cm	36
165-179 cm	29
180-194 cm	64
195-209 cm	17

This is the same example that we used on the page Measures of Central Tendency. There, we found the mean $\bar{x} =\frac{29308}{169} \approx 173.42$ cm. Now we compute the standard deviation, showing the details in the following table:

Height	Class Midpoints $x_i$	Deviations $x_i - \bar{x}$	Number of Students $w_i$	Weighted Squared Deviations $w_i (x_i - \bar{x})^2$
135-149 cm	142 cm	$-31.42$ cm	23	22705.98 cm²
150-164 cm	157 cm	$-16.42$ cm	36	9706.19 cm²
165-179 cm	172 cm	$-1.42$ cm	29	58.48 cm²
180-194 cm	187 cm	$13.58$ cm	64	11802.65 cm²
195-209 cm	202 cm	$28.58$ cm	17	13885.88 cm²
Totals			169	58159.18 cm²

Because we are using the sample formula, the denominator is $169-1=168$. We then have $s^2=\dfrac{58159.18}{168} \approx 346.19$ cm², which gives $s \approx \sqrt{346.19} \approx 18.61$ cm.
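The table computation can be reproduced as follows. This sketch uses the unrounded mean, so its sum of weighted squared deviations (about 58159.17) differs slightly from the table total, which was built from the rounded mean 173.42. It reports both denominators: dividing by 169 gives about 344.14 cm² (standard deviation about 18.55 cm), and dividing by 168 gives about 346.19 cm² (standard deviation about 18.61 cm).

```python
from math import sqrt

midpoints = [142, 157, 172, 187, 202]  # class midpoints x_i, in cm
counts = [23, 36, 29, 64, 17]          # number of students w_i in each class

total = sum(counts)
mean = sum(w * x for x, w in zip(midpoints, counts)) / total

# Sum of weighted squared deviations: the numerator of both variance formulas.
ss = sum(w * (x - mean) ** 2 for x, w in zip(midpoints, counts))

pop_var = ss / total         # population formula: divide by the sum of weights
samp_var = ss / (total - 1)  # sample formula: divide by one less

print(f"mean = {mean:.2f} cm, weighted sum of squares = {ss:.2f}")
print(f"population: variance {pop_var:.2f}, std dev {sqrt(pop_var):.2f} cm")
print(f"sample:     variance {samp_var:.2f}, std dev {sqrt(samp_var):.2f} cm")
```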

The Standard Deviation and the Spread of the Data

The standard deviation certainly tells us something about an average distance data values fall from the mean. But an average also places restrictions on where data can fall. The following theorem quantifies that restriction.

Chebyshev's Theorem: For any distribution of data having a finite mean and a finite standard deviation, at least $1-\dfrac{1}{z^2}$ of the data will fall within $z$ standard deviations of the mean.

Let us test this result on the starter heights of the Eagles, using 2 standard deviations. We had earlier found that the Eagles had a mean height of 72 inches, and a standard deviation of approximately 3.85 inches. There are three stages to the verification.

For $z=2$, the theorem says that at least $1-\dfrac{1}{2^2}=0.75=75\%$ of the data will fall within the prescribed interval.
The values that fall within two standard deviations will be greater than $\mu-2\sigma=72-2(3.85)=64.3$ inches, and less than $\mu+2\sigma=72+2(3.85)=79.7$ inches.
Checking the actual data, we find that $\dfrac55=100\%$ of the Eagle starter heights fall within the interval $[64.3,79.7]$, and 100% is at least 75%.
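The three stages above can be sketched in code. The helper names `chebyshev_bound` and `fraction_within` are chosen here for illustration, and the heights are treated as a population, as in the text.

```python
from statistics import mean, pstdev

def chebyshev_bound(z):
    """Minimum fraction of data within z standard deviations of the mean."""
    return 1 - 1 / z ** 2

def fraction_within(values, z):
    """Observed fraction of values within z population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    inside = [x for x in values if abs(x - mu) <= z * sigma]
    return len(inside) / len(values)

eagles = [66, 71, 72, 73, 78]  # heights in inches
z = 2
bound = chebyshev_bound(z)
observed = fraction_within(eagles, z)

print(f"guaranteed: at least {bound:.0%}, observed: {observed:.0%}")
assert observed >= bound  # the data respect the guaranteed minimum
```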

Chebyshev's Theorem holds for every data set, and it places a restriction on how spread out the data can be. (For a proof, see Expected Value and Variance Properties.) On the other hand, the formula does not provide any useful information for $z\le 1$, since $1-\dfrac{1}{z^2}$ is then zero or negative, and the theorem only tells us that at least 0% of the data will fall within that interval. But this is to be expected from a formula that works for every possible data set. There can be a lot of variety, and Chebyshev's Theorem must accommodate all of that variety. If we restrict the variety possible, we can obtain a more specific result.

Empirical Rule: If a data set is approximately normally distributed (bell-shaped), then

about 68% of the data will fall within 1 standard deviation of the mean
about 95% of the data will fall within 2 standard deviations of the mean
about 99.7% of the data will fall within 3 standard deviations of the mean

[Figure: Graph of a normal distribution showing 68% of the data within one standard deviation of the mean]

A justification for this result can be found in Normal Distributions.

A great many physical phenomena that allow variation will be approximately normally distributed, so this is a very useful result. When testing a given data set, all three statements of the Empirical Rule must be found reasonable to conclude that the Empirical Rule is satisfied. Doing this for the starter heights of the Eagles, we find

Data within one standard deviation of the mean are in the interval $[\mu-\sigma,\mu+\sigma] \approx [68.15, 75.85]$. Checking the actual data, $\dfrac35 = 60\%$ fall in this interval, which is somewhat less than 68%.
Within two standard deviations, the data will be in the interval $[\mu-2\sigma,\mu+2\sigma] \approx[64.3,79.7]$. Checking the actual data, we have $\dfrac55=100\%$ fall in the interval, rather more than 95%.
For three standard deviations, we use the interval $[\mu-3\sigma,\mu+3\sigma] \approx[60.45,83.55]$. Once again, 100% fall within this interval, which is reasonably close to 99.7%.

The percentages obtained from the starter heights of the Eagles are in the general neighborhood of the Empirical Rule values, but not really close. Based on those results, we should have doubts about whether the Empirical Rule applies. Graphing the data would suggest that we do have something similar to a bell-shaped curve, but not nearly as smooth a curve.
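The three checks above can be reproduced with a short loop. This is a sketch only; the dictionary `expected` simply records the three Empirical Rule percentages, and the heights are again treated as a population.

```python
from statistics import mean, pstdev

eagles = [66, 71, 72, 73, 78]  # heights in inches
mu = mean(eagles)
sigma = pstdev(eagles)

# Empirical Rule percentages for 1, 2, and 3 standard deviations.
expected = {1: 0.68, 2: 0.95, 3: 0.997}

observed = {}
for z in expected:
    low, high = mu - z * sigma, mu + z * sigma
    observed[z] = sum(low <= x <= high for x in eagles) / len(eagles)
    print(f"z = {z}: [{low:.2f}, {high:.2f}] "
          f"observed {observed[z]:.0%}, rule says about {expected[z]:.1%}")
```

With only five data values, each observed percentage can change only in steps of 20%, so a close match to 68% or 95% is not possible here.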

The Coefficient of Variation

Standard deviations are measured in the same unit as the data, since the conceptual idea is an average "distance" from the mean. But if you want to compare the spreads of two very different data sets, whose units or typical sizes differ, you need a measure that does not depend on the units. This can be done with the coefficient of variation, given by the formula $\dfrac{\sigma}{\mu}\times 100\%$, or the equivalent formula using sample data. The coefficient of variation is a ratio, and therefore unitless.

For example, suppose we know that teacher salaries in a local school district have $\mu=\$58,000$ and $\sigma=\$6,000$. We also know that a sample of human birth weights finds $\bar{x}=114\text{ ounces}$ and $s=21\text{ ounces}$. Computing the coefficients of variation, we find a ratio of $\dfrac{6000}{58000}\times 100\%\approx 10.34\%$ for the teacher salaries, and $\dfrac{21}{114}\times 100\%\approx 18.42\%$ for birth weights. In this case, the birth weights of humans show more spread, relative to their mean, than the salaries of the teachers.
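The two ratios can be computed with a one-line function. This is a minimal sketch, with the function name `coefficient_of_variation` chosen here for illustration.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean (unitless)."""
    return std_dev / mean * 100

salary_cv = coefficient_of_variation(6000, 58000)  # dollars / dollars
birth_cv = coefficient_of_variation(21, 114)       # ounces / ounces

print(f"teacher salaries: {salary_cv:.2f}%")
print(f"birth weights:    {birth_cv:.2f}%")
```

Because the units cancel in each ratio, the two percentages can be compared directly even though one data set is in dollars and the other in ounces.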