Measures of Position

A measure of position identifies where a particular data value lies within a given data set. As with other types of measures, there is more than one approach to defining such a measure.

Standard Scores

The standard score (often called the z-score) of a particular data value gives the position of that data value by determining its distance from the mean, in units of standard deviation. Therefore, the formula for the z-score when population data is involved is $z=\dfrac{x-\mu}{\sigma}$, and when sample data is involved is $z=\dfrac{x-\bar{x}}{s}$.

For example, suppose the grades on an economics exam averaged 78%, with a standard deviation of 5 percentage points. If Tom earned 69% on that exam, then his z-score was $z=\dfrac{69-78}{5} = \dfrac{-9}{5}=-1.8$. That is, he scored 1.8 standard deviations below the mean score.
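The arithmetic in this example is easy to reproduce in code. The following is a minimal Python sketch; the helper name `z_score` is hypothetical (it is not part of any library), and the figures are those of Tom's exam.

```python
def z_score(x, mean, sd):
    """Return the number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Figures from the economics exam example: mean 78, standard deviation 5.
tom = z_score(69, 78, 5)
print(tom)  # -1.8, i.e. 1.8 standard deviations below the mean
```

A negative result means the value lies below the mean, and a positive result means it lies above.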

To compute a standard score, only the mean and standard deviation are required. However, since both of those quantities do depend on every value in the data set, a small change in one data value will change every z-score.
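To see this sensitivity concretely, the sketch below recomputes the z-scores after changing a single value. The five scores are invented for illustration, and `z_scores` is a hypothetical helper built on Python's standard `statistics` module, assuming the sample standard deviation is wanted.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return the sample z-score of every value in the list."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(x - x_bar) / s for x in values]

# Hypothetical exam scores, invented for this illustration.
before = [70, 75, 78, 80, 87]
after = [70, 75, 78, 80, 88]  # only the last score changed, by one point

# Print each z-score before and after the change, side by side.
for old, new in zip(z_scores(before), z_scores(after)):
    print(f"{old:+.4f} -> {new:+.4f}")
```

Every line of the output shows a different value on each side of the arrow, because both the mean and the standard deviation shifted.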

You might notice that Chebyshev's Theorem also used the variable $z$. In fact, $z$ in that theorem has the same meaning as a standard score. When Chebyshev's Theorem states that at least 75% of the data fall within 2 standard deviations of the mean, a score at the upper end of the interval $[\bar{x}-2s, \bar{x}+2s]$ has a standard score of $z=2$.

Nationally normed tests typically use a standardization process of this type, which is why they are called standardized tests. College entrance exams are one example.

Quantiles

Percentiles, quartiles, and other quantiles are frequently used in statistics. A quantile is the (possibly hypothetical) data value below which a specified fraction of the data falls. For example, the 62nd percentile is the data value below which 62% of the data fall, and the third quartile is the data value below which $\dfrac34$ of the data fall.

Although quantiles have a fairly simple conceptual interpretation, computation is another matter entirely. Many different rules and formulas are in use, and none of them has become the overwhelming standard. If every data set were infinite and every variable continuous, then all of the rules and formulas in use would give the same result. Finite data sets and discrete variables raise issues that usually do not arise in the infinite continuous case. Should a percentile give the percentage of data "at or below" rather than just "below"? Should the zeroth and hundredth percentiles be defined or undefined? In small data sets, how should an intermediate location be selected (by rounding, or by interpolation)? Should the first quartile be equal to the 25th percentile?

The situation for quartiles is especially confusing. Here are several approaches.

The quartile $Q_i$ is located at position $\dfrac{i}{4}(n+1)$ of an ordered data set, and linear interpolation is used when the position is not an integer. This definition can be easily expanded to other quantiles, and leaves $Q_0$ and $Q_4$ undefined. Minitab software uses this approach.
The quartile $Q_i$ is located at position $\dfrac{i}{4}(n-1)+1$ of an ordered data set, and linear interpolation is used when the position is not an integer. This definition has a slightly messier formula, but is still easily expanded to other quantiles, and both $Q_0$ and $Q_4$ are defined (as the minimum and maximum, respectively). Microsoft Excel uses this approach.
The quartile $Q_i$ is located at position $\dfrac{i}{4}(n)+\dfrac12$ of an ordered data set, and linear interpolation is used when the position is not an integer. This definition can be easily expanded to other quantiles, and leaves $Q_0$ and $Q_4$ undefined. The results of this approach will be halfway between those provided by the previous two definitions. Mathematica software uses this approach. Some authors have adopted a similar approach, but sometimes using various rounding rules rather than linear interpolation.
Use the median to divide the ordered data set into two halves, omitting the median from each half if it is a data value. Each quartile is located at the median of its half of the data set. This approach is not easily expanded to other quantiles (leaving open the possibility of inconsistencies between quartiles and percentiles), and leaves $Q_0$ and $Q_4$ undefined, but in a small data set it is quite easy to use. The line of Texas Instruments graphing calculators uses this approach, as do many authors of modern textbooks.
Use the median to divide the ordered data set into two halves, including the median in both halves if it is a data value. Each quartile is located at the median of its half of the data set. This approach is not easily expanded to other quantiles (leaving open the possibility of inconsistencies between quartiles and percentiles), and leaves $Q_0$ and $Q_4$ undefined, but in a small data set it is quite easy to use. Many authors of modern textbooks use this approach. Some authors have called these values hinges to distinguish them from quartiles.
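The five approaches above can be compared side by side in code. The following Python sketch is one possible reading of those descriptions, not the actual implementation used by any of the software packages named; the function names, the rule labels (`"n+1"`, `"n-1"`, `"n+1/2"`), and the nine-value data set are all hypothetical choices made for illustration.

```python
from statistics import median

def value_at(data, pos):
    """Value at a 1-based, possibly fractional position in sorted data,
    using linear interpolation between neighbouring values."""
    if pos < 1 or pos > len(data):
        raise ValueError("position falls outside the data set")
    k = int(pos)      # 1-based index of the value at or just below pos
    frac = pos - k    # how far pos sits toward the next value
    if frac == 0:
        return data[k - 1]
    return data[k - 1] + frac * (data[k] - data[k - 1])

def quartile_by_position(values, i, rule):
    """Quartile Q_i under one of the three position rules listed above."""
    data = sorted(values)
    n = len(data)
    positions = {
        "n+1": i / 4 * (n + 1),       # first approach
        "n-1": i / 4 * (n - 1) + 1,   # second approach
        "n+1/2": i / 4 * n + 0.5,     # third approach
    }
    return value_at(data, positions[rule])

def quartiles_by_halves(values, include_median):
    """(Q1, Q3) as the medians of the lower and upper halves of the data."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    # With an odd count the median is a data value; keep or drop it as requested.
    extra = 1 if (n % 2 == 1 and include_median) else 0
    lower = data[:half + extra]
    upper = data[n - half - extra:]
    return median(lower), median(upper)

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data, not from the text

for rule in ("n+1", "n-1", "n+1/2"):
    print(rule, quartile_by_position(sample, 1, rule))
# n+1 2.5
# n-1 3
# n+1/2 2.75
print(quartiles_by_halves(sample, include_median=False))  # (2.5, 7.5)
print(quartiles_by_halves(sample, include_median=True))   # (3, 7)
```

Even on this tiny data set the first quartile comes out as 2.5, 2.75, or 3 depending on the rule, and the third rule lands halfway between the first two. Requesting $Q_0$ or $Q_4$ under the first or third rule raises an error in this sketch, which mirrors their being undefined.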

Personally, I like the first approach. By using position $\dfrac{i}{4}(n+1)$ for the quartile $Q_i$, and position $\dfrac{i}{100}(n+1)$ for the percentile $P_i$, it is easy to observe the equality between $Q_1$ and $P_{25}$. Furthermore, $Q_0$ and $P_0$ give the same (undefined) result, as do $Q_4$ and $P_{100}$. This is in accord with the approach taken by publishers of standardized tests, whose extreme percentile rankings are $P_1$ and $P_{99}$ for the lowest and highest possible scores (which are usually achieved by a few individuals each year).

Suppose, for example, the daily high temperatures (in degrees Fahrenheit) in a two-week period in Topeka, Kansas were $\{57,60,62,65,66,71,75,78,82,83,85,88,91,96\}$. This set of data has 14 values.

The third quartile, $Q_3$, occurs at position $\dfrac34 (14+1) = \dfrac{45}{4}=11.25$. Interpolating between the eleventh and twelfth values, we have $x_{11} + 0.25 (x_{12}-x_{11}) = 85 + 0.25(88-85)=85.75$. Therefore, $Q_3 = 85.75$ degrees Fahrenheit is the third quartile.
The 17th percentile, $P_{17}$, occurs at position $\dfrac{17}{100}(14+1) = \dfrac{255}{100} = 2.55$. Interpolating between the second and third values, we have $x_2 + 0.55 (x_3-x_2) = 60 + 0.55(62-60) = 61.1$. Therefore, $P_{17} = 61.1$ degrees Fahrenheit is the 17th percentile.
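Both calculations follow the same recipe, so they can be checked with one small function. This is a minimal Python sketch of the $(n+1)$ position rule with linear interpolation; the name `quantile_n_plus_1` is hypothetical, and the quantile is passed as a fraction between 0 and 1.

```python
def quantile_n_plus_1(values, fraction):
    """Quantile located at 1-based position fraction * (n + 1),
    with linear interpolation between neighbouring values."""
    data = sorted(values)
    n = len(data)
    pos = fraction * (n + 1)
    if pos < 1 or pos > n:
        raise ValueError("this quantile is undefined under the (n + 1) rule")
    k = int(pos)      # index of the value at or just below the position
    frac = pos - k    # fractional part used for interpolation
    if k == n:        # position is exactly the last value
        return data[-1]
    return data[k - 1] + frac * (data[k] - data[k - 1])

# Topeka daily high temperatures from the example above.
temps = [57, 60, 62, 65, 66, 71, 75, 78, 82, 83, 85, 88, 91, 96]

q3 = quantile_n_plus_1(temps, 3 / 4)
p17 = quantile_n_plus_1(temps, 17 / 100)
print(round(q3, 2), round(p17, 2))  # 85.75 61.1
```

The results are rounded before printing only to hide floating-point noise in the interpolation.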

The Five Number Summary

Although the mean and standard deviation are often used to summarize a set of data, much information is lost in this approach, particularly related to the shape of the graph. The five-number summary is a method that keeps information about the average, the dispersion, and the shape. The five numbers are the minimum, the first quartile, the median, the third quartile, and the maximum. In the example of Topeka's daily high temperatures, the five-number summary gives: 57, 64.25, 76.5, 85.75, 96.
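The five-number summary can also be obtained from Python's standard library. The sketch below assumes Python 3.8 or later, where `statistics.quantiles` is available; its `method="exclusive"` option interpolates at positions $\dfrac{i}{4}(n+1)$, which matches the first approach to quartiles described earlier.

```python
from statistics import quantiles

# Topeka daily high temperatures from the example above.
temps = [57, 60, 62, 65, 66, 71, 75, 78, 82, 83, 85, 88, 91, 96]

# n=4 asks for the three cut points that split the data into quarters.
q1, med, q3 = quantiles(temps, n=4, method="exclusive")

summary = (min(temps), q1, med, q3, max(temps))
print(summary)  # (57, 64.25, 76.5, 85.75, 96)
```

Passing `method="inclusive"` instead would use positions $\dfrac{i}{4}(n-1)+1$, so the quartiles would generally differ.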

A boxplot (sometimes called a box-and-whisker diagram) is a graph that provides the five-number summary for a finite data set in pictorial form. A rectangle is drawn with its ends at $Q_1$ and $Q_3$. The box is divided into two parts at the median. Whiskers are attached to the box that extend to the maximum and minimum values. The boxplot for Topeka's daily high temperatures is pictured below.

Some authors and software will not use the maximum and minimum for the ends of the whiskers. Rather, they will use a percentile somewhat short of the extremes, leaving any data beyond that simply as outlying points. This approach is actually necessary for data having an infinite range.

Identifying Outliers

Outliers are data values that fall beyond the typical distribution of data. Although there are numerical approaches to determine whether a suspicious data value is an outlier, one should still consider whether there might be a cause that makes a suspicious data value clearly different from the rest of the data. Often we exclude outliers from further analysis, and it is useful to have a non-numerical reason as to why a value was excluded.

When standard scores are used, a common approach is to identify any data value with $|z| >3$ as an outlier. In a normal distribution, data values of this sort occur only about 0.3% of the time. Statistical process control often uses this definition to determine whether a process has veered out of control and needs adjusting.

For Topeka's daily high temperatures (in the above example), we find the mean is 75.64, and the (sample) standard deviation is 12.39. Three standard deviations from the mean gives $\bar{x}-3s=75.64-3(12.39)=38.47$ and $\bar{x}+3s=75.64+3(12.39)=112.81$. Since all of the data fell between those values, there were no outliers in the data.
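The same check can be run in a few lines of Python. This sketch uses the standard `statistics` module and the sample standard deviation; because it works from unrounded values of the mean and standard deviation, the limits it computes may differ in the last digit from the hand calculation above.

```python
from statistics import mean, stdev

# Topeka daily high temperatures from the example above.
temps = [57, 60, 62, 65, 66, 71, 75, 78, 82, 83, 85, 88, 91, 96]

x_bar = mean(temps)
s = stdev(temps)  # sample standard deviation (divides by n - 1)

# Limits at three standard deviations on either side of the mean.
low, high = x_bar - 3 * s, x_bar + 3 * s

# Equivalent to flagging every value with |z| > 3.
outliers = [t for t in temps if t < low or t > high]

print(round(x_bar, 2), round(s, 2))  # 75.64 12.39
print(outliers)                       # []
```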

When a boxplot is used, values that fall beyond locations called the fences are considered to be outliers. These fences depend upon a measure of dispersion called the interquartile range, or IQR. The necessary definitions of these quantities are:

Interquartile Range: $IQR = Q_3-Q_1$
Inner Fences: $Q_1 - 1.5 (IQR)$ and $Q_3 + 1.5 (IQR)$
Outer Fences: $Q_1 - 3 (IQR)$ and $Q_3 + 3 (IQR)$

Values that fall beyond the outer fences are considered to be serious outliers. Those that fall beyond the inner fences but within the outer fences are possible outliers.

In the previous example about Topeka's daily high temperatures, the interquartile range was $IQR=85.75-64.25=21.5$. Therefore, the inner fences are located at $64.25-1.5(21.5)=32$ and $85.75+1.5(21.5)=118$, and the outer fences are located at $64.25-3(21.5)=-0.25$ and $85.75+3(21.5)=150.25$. None of the data in the original set fell beyond the inner fences, so there were no outliers.
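The fence rules translate directly into code. The following Python sketch takes the quartiles $Q_1 = 64.25$ and $Q_3 = 85.75$ as given from the example; the function names `fences` and `classify` are hypothetical and chosen only for this illustration.

```python
def fences(q1, q3):
    """Return (inner, outer) fences as (lower, upper) pairs."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer

def classify(x, inner, outer):
    """Label a value according to which fences it falls beyond."""
    if x < outer[0] or x > outer[1]:
        return "serious outlier"
    if x < inner[0] or x > inner[1]:
        return "possible outlier"
    return "not an outlier"

# Topeka daily high temperatures and their quartiles from the example.
temps = [57, 60, 62, 65, 66, 71, 75, 78, 82, 83, 85, 88, 91, 96]
inner, outer = fences(64.25, 85.75)

print(inner)  # (32.0, 118.0)
print(outer)  # (-0.25, 150.25)

flagged = [t for t in temps if classify(t, inner, outer) != "not an outlier"]
print(flagged)  # []
```

A hypothetical reading of 125 degrees would be labeled a possible outlier by this sketch, since it lies beyond the upper inner fence but within the upper outer fence.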