Confidence Intervals for Percentiles and Medians

Certain assumptions were required in order to be able to determine a confidence interval for a mean. In particular, we needed to have either a large sample size, or know that the original population was normal. If neither of these is true, we cannot produce a confidence interval for a mean. But we can still produce a confidence interval for a median (the 50th percentile), or for any other percentile.

The Theory

Suppose we have a population whose distribution is completely unknown. Nevertheless, each value in that population will be related to a particular percentile. That is, value $x$ occurs at the $100p$-th percentile (denoted $P_{100p}$) whenever a proportion $p$ of the data (that is, $100p\%$) falls below $x$. Of course, that also means that a proportion $1-p$ of the data falls above $x$.

Let us suppose we have a sample of size $n$, with data values $\{x_1, x_2, x_3, \ldots, x_n\}$. We can consider each item in the sample as a random variable, so we have a collection of $n$ random variables, $\{X_1, X_2, X_3, \ldots, X_n\}$. It would be highly unlikely that the sample values happened to be in ascending order. Therefore, we also consider the order statistics of the sample, whose values are the random variables of the ordered data, and we shall denote these random variables as $\{Y_1, Y_2, Y_3, \ldots, Y_n\}$. We will need to single out the intervals between these values, so we identify the $k$-th order statistic, $Y_k$, and its successor, $Y_{k+1}$. These values occur in the midst of the set of random variables, so that we have $\{Y_1, Y_2, Y_3, \ldots, Y_k, Y_{k+1}, \ldots, Y_n\}$.

Let us now consider the probability that the $100p$-th percentile, $P_{100p}$, falls in the interval between $Y_k$ and $Y_{k+1}$. That is, we want the probability $P(Y_k \le P_{100p} \le Y_{k+1})$. This event occurs exactly when $k$ of the $n$ sample values fall below the percentile. We observe that there are two possible results for each sample value: either it will fall below the $100p$-th percentile, or it will not. The probability that any particular sample value falls below the $100p$-th percentile is simply $p$, and is fixed. The number of trials is fixed at the sample size $n$. And if the values of the sample are randomly selected from the population, then we can assume that the sample values are independent. These are the four conditions for the number of sample values below the percentile to have a binomial distribution. In other words, we have the following formula.

$P(Y_k \le P_{100p} \le Y_{k+1}) = ({}_n C_k) (p^k) (1-p)^{n-k}$

The probability given by this formula will be the confidence level that the $100p$-th percentile falls between the $k$-th and $(k+1)$-th values in the sample of data. Since most binomial probabilities are quite small, this would give us very little confidence, so we generally need a larger interval. The following formula will give the confidence level that the $100p$-th percentile falls between the $j$-th and $k$-th values of the ordered data.

$P(Y_j \le P_{100p} \le Y_k) = \sum\limits_{i=j}^{k-1} ({}_n C_i) (p^i) (1-p)^{n-i}$

Note that $i=0$ is a possible value in the binomial formula, and this is equivalent to the percentile falling below the first value in the ordered sample data. Similarly, the value $i=n$ in the binomial formula produces the probability that the percentile falls above the last value in the ordered data.
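The summation can be sketched in a few lines of Python. This is only an illustration: the function names `coverage` and `simulated_coverage` are hypothetical, and the simulation assumes an Exponential(1) population, chosen arbitrarily because its true median, $\ln 2$, is known. Since the theory makes no assumption about the population, any other distribution should give a similar result.

```python
import random
from math import comb, log

def coverage(n, p, j, k):
    """Confidence that the 100p-th percentile lies between Y_j and Y_k.

    Sums the binomial terms i = j, ..., k-1. Taking j = 0 or k = n + 1
    includes the unbounded regions below Y_1 or above Y_n.
    """
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(j, k))

def simulated_coverage(n, j, k, trials=20000, seed=0):
    """Monte Carlo check for the median (p = 0.5), with 1 <= j < k <= n.

    Assumption: the population is Exponential(1), whose median is ln 2.
    """
    rng = random.Random(seed)
    true_median = log(2)
    hits = 0
    for _ in range(trials):
        y = sorted(rng.expovariate(1.0) for _ in range(n))
        # y[j - 1] is the j-th ordered value, y[k - 1] the k-th.
        if y[j - 1] <= true_median <= y[k - 1]:
            hits += 1
    return hits / trials

print(coverage(9, 0.5, 3, 7))        # exact value from the formula
print(simulated_coverage(9, 3, 7))   # should land close to the exact value
```

Summing over every term from $i=0$ to $i=n$, as in `coverage(n, p, 0, n + 1)`, returns 1, which reflects the fact that the percentile must fall somewhere relative to the ordered sample.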

Example

Authorities are examining fatalities from a recent flu epidemic, and have a sample of 9 people whose ages (in years) were

24, 38, 61, 22, 16, 57, 31, 29, 35

Find 80% confidence intervals for both the median and the third quartile of age at the time of death from this epidemic.

The ordered data is

16, 22, 24, 29, 31, 35, 38, 57, 61

For the median, we will use $p=0.5$, and for the third quartile we will use $p=0.75$. We obtain the two binomial distributions for these percentiles.

Median, $p = 0.5$

$X = i$	$P(X = i)$
0	0.0020
1	0.0176
2	0.0703
3	0.1641
4	0.2461
5	0.2461
6	0.1641
7	0.0703
8	0.0176
9	0.0020

Third Quartile, $p = 0.75$

$X = i$	$P(X = i)$
0	0.000004
1	0.0001
2	0.0012
3	0.0087
4	0.0389
5	0.1168
6	0.2336
7	0.3003
8	0.2253
9	0.0751
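The two tables above can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the helper name `binomial_table` is hypothetical.

```python
from math import comb

def binomial_table(n, p):
    """Probabilities P(X = i) for i = 0..n, where X counts the
    sample values that fall below the percentile."""
    return [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

for label, p in (("Median", 0.5), ("Third Quartile", 0.75)):
    print(f"{label}, p = {p}")
    for i, prob in enumerate(binomial_table(9, p)):
        print(f"{i}\t{prob:.4f}")
```

With four decimal places the first entry of the third quartile table prints as zero, since its value is roughly 0.000004.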

To build an 80% confidence interval, we need to choose probabilities whose sum will exceed 80%. In order to have as much precision in our interval as possible, we shall include the largest probabilities first, so as to have as few subintervals as possible.

For the median, we choose the values 3 through 6 for $i$, giving us the probability $0.1641 + 0.2461 + 0.2461 + 0.1641 = 0.8204$. Looking back at the ordered data, our first subinterval begins with the third value, 24. The last subinterval begins with the 6th value and ends at the 7th value, 38. So the 80% confidence interval for the median is $[24,38]$. We can be 80% confident that the median age at death from the epidemic was between 24 and 38 years. (Actually, we can be 82% confident with this interval.)

For the third quartile, we choose the values 5 through 8 for $i$, giving the probability $0.1168 + 0.2336 + 0.3003 + 0.2253 = 0.8760$. From the ordered data, we see that this interval will begin at the fifth value, 31, and end at the ninth value, 61. So the 80% confidence interval for the third quartile is $[31,61]$. We can be 80% confident that the third quartile age at death from the epidemic was between 31 and 61 years. (And we can actually be nearly 88% confident with this interval.)
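The selection procedure used in this example (largest probabilities first, keeping the chosen subintervals adjacent) can be sketched as a greedy loop. The function name `percentile_interval` is hypothetical, and the tie-breaking rule (extend to the right when both neighbors are equally likely) is an assumption of this sketch rather than part of the method described above.

```python
from math import comb

def percentile_interval(data, p, level):
    """Return (lower, upper, confidence) for the 100p-th percentile.

    Subinterval i runs from the i-th to the (i+1)-th ordered value.
    Only i = 1..n-1 are bounded, so only those are considered.
    """
    y = sorted(data)
    n = len(y)
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    # Start from the most probable bounded subinterval.
    lo = hi = max(range(1, n), key=lambda i: pmf[i])
    total = pmf[lo]
    while total < level:
        left = pmf[lo - 1] if lo - 1 >= 1 else -1.0
        right = pmf[hi + 1] if hi + 1 <= n - 1 else -1.0
        if left < 0 and right < 0:
            raise ValueError("sample too small for this confidence level")
        if left > right:
            lo -= 1
            total += left
        else:
            hi += 1
            total += right
    # Subinterval lo begins at the lo-th value; subinterval hi ends
    # at the (hi+1)-th value (0-based indices lo - 1 and hi).
    return y[lo - 1], y[hi], total

ages = [24, 38, 61, 22, 16, 57, 31, 29, 35]
print(percentile_interval(ages, 0.5, 0.80))
print(percentile_interval(ages, 0.75, 0.80))
```

For the ages in this example, the sketch selects the same subintervals as the hand calculation: 3 through 6 for the median and 5 through 8 for the third quartile.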

Approximating with the Normal Distribution

Since the number of sample values falling below a percentile follows a binomial distribution, and a binomial distribution is approximately normal when the sample is large enough, there are times when the normal distribution can be used to find the confidence interval for a median or other percentile. These correspond to the situations when $np \ge 5$ and $n(1-p) \ge 5$.

Suppose we want a 90% confidence interval for the median credit card balance of an American consumer. We sample 19 individuals from the American population, and obtain the following balances (in dollars).

295, 3147, 283, 569, 1141, 788, 1255, 2038, 978, 548
1133, 1641, 959, 816, 955, 1473, 702, 459, 1844

The ordered data is

283, 295, 459, 548, 569, 702, 788, 816, 955, 959
978, 1133, 1141, 1255, 1473, 1641, 1844, 2038, 3147

For the median, we have a binomial distribution with $p=0.5$. This data has sample size $n = 19$. Since $np = n(1-p) = 19(0.5) = 9.5$, we can approximate the binomial distribution with a normal distribution, with $\mu = np = 9.5$ and $\sigma = \sqrt{np(1-p)} = \sqrt{19(0.5)(0.5)} \approx 2.18$. These values are the mean and standard deviation of the subinterval positions in the ordered data.

A 90% confidence interval will have $\alpha = 0.10$, so $z_{\alpha/2} = z_{0.05} = 1.645$. Then our confidence interval is given by $\mu \pm z_{\alpha/2} \sigma = 9.5 \pm (1.645)(2.18) = 9.5 \pm 3.6$, which is the interval $[5.9, 13.1]$. Since the subinterval positions are discrete and the normal distribution is continuous, we apply a continuity correction and enlarge this interval so that its endpoints fall at the next half-integers, $[5.5,13.5]$. This interval corresponds to the discrete subinterval positions from the sixth to the thirteenth. But the thirteenth subinterval actually ends at the fourteenth data value, so our confidence interval will run from the sixth to the fourteenth data values. In terms of the actual (ordered) data, we can be 90% confident that the median credit card balance falls between 702 and 1255 dollars.
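The normal approximation can be sketched in the same way. The function name `normal_approx_interval` is hypothetical, and the rounding used for the continuity correction (each subinterval position $a$ is treated as covering $a - 0.5$ to $a + 0.5$) is one reading of the step described above. The sketch uses `statistics.NormalDist` from the Python standard library to obtain the critical value.

```python
from math import ceil, floor, sqrt
from statistics import NormalDist

def normal_approx_interval(data, p, level):
    """Approximate confidence interval for the 100p-th percentile."""
    y = sorted(data)
    n = len(y)
    if n * p < 5 or n * (1 - p) < 5:
        raise ValueError("normal approximation not appropriate")
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower, upper = mu - z * sigma, mu + z * sigma
    # Continuity correction: widen to the enclosing half-integers,
    # then read off the first and last subinterval positions.
    first = floor(lower + 0.5)
    last = ceil(upper - 0.5)
    if first < 1 or last > n - 1:
        raise ValueError("interval extends beyond the sample")
    # Position `first` begins at the first-th ordered value; position
    # `last` ends at the (last+1)-th (0-based indices first - 1 and last).
    return y[first - 1], y[last]

balances = [295, 3147, 283, 569, 1141, 788, 1255, 2038, 978, 548,
            1133, 1641, 959, 816, 955, 1473, 702, 459, 1844]
print(normal_approx_interval(balances, 0.5, 0.90))
```

For the 19 balances in this example, the sketch selects the sixth and fourteenth ordered values, in agreement with the hand calculation.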