Hypergeometric Distributions

Many statistical experiments involve a fixed number of selections without replacement from a finite population. If the random variable $X$ represents the number of "successes", then $X$ has a hypergeometric distribution.

The Formulas

In a hypergeometric distribution, if $N$ is the population size, $n$ is the number of trials, $s$ is the number of successes available in the population, and $x$ is the number of successes obtained in the trials, then the following formulas apply.

\begin{align} P(x) &= \dfrac{ \left( {}_s C_x \right) \left( {}_{N-s} C_{n-x} \right)}{{}_N C_n } \\ E(X) &= n\left( \dfrac{s}{N} \right) \\ Var(X) &= n\left( \dfrac{s}{N} \right) \left(1- \dfrac{s}{N} \right)\left(\dfrac{N-n}{N-1}\right) \end{align}

Each of the factors in the formulas above can be conceptually interpreted. In the probability formula, the factors in the numerator are the number of ways to obtain $x$ successes and the number of ways to obtain $(n-x)$ failures. The denominator is the number of ways any result could occur. The expected value formula is very similar to the binomial result $E(X)=np$, in that the factor $\dfrac{s}{N}$ is the probability that the first trial will result in a success. Similarly, in the variance formula, the first three factors are equivalent to the factors for the variance of a binomial distribution. The fourth factor is often called a correction factor, due to the fact that the hypergeometric is sampling without replacement from a finite population.

A moment generating function does exist for the hypergeometric distribution. However, it is described in terms of a special function known as a hypergeometric function, so we will not be using it to determine the moments of the function.

Drawing Cards from the Deck

In the game of bridge, a player receives 13 of the 52 cards from the deck. What is the probability that three of the cards are aces? How many aces should we expect, and what is the standard deviation for the number of aces?

Since we are interested in aces, then a success is an ace. The population of cards is finite, with $N=52$. The number of selections is fixed, with $n=13$, and they are without replacement. The number of aces available to select is $s=4$. These are the conditions of a hypergeometric distribution.

To determine the probability that three cards are aces, we use $x=3$. We find $P(x) = \dfrac{ \left( {}_{4} C_{3} \right) \left( {}_{48} C_{10} \right)}{{}_{52} C_{13} } \approx 0.0412$. The expected value is given by $E(X)= 13\left(\dfrac{4}{52}\right) = 1$ ace. The standard deviation is $\sigma = \sqrt{ 13\left(\dfrac{4}{52}\right)\left(\dfrac{48}{52}\right) \left(\dfrac{39}{51}\right)} \approx 0.8402$ aces.

Sampling from a Small Population

Suppose that in a population of 500 individuals, 150 favor the death penalty and 350 do not. If 22 people are randomly selected, what is the probability that exactly 7 favor the death penalty? How many should we expect to favor the death penalty, and what is the standard deviation?

In this problem, a success is an individual who favors the death penalty. The population is finite, with $N=500$. The number of selections is fixed, with $n=22$, and they are without replacement. The number of individuals who favor the death penalty is $s = 150$. Thus, the conditions of the hypergeometric distribution have been satisfied.

To obtain 7 people who favor the death penalty, we want $x=7$. We then have $P(x) = \dfrac{ \left( {}_{150} C_7 \right) \left( {}_{350} C_{15} \right)}{{}_{500} C_{22} } \approx 0.1811$. The expected value is $E(X)= 22\left(\dfrac{150}{500}\right) = 6.6$ individuals who favor the death penalty, and the standard deviation is $\sigma = \sqrt{ 22\left(\dfrac{150}{500}\right)\left(\dfrac{350}{500}\right) \left(\dfrac{478}{499}\right)} \approx 2.1037$ people.

Derivation of the Formulas

The probability formula is generated by a counting argument. The number of ways to obtain $x$ successes when $s$ are available is ${}_s C_x$. The number of ways to obtain $(n-x)$ failures when $(N-s)$ are available is ${}_{N-s} C_{n-x}$. Their product is the number of ways to achieve exactly $x$ successes in $n$ trials. The total number of ways to obtain any result with $n$ trials in a population of size $N$ is ${}_N C_n$. Therefore, the probability of exactly $x$ successes is given by $P(X=x) = \dfrac{ \left( {}_s C_x \right) \left( {}_{N-s} C_{n-x} \right)}{{}_N C_n }$.

Once again, we will have need of the binomial theorem, $(x+y)^n = \sum\limits_{k=0}^n \left({}_n C_k\right) x^k y^{n-k}$. But before proving the formula for the expected value, we need the following lemma: ${}_{m+n} C_r = \sum\limits_{t=0}^r \left({}_m C_{r-t} \right)\left({}_n C_t \right)$. The argument begins with an equality by the rules of exponents, uses the binomial theorem, and then rewrites the quantities into a double sum.

\begin{align} (x+y)^{m+n} &= (x+y)^m (x+y)^n \\ \sum\limits_{r=0}^{m+n} \left({}_{m+n} C_r \right) x^r y^{m+n-r} &= \left[ \sum\limits_{s=0}^m \left({}_m C_s \right) x^s y^{m-s} \right] \left[ \sum\limits_{t=0}^n \left({}_n C_t \right) x^t y^{n-t} \right] \\ \sum\limits_{r=0}^{m+n} \left({}_{m+n} C_r \right) x^r y^{m+n-r} &= \sum\limits_{s=0}^m \sum\limits_{t=0}^n \left({}_m C_s \right)\left({}_n C_t \right) x^{s+t} y^{m+n-s-t} \end{align}

Now we want to find the like terms. From the exponent on the variable $x$, we must have $r=s+t$, and we note that on both sides of the equation, we will have $0 \leq r \leq m+n$. Furthermore, $t=r-s$, and since $s$ is nonnegative, then $t \leq r$. Therefore, we can rewrite the right-hand side of the equation, then equate coefficients.

\begin{align} \sum\limits_{r=0}^{m+n} \left({}_{m+n} C_r \right) x^r y^{m+n-r} &= \sum\limits_{r=0}^{m+n} \left[ \sum\limits_{t=0}^r \left({}_m C_{r-t} \right) \left({}_n C_t \right) \right] x^r y^{m+n-r} \\ {}_{m+n} C_r &= \sum\limits_{t=0}^r \left({}_m C_{r-t} \right) \left({}_n C_t \right) \end{align}

The formula for the expected value of a hypergeometric distribution arises from the generic formula for the expected value.

\begin{align} E(X) &= \sum\limits_{x=0}^n x P(x) \\ &= \sum\limits_{x=0}^n \dfrac{ x \left( {}_s C_x \right) \left( {}_{N-s} C_{n-x} \right)}{{}_N C_n } \\ &= \sum\limits_{x=0}^n \left({}_{N-s} C_{n-x} \right) \dfrac{x s!}{x! (s-x)!} \dfrac{n! (N-n)!}{N!} \\ &= \sum\limits_{x=1}^n \left({}_{N-s} C_{n-x} \right) \dfrac{s (s-1)!}{(x-1)! (s-x)!} \dfrac{n(n-1)!(N-n)!}{N(N-1)!} \\ &= \dfrac{n s}{N} \sum\limits_{x=1}^n \dfrac{ \left({}_{N-s} C_{n-x} \right) \left({}_{s-1} C_{x-1} \right) }{{}_{N-1} C_{n-1}} \\ &= \dfrac{n s}{N} \dfrac{1}{{}_{N-1} C_{n-1}} \sum\limits_{x=1}^n \left({}_{N-s} C_{n-x} \right) \left({}_{s-1} C_{x-1} \right) \\ &= \dfrac{n s}{N} \dfrac{1}{{}_{N-1} C_{n-1}} \sum\limits_{x=0}^{n-1} \left({}_{N-s} C_{n-1-x} \right) \left({}_{s-1} C_x \right) \\ &= \dfrac{n s}{N} \dfrac{{}_{N-1} C_{n-1}}{{}_{N-1} C_{n-1}} \\ &= \dfrac{n s}{N} \end{align}

Before we can produce the variance formula, we first need a formula for $E(X^2)$.

\begin{align} E(X^2) &= \sum\limits_{x=0}^n x^2 P(x) \\ &= \sum\limits_{x=0}^n [ x(x-1) + x] P(x) \\ &= \sum\limits_{x=0}^n x(x-1)P(x) + \sum\limits_{x=0}^n xP(x) \\ &= \sum\limits_{x=0}^n x(x-1) \dfrac{ \left( {}_s C_x \right) \left( {}_{N-s} C_{n-x} \right)}{{}_N C_n } + \mu \\ &= \sum\limits_{x=2}^n x(x-1) \dfrac{ s(s-1)(s-2)!}{x(x-1)(x-2)!(s-x)!} \left( {}_{N-s} C_{n-x} \right) \dfrac{n(n-1)(n-2)!(N-n)!}{N(N-1)(N-2)!} + \mu \\ &= \sum\limits_{x=2}^n \dfrac{s(s-1)n(n-1)}{N(N-1)} \dfrac{ \left( {}_{s-2} C_{x-2} \right) \left( {}_{N-s} C_{n-x} \right) }{ {}_{N-2} C_{n-2}} + \mu \\ &= \dfrac{s(s-1)n(n-1)}{N(N-1)} \dfrac{1}{{}_{N-2} C_{n-2}} \sum\limits_{x=2}^n \left( {}_{s-2} C_{x-2} \right) \left( {}_{N-s} C_{n-x} \right) + \mu \\ &= \dfrac{s(s-1)n(n-1)}{N(N-1)} \dfrac{1}{{}_{N-2} C_{n-2}} \sum\limits_{x=0}^{n-2} \left( {}_{s-2} C_x \right) \left( {}_{N-s} C_{n-2-x} \right) + \mu \\ &= \dfrac{s(s-1)n(n-1)}{N(N-1)} \dfrac{{}_{N-2} C_{n-2}}{{}_{N-2} C_{n-2}} + \mu \\ &= \dfrac{s(s-1)n(n-1)}{N(N-1)} + \dfrac{n s}{N} \end{align}

Now we can deal with the variance formula.

\begin{align} Var(X) &= E(X^2)-(E(X))^2 \\ &= \dfrac{s(s-1)n(n-1)}{N(N-1)} + \dfrac{n s}{N} - \dfrac{n^2 s^2}{N^2} \\ &= \dfrac{n s}{N} \left[ \dfrac{(s-1)(n-1)}{N-1} + 1 - \dfrac{n s}{N} \right] \\ &= \dfrac{n s}{N} \left[\dfrac{ N(s-1)(n-1)+N(N-1) -ns(N-1)}{N(N-1)} \right] \\ &= \dfrac{n s}{N} \left[\dfrac{ N^2 - Nn - Ns + ns}{N(N-1)} \right] \\ &= \dfrac{n s}{N} \dfrac{ (N-s)(N-n)}{N(N-1} \\ &= \dfrac{n s}{N} \left( 1-\dfrac{s}{N}\right) \left(\dfrac{N-n}{N-1}\right) \end{align}

And this result implies that the standard deviation of a hypergeometric distribution is given by $\sigma = \sqrt{ \dfrac{n s}{N} \left( 1-\dfrac{s}{N}\right) \left(\dfrac{N-n}{N-1}\right)}$.

Approximating with a Binomial Distribution

The hypergeometric distribution describes the probabilities when sampling without replacement. When items are not replaced, the probability of a success will change at each trial, and the trials are not independent. However, as the population size increases without bound, the hypergeometric distribution converges to a binomial distribution, for which the probabilities are constant, and the trials are independent. Therefore, when we sample from a very large population, we frequently assume that the conditions of a binomial distribution are met, rather than doing the more difficult computations provided by the hypergeometric distribution.

The mathematical argument to justify the approximation of the hypergeometric distribution with the binomial distribution proceeds as follows.

\begin{align} P(X=x) &= \dfrac{ \left( {}_s C_x \right) \left( {}_{N-s} C_{n-x} \right)}{{}_N C_n } \\ &= \dfrac{s!}{x! (s-x)!} \dfrac{(N-s)!}{(n-x)! (N-s-n+x)!} \dfrac{n! (N-n)!}{N!} \\ &= \dfrac{n!}{x! (n-x)!} \dfrac{s!}{(s-x)!} \dfrac{(N-s)!}{(N-s-n+x)!} \dfrac{(N-n)!}{N!} \\ &= \left( {}_n C_x \right) \dfrac{ [s(s-1) \cdots (s-x+1)] [(N-s)(N-s-1) \cdots (N-s-n+x+1)]} {N(N-1) \cdots (N-n+1)} \\ &= \dfrac{ \left( {}_n C_x \right) \left[ \left(\dfrac{s}{N}\right) \left(\dfrac{s-1}{N}\right) \cdots \left(\dfrac{s-x+1}{N}\right) \right] \left[ \left(\dfrac{N-s}{N}\right) \left(\dfrac{N-s-1}{N}\right) \cdots \left(\dfrac{N-s-n+x+1}{N}\right) \right] }{ \left(\dfrac{N}{N}\right) \left(\dfrac{N-1}{N}\right) \cdots \left(\dfrac{N-n+1}{N}\right) } \\ &= \dfrac{ \left( {}_n C_x \right) \left[ p \left(p-\dfrac{1}{N}\right) \cdots \left(p-\dfrac{x-1}{N}\right) \right] \left[ (1-p) \left(1-p-\dfrac{1}{N}\right) \cdots \left(1-p-\dfrac{n-x-1}{N}\right) \right] }{1 \left(1-\dfrac{1}{N}\right) \cdots \left(1-\dfrac{n-1}{N}\right) } \end{align}

In the last line above, we set $p=\dfrac{s}{N}$, so that the probability of a success on the first trial of the hypergeometric process is equivalent to the probability of a success in the binomial process. Now, with both the number of trials and the number of successes being fixed (that is, with $n$ and $x$ held fixed), we can consider what happens as the population size $N$ approaches infinity.

\begin{align} \lim_{N \rightarrow \infty} P(X=x) &= \lim_{N \rightarrow \infty} \dfrac{ \left( {}_n C_x \right) \left[ p \left(p-\dfrac{1}{N}\right) \cdots \left(p-\dfrac{x-1}{N}\right) \right] \left[ (1-p) \left(1-p-\dfrac{1}{N}\right) \cdots \left(1-p-\dfrac{n-x-1}{N}\right) \right] } {1 \left(1-\dfrac{1}{N}\right) \cdots \left(1-\dfrac{n-1}{N}\right) } \\ &= \dfrac{ \left( {}_n C_x \right) p^x (1-p)^{n-x} }{ 1^n} \\ &= \left( {}_n C_x \right) p^x (1-p)^{n-x} \end{align}