Geometric Distributions

Instead of counting the number of successes, we can also count the number of trials until a success is obtained. That is, we shall let the random variable $X$ represent the number of trials needed to obtain the first success. In this situation, the number of trials will not be fixed. But if the trials are still independent, only two outcomes are available for each trial, and the probability of a success is still constant, then the random variable will have a geometric distribution.

The Formulas

In a geometric distribution, if $p$ is the probability of a success (with $0 < p \le 1$), and $x = 1, 2, 3, \ldots$ is the number of trials needed to obtain the first success, then the following formulas apply.

\begin{align} P(x) &= p (1-p)^{x-1} \\ M(t) &= p (e^{-t} - 1 + p)^{-1} \\ E(X) &= \dfrac{1}{p} \\ Var(X) &= \dfrac{1-p}{p^2} \end{align}
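As a rough illustration, the four formulas can be written as small Python helpers. The function names are invented for this sketch and do not come from any library. The check on $t$ in the moment generating function is an added assumption that follows from the geometric series used in the derivation: the sum is finite only when $e^t(1-p) < 1$.

```python
import math


def geometric_pmf(x: int, p: float) -> float:
    """P(X = x): probability that the first success occurs on trial x."""
    if x < 1:
        return 0.0  # at least one trial is always needed
    return p * (1 - p) ** (x - 1)


def geometric_mgf(t: float, p: float) -> float:
    """M(t) = p / (e^{-t} - 1 + p), finite only while e^t (1 - p) < 1."""
    if math.exp(t) * (1 - p) >= 1:
        raise ValueError("M(t) is finite only for t < -ln(1 - p)")
    return p / (math.exp(-t) - 1 + p)


def geometric_mean(p: float) -> float:
    """E(X) = 1 / p."""
    return 1 / p


def geometric_variance(p: float) -> float:
    """Var(X) = (1 - p) / p^2."""
    return (1 - p) / p ** 2
```

For example, `geometric_pmf(3, 0.5)` evaluates $0.5 \times 0.5^2 = 0.125$, and `geometric_mean(0.25)` returns 4 trials.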

Repeatedly Rolling a Die

What is the probability that a four will first appear on the fifth roll? How many rolls should we expect to need to obtain the first four, and what is the standard deviation for the number of rolls?

Since we are interested in fours, a success is rolling a four. There are two outcomes on each roll, namely "four" and "not a four". The probability of a success is $p=\dfrac16$, and is constant from roll to roll. The trials are independent, and we are interested in the number of rolls until the first success. Therefore all of the conditions for using the geometric distribution have been met.

To determine the probability that five rolls will be needed to obtain the first four, we use $x=5$. This gives $P(X=5) = \dfrac16 \times \left( \dfrac56 \right)^4 = \dfrac{625}{7776} \approx 0.0804$. The expected value is $E(X) = \dfrac{1}{1/6} = 6$ rolls, and the standard deviation is $\sigma = \sqrt{ \dfrac {5/6}{(1/6)^2}} = \sqrt{30} \approx 5.48$ rolls.
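The arithmetic above can be reproduced exactly with Python's standard `fractions` module. This is only a sketch for checking the numbers; the variable names are chosen for this illustration.

```python
from fractions import Fraction
import math

p = Fraction(1, 6)  # probability of rolling a four on any single roll

# First four on the fifth roll: four failures followed by one success.
prob_first_four_on_fifth = p * (1 - p) ** 4
expected_rolls = 1 / p
variance = (1 - p) / p ** 2
std_dev = math.sqrt(variance)

print(prob_first_four_on_fifth)  # 625/7776
print(expected_rolls)            # 6
print(round(std_dev, 2))         # 5.48
```

Working with `Fraction` keeps the probability as the exact value $\dfrac{625}{7776}$ rather than a rounded decimal.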

Sampling from a Very Large Population

Approximately 44% of all Americans have blood type O. What is the probability that exactly 3 people need to be sampled in order to find someone who has blood type O? How many people should we expect to need to sample to find someone having that blood type, and what is the standard deviation?

In this problem, a success is an individual with blood type O, and the other outcome is an individual who does not have blood type O. The probability of a success is $p=0.44$. We will actually be sampling without replacement, but since the population of the USA is millions of times greater than the size of the sample, we can assume the probabilities are essentially constant. As long as we randomly sample from the entire population, and not from a small group that makes it likely that we would choose relatives, we can assume the trials are independent. Also, we are interested in the number of people sampled until the first success. Therefore, the conditions for using the geometric distribution are approximately met (due to the large size of the population compared to the sample).

If we need 3 people in order to find someone with blood type O, we want $x=3$. We then have $P(X=3) = 0.44 (0.56)^2 \approx 0.1380$. The expected value is $E(X) = \dfrac{1}{0.44} \approx 2.27$ people to find one with blood type O, and the standard deviation is $\sigma = \sqrt{ \dfrac{0.56}{0.44^2}} \approx 1.70$ people to sample.
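One way to sanity-check these values is a small simulation. The sketch below assumes every sampled person independently has blood type O with probability 0.44, which is the large-population approximation discussed above rather than true sampling without replacement. The function name, seed, and number of repetitions are arbitrary choices for this illustration.

```python
import random
import statistics


def trials_until_success(p: float, rng: random.Random) -> int:
    """Count independent trials up to and including the first success."""
    count = 1
    while rng.random() >= p:  # a failure occurs with probability 1 - p
        count += 1
    return count


rng = random.Random(2024)  # arbitrary seed, used only for reproducibility
p = 0.44
samples = [trials_until_success(p, rng) for _ in range(100_000)]

share_needing_three = samples.count(3) / len(samples)
sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)

print(share_needing_three, sample_mean, sample_std)
```

With this many repetitions, the three printed values should land close to 0.1380, 2.27, and 1.70 respectively, although they will not match exactly.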

Derivation of the Formulas

In the following derivation, we will make use of the sum of a geometric series formula from college algebra. It says $\sum\limits_{n=0}^{\infty} r^n = \dfrac{1}{1-r}$, as long as the ratio satisfies the inequality $-1 < r < 1$.

Now imagine a scenario where $x$ trials are needed to obtain the first success. We shall identify the successes by S, and the failures by F. There is only one possible arrangement for this situation, namely FF...FS. That is, the one success must come last, and has probability $p$. Each failure has probability $(1-p)$, and there are $(x-1)$ failures, so by independence the probability of the failures is $(1-p)^{x-1}$. The probability that $x$ trials are needed for the first success is the product of these two factors, so we have $P(x)=p (1-p)^{x-1}$.
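As a quick check (an added step, not part of the original derivation), these probabilities sum to one, so $P(x)$ is a valid probability distribution whenever $0 < p \le 1$. Substituting $n = x-1$ and using the geometric series formula with $r = 1-p$ gives

\begin{equation} \sum\limits_{x=1}^{\infty} p (1-p)^{x-1} = p \sum\limits_{n=0}^{\infty} (1-p)^n = \dfrac{p}{1-(1-p)} = 1 \end{equation}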

The moment generating function $M(t)$ is found by evaluating $E(e^{tX})$.

\begin{align} M(t) &= E(e^{tX}) = \sum\limits_{x=1}^{\infty} e^{tx} p (1-p)^{x-1} \\ &= \sum\limits_{x=1}^{\infty} \dfrac{p}{1-p} \left[ e^t (1-p) \right]^x \end{align}

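On the right-hand side of the equation, we note that a geometric series has arisen. The series has first term $pe^t$ and ratio $e^t(1-p)$, and it converges as long as $e^t(1-p) < 1$, that is, for $t < -\ln(1-p)$. Since the sum of a convergent geometric series equals its first term divided by one minus its ratio, we obtain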

\begin{equation} M(t) = \dfrac{pe^t}{1-e^t(1-p)} = p (e^{-t} - 1 + p)^{-1} \end{equation}
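The closed form can be checked numerically against a long partial sum of the series. The sketch below uses the die example's $p = \dfrac16$ and an arbitrarily chosen $t = 0.1$, which satisfies $e^t(1-p) < 1$ so that the series converges. The function names and the number of terms are assumptions made for this illustration.

```python
import math


def mgf_closed_form(t: float, p: float) -> float:
    """Closed form M(t) = p / (e^{-t} - 1 + p)."""
    return p / (math.exp(-t) - 1 + p)


def mgf_partial_sum(t: float, p: float, terms: int = 2000) -> float:
    """Partial sum of (p / (1 - p)) * r^x for x = 1..terms, with r = e^t (1 - p)."""
    r = math.exp(t) * (1 - p)
    return sum((p / (1 - p)) * r ** x for x in range(1, terms + 1))


p = 1 / 6
t = 0.1  # must satisfy t < -ln(1 - p), roughly 0.182 for p = 1/6

closed = mgf_closed_form(t, p)
partial = mgf_partial_sum(t, p)
print(closed, partial)
```

The two printed values should agree to many decimal places, since the neglected tail of the series is vanishingly small after 2000 terms.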

The expected value, $E(X)$, can be found from the first derivative of the moment generating function.

\begin{align} M'(t) &= -p (e^{-t} - 1 + p)^{-2} (-e^{-t}) = (e^{-t} - 1 + p)^{-2} pe^{-t} \\ E(X) &= M'(0) = p^{-2} p = \dfrac1p \end{align}

The formulas for $E(X^2)$ and $Var(X)$ are obtained from the second derivative of $M(t)$.

\begin{align} M''(t) &= -2 (e^{-t} - 1 + p)^{-3} (-e^{-t}) pe^{-t} + (e^{-t} - 1 + p)^{-2} (-pe^{-t}) \\ E(X^2) &= M''(0) = -2 p^{-3} (-p) + p^{-2} (-p) = 2p^{-2} - p^{-1} = \dfrac{2-p}{p^2} \\ Var(X) &= E(X^2) - (E(X))^2 = \dfrac{2-p}{p^2} - \left( \dfrac1p \right)^2 = \dfrac{1-p}{p^2} \end{align}

This result implies that the standard deviation of a geometric distribution is given by $\sigma = \dfrac{ \sqrt{1-p}}{p}$.
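The derivative calculations can also be checked numerically by approximating $M'(0)$ and $M''(0)$ with central finite differences. This is only a sanity check under assumed values ($p = \dfrac16$ from the die example and a step size $h$ chosen for this sketch), not a substitute for the derivation.

```python
import math


def mgf(t: float, p: float) -> float:
    """M(t) = p / (e^{-t} - 1 + p), valid for t < -ln(1 - p)."""
    return p / (math.exp(-t) - 1 + p)


p = 1 / 6
h = 1e-4  # small step; must stay well inside the region where M(t) is finite

# Central differences approximate the first and second derivatives at t = 0.
first_moment = (mgf(h, p) - mgf(-h, p)) / (2 * h)
second_moment = (mgf(h, p) - 2 * mgf(0.0, p) + mgf(-h, p)) / h ** 2
variance = second_moment - first_moment ** 2
std_dev = math.sqrt(variance)

print(first_moment, second_moment, variance, std_dev)
```

For $p = \dfrac16$ the formulas give $E(X) = 6$, $E(X^2) = \dfrac{2-p}{p^2} = 66$, $Var(X) = 30$, and $\sigma = \sqrt{30}$, and the finite-difference estimates should be close to these values.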