Approximately Normal Distributions with Discrete Data

If a random variable is actually discrete, but is being approximated by a continuous distribution, a continuity correction is needed. This is to more closely match the areas of bars in a discrete distribution with the areas under the curve of a continuous distribution.

Continuity Corrections

Count data take only integer values, while continuous data can take any real value. In a continuous distribution, probabilities are given by areas under the curve, not by the height of the curve, so the probability of any single value is zero. To find probabilities for discrete data, we therefore use the halfway points between consecutive discrete values as the boundaries of the bar that represents each value. For example, if the discrete values include   $\{\ldots, 75, 76, 77, \ldots\}$,   then the bar for 76 runs from 75.5 to 76.5, and we approximate   $P(X = 76)$   by computing   $P(75.5 < X < 76.5)$   instead.
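The three most common corrections (exactly, at most, at least) can be sketched in Python. This is only a minimal illustration: the function names are hypothetical, the data are assumed to be integer-valued, and `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist

def prob_exactly(k, mean, sd):
    """Approximate P(X = k) by the area under the curve from k - 0.5 to k + 0.5."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(k + 0.5) - dist.cdf(k - 0.5)

def prob_at_most(k, mean, sd):
    """Approximate P(X <= k): include the whole bar for k, so stop at k + 0.5."""
    return NormalDist(mu=mean, sigma=sd).cdf(k + 0.5)

def prob_at_least(k, mean, sd):
    """Approximate P(X >= k): include the whole bar for k, so start at k - 0.5."""
    return 1.0 - NormalDist(mu=mean, sigma=sd).cdf(k - 0.5)
```

Strict inequalities can be handled by shifting the integer first, since   $P(X < k) = P(X \le k-1)$   and   $P(X > k) = P(X \ge k+1)$   for integer-valued data.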

Example: The number of bags lost by a small airline per week is approximately normally distributed with a mean of 427 bags and a standard deviation of 35 bags. What is the probability that they lose exactly 430 bags next week?

Since the airline cannot lose a fraction of a bag, the random variable is discrete. The bar for 430 in the histogram of the probability distribution would have its left and right endpoints at 429.5 and 430.5. Therefore, we find z-scores for these two x-values. They are   $z = \dfrac{429.5-427}{35} \approx 0.0714$   and   $z = \dfrac{430.5-427}{35} = 0.10$.   Then

\begin{align} P(x = 430) &= P(429.5 < x < 430.5) = P(0.0714 < z < 0.10) \\ &= \Phi(0.10) - \Phi(0.0714) = \operatorname{normalcdf}(0.0714,0.10) \approx 0.0114 \end{align}

In other words, there is a 1.14% probability that the airline would lose exactly 430 bags next week.

Example: For the same airline, what is the probability that they lose at most 420 bags next week?

The discrete event "at most 420" is approximated with the continuous event "420.5 or less", as both include the random variable value 420, but not 421. Therefore,   $z = \dfrac{420.5-427}{35} \approx -0.1857$.   Then we have

\begin{align} P(x \le 420) &= P(x < 420.5) = P(z < -0.1857) \\ &= \Phi(-0.1857) = \operatorname{normalcdf}(-\infty,-0.1857) \approx 0.4263 \end{align}

In other words, there is a 42.63% probability that they will have at most 420 lost bags.
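Both airline calculations can be checked numerically. The sketch below assumes only the Python standard library and writes   $\Phi$   in terms of the error function; the helper name `phi` is an illustrative choice, and the z-scores are left unrounded, so the results should match the worked values up to rounding.

```python
import math

def phi(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

MEAN, SD = 427.0, 35.0  # values given in the airline example

# P(X = 430): area between 429.5 and 430.5
z_low = (429.5 - MEAN) / SD
z_high = (430.5 - MEAN) / SD
p_exactly_430 = phi(z_high) - phi(z_low)

# P(X <= 420): area to the left of 420.5
z_upper = (420.5 - MEAN) / SD
p_at_most_420 = phi(z_upper)

print(f"P(X = 430)  ~ {p_exactly_430:.4f}")
print(f"P(X <= 420) ~ {p_at_most_420:.4f}")
```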

Using the Normal Distribution to Approximate a Binomial Distribution

Binomial distributions are considered approximately normal when the expected number of successes and the expected number of failures are both at least 5. Formulaically, this translates into the requirement that   $np \ge 5$   and   $n(1-p) \ge 5$.
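As a quick illustration, the rule of thumb can be written as a small check. The function name and the `threshold` parameter are hypothetical conveniences; some textbooks use a stricter cutoff such as 10.

```python
def normal_approx_reasonable(n, p, threshold=5):
    """Rule of thumb: expected successes and expected failures both >= threshold."""
    expected_successes = n * p
    expected_failures = n * (1 - p)
    return expected_successes >= threshold and expected_failures >= threshold

print(normal_approx_reasonable(100, 0.25))  # 25 and 75 expected
print(normal_approx_reasonable(20, 0.1))    # only 2 expected successes
```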

Example: On every one of the 100 questions on a multiple choice test, Tom randomly chooses one of the four answer options. What can Tom expect for the number of correct answers? To answer this, we use the expected value formula for a binomial distribution. Therefore,   $E(X) = np = 100 \left( \dfrac14 \right) = 25$.   He can expect 25 correct answers on average.

Example: What is the standard deviation for the number of correct answers on Tom's test? From the variance formula for a binomial distribution, we have   $\operatorname{Var}(X) = np(1-p) = 100 \left( \dfrac14 \right) \left(1 - \dfrac14 \right) = \dfrac{75}{4}$.   Therefore, the standard deviation is   $\sigma = \sqrt{\dfrac{75}{4}} \approx 4.33$   correct answers.
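The two binomial formulas used so far translate directly into code. This is a minimal sketch with a hypothetical helper name, using the values from Tom's test.

```python
import math

def binomial_mean_sd(n, p):
    """Return the mean np and standard deviation sqrt(np(1-p)) of a binomial."""
    mean = n * p
    variance = n * p * (1 - p)
    return mean, math.sqrt(variance)

mean, sd = binomial_mean_sd(100, 0.25)
print(f"mean = {mean}, sd ~ {sd:.2f}")
```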

Example: What is the probability that Tom would obtain at least 35 correct answers? We first note that Tom is expecting 25 correct answers, and 75 incorrect answers. Both of these values are at least 5, and therefore the normal distribution will provide a reasonable approximation to this binomial problem. Since the number of correct answers is discrete, a continuity correction is needed, and we use   $x = 34.5$.   Thus   $z = \dfrac{34.5 - 25}{4.33} \approx 2.19$,   and

\begin{align} P(x \ge 35) &= P(x > 34.5) = P(z > 2.19) \\ &= 1 - \Phi(2.19) = \operatorname{normalcdf}(2.19,\infty) \approx 0.0143 \end{align}

Tom has a 1.43% chance of obtaining at least 35 correct answers on his test.
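To get a feel for the quality of the approximation, the normal result can be compared with the exact binomial tail. This is a sketch under the assumption that Python 3.8 or later is available (for `math.comb` and `statistics.NormalDist`); because the z-score is not rounded here, the approximate value may differ slightly in the last digit from the one above.

```python
import math
from statistics import NormalDist

n, p = 100, 0.25
mean = n * p
sd = math.sqrt(n * p * (1 - p))

# Normal approximation with continuity correction: P(X >= 35) ~ P(X > 34.5)
approx = 1.0 - NormalDist(mu=mean, sigma=sd).cdf(34.5)

# Exact binomial tail: sum of P(X = k) for k = 35, ..., 100
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(35, n + 1))

print(f"normal approximation: {approx:.4f}")
print(f"exact binomial:       {exact:.4f}")
```

The two values should agree to within a few tenths of a percentage point, which is the sense in which the approximation is "reasonable" here.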

Example: What is the probability that Tom would obtain a passing score of at least 60 correct answers? This is very similar to the previous question. We use   $x = 59.5$,   which gives   $z = \dfrac{59.5 - 25}{4.33} \approx 7.97$.   We compute

\begin{align} P(x \ge 60) &= P(x > 59.5) = P(z > 7.97) \\ &= 1 - \Phi(7.97) = \operatorname{normalcdf}(7.97,\infty) \approx 8 \times 10^{-16} \end{align}

With a probability of less than $10^{-15}$, Tom has almost no chance of passing the test.
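A tail this extreme is awkward to compute in floating point, because   $1 - \Phi(7.97)$   subtracts two nearly equal numbers. The sketch below, which assumes only the Python standard library, uses the complementary error function instead and also sums the exact binomial tail for comparison. So far from the mean the normal approximation is rough: the exact tail comes out on the order of   $10^{-13}$,   larger than the approximation, although the practical conclusion is unchanged.

```python
import math

n, p = 100, 0.25
mean = n * p
sd = math.sqrt(n * p * (1 - p))

# Continuity-corrected z-score for "at least 60"
z = (59.5 - mean) / sd

# Upper tail via erfc; computing 1 - cdf(z) would lose precision this far out
approx = 0.5 * math.erfc(z / math.sqrt(2.0))

# Exact binomial tail: sum of P(X = k) for k = 60, ..., 100
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(60, n + 1))

print(f"normal approximation: {approx:.2e}")
print(f"exact binomial:       {exact:.2e}")
```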

Using the Normal Distribution to Approximate a Poisson Distribution

Poisson distributions are considered to be approximately normal when the expected rate of successes per unit time is at least 10. Since the expected value formula for a Poisson distribution is   $E(X) = \lambda$,   the requirement is that   $\lambda \ge 10$.

Example: Cars arrive at the drive-up window of a local fast food restaurant at the rate of 40 per hour. Assume the arrivals are independent of one another, so that the Poisson distribution applies. What are the mean and standard deviation of the number of arrivals per hour? Using the Poisson distribution formulas, we have   $E(X) = \lambda = 40$   and   $\sigma = \sqrt{\lambda} \approx 6.32$. We can expect 40 cars per hour to arrive on average, with a standard deviation of 6.32 cars.

Example: At the same drive-up window, what is the probability that exactly 25 cars will arrive in the next hour? Since   $\lambda \ge 10$,   we can use the normal distribution as an approximation. The required continuity correction gives x-values of 24.5 and 25.5, which produce z-scores of   $z = \dfrac{24.5-40}{6.32} \approx -2.45$   and   $z = \dfrac{25.5-40}{6.32} \approx -2.29$.   Therefore

\begin{align} P(x = 25) &= P(24.5 < x < 25.5) = P(-2.45 < z < -2.29) \\ &= \Phi(-2.29) - \Phi(-2.45) = \operatorname{normalcdf}(-2.45,-2.29) \approx 0.0039 \end{align}

There is a 0.39% probability that exactly 25 cars would arrive in the next hour.
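The same comparison can be made for the Poisson example. The sketch below assumes Python 3.8 or later and evaluates both the continuity-corrected normal approximation and the exact Poisson probability   $e^{-\lambda}\lambda^{k}/k!$;   the z-scores are not rounded, so the approximate value may differ slightly in the last digit from the one above.

```python
import math
from statistics import NormalDist

lam = 40  # arrival rate per hour from the drive-up example
k = 25

# Normal approximation with mean lambda and standard deviation sqrt(lambda)
dist = NormalDist(mu=lam, sigma=math.sqrt(lam))
approx = dist.cdf(k + 0.5) - dist.cdf(k - 0.5)

# Exact Poisson probability P(X = k)
exact = math.exp(-lam) * lam**k / math.factorial(k)

print(f"normal approximation: {approx:.4f}")
print(f"exact Poisson:        {exact:.4f}")
```

The exact value is somewhat smaller than the approximation, which is expected this far into the lower tail of a right-skewed distribution.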