
Pearson's Goodness-of-Fit Test

Pearson's Goodness-of-Fit Test is a very common and useful test for several purposes. It can help determine whether a set of claimed proportions is consistent with observed data, or whether two categorical variables are independent.

The Formula

Pearson's Goodness-of-Fit Test uses the following test statistic.

$\chi^2 = \sum\limits_{i=1}^k \dfrac{(O_i - E_i)^2}{E_i}$

In this formula, $O_i$ is the observed count of items in category $i$, and $E_i$ is the expected count of items in category $i$. Since the binomial distribution forms the foundation of this test, the expected number of items in a category is determined by the expected value of a binomial random variable. That is,   $E_i = np_i$,   where $n$ is the number of observations, and $p_i$ is the probability of obtaining an observation in category $i$.   In order to use this test, each value of $E_i$ must be at least 5.
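As a quick illustration of this requirement (with hypothetical numbers), a sample of $n = 200$ observations and a category probability of $p_i = 0.02$ gives

\begin{equation} E_i = np_i = 200 \times 0.02 = 4 < 5 \end{equation}

so the test should not be applied as-is; the usual remedies are to combine that category with a neighboring one, or to collect more data.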

The value $k$ is the number of categories, and as the notation indicates, the test statistic follows (approximately) a chi-square distribution. When the test is used to check several proportions simultaneously, the number of degrees of freedom is   $k - 1$.   When it is used to test whether two categorical variables are independent, the raw data is typically arranged in $r$ rows and $c$ columns, with   $k = rc$.   In this case, the number of degrees of freedom is   $(r-1)(c-1)$.

Pearson's Goodness-of-Fit Test is always a right-tailed test. A value of   $\chi^2 = 0$,   at the extreme left end of the distribution, would be equivalent to a perfect fit.

The formula given above is the one traditionally quoted, yet a slightly easier formula exists for computational purposes. It produces exactly the same value for the test statistic, and is derived as follows:

\begin{equation} \sum\limits_{i=1}^k \dfrac{(O_i - E_i)^2}{E_i} = \sum\limits_{i=1}^k \left( \dfrac{O_i^2}{E_i} - 2O_i + E_i \right) = \left( \sum\limits_{i=1}^k \dfrac{O_i^2}{E_i} \right) - 2n + n = \left( \sum\limits_{i=1}^k \dfrac{O_i^2}{E_i} \right) - n \end{equation}

Here, the middle step uses the facts that $\sum_{i=1}^k O_i = n$ and $\sum_{i=1}^k E_i = n$.

If you have to do the computations "by hand", this result is easier to use. The benefit of the traditional formula is that it identifies which categories are furthest "out of line", through their larger contributions to the value of the test statistic.
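For those who would rather let a computer do the arithmetic, here is a minimal Python sketch (using the data from the example in the next section) confirming that the two formulas agree:

# Verify that the traditional and computational formulas give the same statistic.
observed = [148, 112, 89, 36]           # O_i: observed counts
probs = [0.32, 0.28, 0.26, 0.14]        # p_i: claimed proportions
n = sum(observed)                       # 385 observations in total
expected = [n * p for p in probs]       # E_i = n * p_i

traditional = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
shortcut = sum(o ** 2 / e for o, e in zip(observed, expected)) - n

print(traditional)  # approximately 12.331
print(shortcut)     # identical value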

Testing Several Proportions

A local college claims that 32% of its graduates are in humanities, 28% in liberal arts, 26% in the biological sciences, and 14% in the physical sciences. A random sample of 385 students finds 148 humanities graduates, 112 liberal arts graduates, 89 biological science graduates, and 36 physical science graduates. Perform a hypothesis test on the college's claimed proportions.

We are told that   $n = 385$, $p_1 = 0.32$, $p_2 = 0.28$, $p_3 = 0.26$, and $p_4 = 0.14$.   From these we can compute the expected values and the quantities $\dfrac{(O_i-E_i)^2}{E_i}$, all of which we display in the following table.

Department                     Humanities   Liberal Arts   Biological Sciences   Physical Sciences
Observed                       148          112            89                    36
Expected                       123.2        107.8          100.1                 53.9
$\dfrac{(O_i-E_i)^2}{E_i}$     4.992        0.164          1.231                 5.945

All of the expected values are greater than 5, so the hypothesis test proceeds as follows.
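Assuming the conventional significance level $\alpha = 0.05$ (the problem does not specify one), the hypotheses, stated in words, are $H_0$: the college's claimed proportions are correct, versus $H_a$: at least one of the claimed proportions is incorrect. The test statistic is the sum of the entries in the last row of the table:

\begin{equation} \chi^2 = 4.992 + 0.164 + 1.231 + 5.945 = 12.332 \end{equation}

With $k = 4$ categories there are $k - 1 = 3$ degrees of freedom, and the right-tailed critical value is $\chi^2_{0.05} = 7.815$. Since $12.332 > 7.815$, we reject $H_0$ and conclude that the data are inconsistent with the college's claimed proportions.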

As you probably noted in the example above, we did not write out all of the proportions being claimed in our hypothesis test. They were certainly present in the original problem, and we could have included them in our table. Standard practice, however, is to state the hypotheses in words, as a long list of proportions is typically not very informative. For the mathematically inclined, the null hypothesis could have been written as:

\begin{equation} H_0: p_1 = 0.32, p_2 = 0.28, p_3 = 0.26, p_4 = 0.14 \end{equation}
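The same test is available in SciPy as scipy.stats.chisquare; a minimal sketch (the printed values are rounded):

from scipy.stats import chisquare

observed = [148, 112, 89, 36]
expected = [385 * p for p in (0.32, 0.28, 0.26, 0.14)]  # E_i = n * p_i

result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic)  # approximately 12.33
print(result.pvalue)     # approximately 0.006, so we reject H_0 at the 0.05 level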

Testing Independence

A car manufacturer wants to know whether customer preferences for vehicle style and color are independent. The manufacturer randomly samples its sales from the past year, and observes the following results.

           Silver   Black   White   Red   Totals
Sedan        21       28      16     23       88
Minivan      17       15      19     18       69
Truck        13       22      18     20       73
Totals       51       65      53     61      230

We have not been provided with expected counts, but we can determine them from the values in the table. The formula is $E_{ij} = np_iq_j$, where $n$ is the total number of observations, $p_i$ is the probability of obtaining an observation in row $i$, and $q_j$ is the probability of obtaining an observation in column $j$. Computing the expected count for a silver sedan, we would have:

\begin{equation} np_1 q_1 = 230 \left(\dfrac{88}{230}\right) \left(\dfrac{51}{230}\right) = \dfrac{88 \times 51}{230} = 19.513 \end{equation}

Notice the cancellation that occurred, which allows us to skip the computation of each marginal probability and compute with only the observed counts: each expected count is the product of the corresponding row and column totals, divided by the grand total. The completed table of expected values follows.

           Silver   Black    White    Red
Sedan      19.513   24.870   20.278   23.339
Minivan    15.300   19.500   15.900   18.300
Truck      16.187   20.630   16.822   19.361
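This shortcut is easy to automate. Here is a minimal Python sketch (using NumPy and the observed counts above) that builds the entire expected table with an outer product:

import numpy as np

# Observed counts: rows are Sedan, Minivan, Truck; columns are Silver, Black, White, Red.
observed = np.array([[21, 28, 16, 23],
                     [17, 15, 19, 18],
                     [13, 22, 18, 20]])

row_totals = observed.sum(axis=1)   # [88, 69, 73]
col_totals = observed.sum(axis=0)   # [51, 65, 53, 61]
n = observed.sum()                  # 230

# E_ij = (row total) * (column total) / (grand total), thanks to the cancellation
expected = np.outer(row_totals, col_totals) / n
print(expected.round(3))            # first entry is 19.513, matching the silver sedan value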

We note that all of the expected values are greater than 5. Using both the observed and expected tables, we compute the quantity $\dfrac{(O_{ij}-E_{ij})^2}{E_{ij}}$ for each cell. The results are in the following table.

           Silver   Black    White    Red
Sedan      0.1133   0.3939   0.9025   0.0049
Minivan    0.1889   1.0385   0.6044   0.0049
Truck      0.6275   0.0910   0.0825   0.0211

The hypothesis test proceeds as follows:
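Assuming again the conventional $\alpha = 0.05$ (the problem does not specify a significance level), the hypotheses are $H_0$: vehicle style and color are independent, versus $H_a$: vehicle style and color are not independent. Summing all twelve entries of the table gives the test statistic

\begin{equation} \chi^2 = 0.1133 + 0.3939 + \cdots + 0.0825 + 0.0211 \approx 4.073 \end{equation}

With $r = 3$ rows and $c = 4$ columns, there are $(r-1)(c-1) = 6$ degrees of freedom, and the right-tailed critical value is $\chi^2_{0.05} = 12.592$. Since $4.073 < 12.592$, we fail to reject $H_0$; the data are consistent with style and color being independent.

In SciPy, the entire independence test is one call to scipy.stats.chi2_contingency (a minimal sketch; the printed values are rounded):

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[21, 28, 16, 23],
                     [17, 15, 19, 18],
                     [13, 22, 18, 20]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(chi2_stat)  # approximately 4.07
print(dof)        # 6
print(p_value)    # approximately 0.67, so we fail to reject independence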

Mathematical Justifications

We begin with the binomial case, where there are just two categories. For this situation,   $p_1 + p_2 = 1$   and   $x_1 + x_2 = n$   (where we write $x_i$ for the observed count $O_i$), so we have the following result.

\begin{align} \sum\limits_{i=1}^2 \dfrac{(O_i - E_i)^2}{E_i} &= \dfrac{(x_1 - np_1)^2}{np_1} + \dfrac{(x_2 - np_2)^2}{np_2} \\ &= \dfrac{(x_1 - np_1)^2}{np_1} + \dfrac{(n - x_1 - n(1 - p_1))^2}{n(1-p_1)} \\ &= \dfrac{(x_1 - np_1)^2}{np_1} \dfrac{(1-p_1)}{(1-p_1)} + \dfrac{(x_1 - np_1)^2}{n(1-p_1)} \dfrac{p_1}{p_1} \\ &= \dfrac{(x_1 - np_1)^2}{np_1(1-p_1)} \\ &= \left( \dfrac{x_1 - \mu}{\sigma} \right)^2 \\ &= z_1^2 \end{align}

Remembering that binomial distributions are approximately normal when   $np \ge 5$   and   $n(1-p) \ge 5$,   and that if $X$ is normally distributed then $Z = \dfrac{X - \mu}{\sigma}$ is standard normal, so that $Z^2$ has a chi-square distribution with one degree of freedom, we see that the Pearson Goodness-of-Fit statistic for two categories has approximately a chi-square distribution with one degree of freedom.
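A quick numeric check of this identity may be reassuring (the numbers here are hypothetical): with $n = 100$, $p_1 = 0.25$, and an observed count of $x_1 = 30$,

\begin{equation} \dfrac{(30 - 25)^2}{25} + \dfrac{(70 - 75)^2}{75} = 1 + \dfrac{1}{3} = \dfrac{4}{3} \qquad \text{and} \qquad z_1^2 = \left( \dfrac{30 - 25}{\sqrt{100(0.25)(0.75)}} \right)^2 = \dfrac{25}{18.75} = \dfrac{4}{3} \end{equation}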

Next, we consider the equiprobable case, where each of the $k$ categories has probability   $p = \dfrac{1}{k}$.   The argument proceeds as follows:

\begin{align} \sum\limits_{i=1}^k \dfrac{(O_i - E_i)^2}{E_i} &= \sum\limits_{i=1}^k \dfrac{(x_i - np_i)^2}{np_i} \dfrac{(1-p_i)}{(1-p_i)} \\ &= \sum\limits_{i=1}^k (1 - p_i) \left( \dfrac{x_i - np_i}{\sqrt{np_i(1-p_i)}} \right)^2 \\ &= \sum\limits_{i=1}^k \left( 1 - \dfrac{1}{k} \right) \left( \dfrac{x_i - \mu}{\sigma} \right)^2 \\ &= \dfrac{k-1}{k} \sum\limits_{i=1}^k z^2 \end{align}

Now the $k$ terms of $z^2$ are identically distributed (so no subscript is required), but unfortunately only   $k-1$   of them are independent. The fraction $\dfrac{k-1}{k}$ effectively changes the sum of $k$ terms into a sum of   $k-1$   terms. Thanks to the additivity of chi-square distributions, together with the assumption needed to approximate each binomial with a normal distribution, the final expression has approximately a chi-square distribution with   $k-1$   degrees of freedom.

In the independence test, if each row has equal probability and each column has equal probability, then the argument is quite similar. Let us assume that   $p_i = \dfrac{1}{r}$   is the probability of each row, and   $q_j = \dfrac{1}{c}$   is the probability of each column. The expected value for each individual cell is   $E_{ij} = np_i q_j$,   while the expected value for a row is   $E_i = np_i$.   We then have:

\begin{align} \sum\limits_{j=1}^c \sum\limits_{i=1}^r \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}} &= \sum\limits_{j=1}^c \sum\limits_{i=1}^r \dfrac{(x_{ij} - np_iq_j)^2}{np_i q_j} \dfrac{(1-p_i)(1-q_j)}{(1-p_i)(1-q_j)} \\ &= \sum\limits_{j=1}^c \sum\limits_{i=1}^r (1-p_i)(1-q_j) \left( \dfrac{x_{ij} - np_iq_j}{\sqrt{np_i(1-p_i)q_j(1-q_j)}} \right)^2 \\ &= \sum\limits_{j=1}^c \sum\limits_{i=1}^r \left(1 - \dfrac{1}{r} \right) \left(1 - \dfrac{1}{c} \right) \left( \dfrac{x_{ij} - \mu}{\sigma} \right)^2 \\ &= \dfrac{(r-1)(c-1)}{rc} \sum\limits_{j=1}^c \sum\limits_{i=1}^r z^2 \end{align}

As with the previous case, the sum involves $rc$ terms of $z^2$, which are identically distributed but not all independent. The fraction $\dfrac{(r-1)(c-1)}{rc}$ effectively leaves just $(r-1)(c-1)$ terms, which is the number of independent variables. Thus, assuming the expected cell frequencies are large enough to approximate each binomial with a normal distribution, the final expression has approximately a chi-square distribution with $(r-1)(c-1)$ degrees of freedom.

The general case is much more difficult. Roughly speaking, the test statistic approaches the $\chi^2$ distribution asymptotically as the number of categories and the number of observations both approach infinity, with their ratio approaching a constant. As in the previous cases, and under the same assumptions, the test statistic will have approximately a chi-square distribution (although the approximation may be rougher than in the previous cases). It is this general case that is most frequently encountered in applications.
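Although the general proof is beyond the scope of this page, the quality of the approximation is easy to examine empirically. The following Python sketch (the sample size and proportions are arbitrary choices) simulates many multinomial samples under the null hypothesis and compares the simulated right-tail probabilities of the Pearson statistic with those of the chi-square distribution with $k-1$ degrees of freedom:

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 500                                      # arbitrary sample size
probs = np.array([0.32, 0.28, 0.26, 0.14])   # arbitrary null proportions
expected = n * probs
k = len(probs)

# Draw many multinomial samples and compute the Pearson statistic for each.
samples = rng.multinomial(n, probs, size=100_000)
stats = ((samples - expected) ** 2 / expected).sum(axis=1)

# Compare simulated tail probabilities against chi-square(k - 1) tails.
for cutoff in (3.0, 7.815, 11.345):
    print(cutoff, (stats > cutoff).mean(), chi2.sf(cutoff, df=k - 1))
# The simulated and theoretical probabilities should agree closely.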