Pearson's Goodness-of-Fit Test

Pearson's Goodness-of-Fit Test is a widely used test with two common applications. It can help determine whether a set of claimed proportions is plausible, or whether a pair of categorical variables is independent.

The Formula

Pearson's Goodness-of-Fit Test uses the following test statistic.

$\chi^2 = \sum\limits_{i=1}^k \dfrac{(O_i - E_i)^2}{E_i}$

In this formula, $O_i$ is a count of the number of observed items in category $i$, and $E_i$ is the number of expected items in category $i$. Since the binomial formula forms the foundation of this test, the expected number of items in a category is determined by the expected value of a binomial random variable. That is, $E_i = np_i$, where $n$ is the number of observations, and $p_i$ is the probability of obtaining an observation in category $i$. In order to use this test, each value of $E_i$ must be at least 5.
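
The computation just described can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `expected_counts` and `pearson_statistic`, and the die-rolling counts, are assumptions introduced only for this example.

```python
def expected_counts(n, probs):
    """Expected count E_i = n * p_i for each category."""
    return [n * p for p in probs]


def pearson_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    # The test requires every expected count to be at least 5.
    if any(e < 5 for e in expected):
        raise ValueError("every expected count must be at least 5")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: a die rolled 60 times, each face claimed to have p = 1/6.
observed = [8, 12, 9, 11, 6, 14]
expected = expected_counts(sum(observed), [1 / 6] * 6)
statistic = pearson_statistic(observed, expected)
print(round(statistic, 3))
```

Raising an error when an expected count falls below 5 is one possible design choice; a gentler version might only issue a warning.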

The value $k$ is the number of categories and, as the notation $\chi^2$ indicates, the test uses the chi-square distribution. When the test is used to check several proportions simultaneously, the number of degrees of freedom will be $k - 1$. When it is used to test whether two categorical variables are independent, the raw data is typically arranged in $r$ rows and $c$ columns, with $k = rc$. In this case, the number of degrees of freedom will be $(r-1)(c-1)$.

Pearson's Goodness-of-Fit Test is always a right-tailed test. A value of $\chi^2 = 0$, at the extreme left end of the distribution, would be equivalent to a perfect fit.
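
Because the test is right-tailed, the p-value is the area under the chi-square density to the right of the observed statistic. The sketch below computes that area using only Python's standard library. The helper name `chi2_right_tail` is an assumption made for this illustration, and it handles whole-number degrees of freedom only. It relies on the standard recurrence $Q(a+1, y) = Q(a, y) + y^a e^{-y} / \Gamma(a+1)$ for the regularized upper incomplete gamma function, starting from $Q(1, y) = e^{-y}$ or $Q(1/2, y) = \text{erfc}(\sqrt{y})$.

```python
import math


def chi2_right_tail(x, df):
    """P(X >= x) for a chi-square variable with a whole number of degrees of freedom."""
    if x < 0 or df < 1:
        raise ValueError("need x >= 0 and df >= 1")
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y)
    # Step a upward by 1 until it reaches df / 2.
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


print(chi2_right_tail(0.0, 3))      # a perfect fit leaves the whole area to the right
print(chi2_right_tail(3.841, 1))    # 3.841 is the familiar 5% critical value for 1 df
```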

The formula given above is the one traditionally quoted, yet a slightly easier formula exists for computational purposes. The easier formula produces exactly the same value for the test statistic. It is derived as follows, using the fact that the observed counts and the expected counts each sum to $n$:

\begin{equation} \sum\limits_{i=1}^k \dfrac{(O_i - E_i)^2}{E_i} = \sum\limits_{i=1}^k \left( \dfrac{O_i^2}{E_i} - 2O_i + E_i \right) = \left( \sum\limits_{i=1}^k \dfrac{O_i^2}{E_i} \right) - n \end{equation}

If you must do the computations "by hand", this form is easier. The benefit of the traditional formula is that it identifies which categories are furthest "out of line" by their larger contributions to the value of the test statistic.
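
The equivalence of the two forms can be checked numerically. The following sketch uses hypothetical counts (not taken from any example in this article), and the function names `chi2_traditional` and `chi2_shortcut` are assumptions for illustration. The two results agree up to floating-point rounding.

```python
def chi2_traditional(observed, expected):
    """Traditional form: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_shortcut(observed, expected):
    """Computational form: sum of O^2 / E, minus n."""
    n = sum(observed)
    return sum(o ** 2 / e for o, e in zip(observed, expected)) - n


# Hypothetical counts; the expected counts must also total n = 100.
observed = [18, 22, 30, 30]
expected = [20, 20, 30, 30]

# Per-category terms, which only the traditional form exposes.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]

print(chi2_traditional(observed, expected))
print(chi2_shortcut(observed, expected))
print(contributions)
```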

Testing Several Proportions

A local college claims that 32% of its graduates are in humanities, 28% in liberal arts, 26% in the biological sciences, and 14% in the physical sciences. A random sample of 385 students finds 148 humanities graduates, 112 liberal arts graduates, 89 biological science graduates, and 36 physical science graduates. Perform a hypothesis test on the college's claimed proportions.

We are told that $n = 385$, $p_1 = 0.32$, $p_2 = 0.28$, $p_3 = 0.26$, and $p_4 = 0.14$. From these we can compute the expected values, and the quantity $\dfrac{(O_i-E_i)^2}{E_i}$, all of which we display in the following table.

Department	Humanities	Liberal Arts	Biological Sciences	Physical Sciences
Observed	148	112	89	36
Expected	123.2	107.8	100.1	53.9
$\dfrac{(O_i-E_i)^2}{E_i}$	4.992	0.164	1.231	5.945

All of the expected values are greater than 5, so the hypothesis test proceeds as follows.

The hypotheses are:
$H_0: \text{The proportions are as claimed by the college.}$
$H_a: \text{At least one of the claimed proportions is incorrect.}$
We shall choose $\alpha = 0.05$.
The test statistic is
$\chi^2 = \sum\limits_{i=1}^k \dfrac{(O_i-E_i)^2}{E_i} = 4.992 + 0.164 + 1.231 + 5.945 = 12.332$
The test has $k-1 = 3$ degrees of freedom.
The p-value is $p = \chi^2\text{cdf} (12.332, \infty, 3) = 0.0063$.
Since $p < \alpha$, we reject $H_0$.
The evidence indicates that the proportions differ from those claimed by the college.
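
The steps of this test can be reproduced with a short Python sketch. The variable names are assumptions for illustration, and the p-value line uses a closed form for the right-tail area that is valid only for 3 degrees of freedom.

```python
import math

observed = [148, 112, 89, 36]        # humanities, liberal arts, biological, physical
claimed = [0.32, 0.28, 0.26, 0.14]   # proportions claimed by the college
n = sum(observed)                    # 385 students

expected = [n * p for p in claimed]
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(terms)

# Right-tail area for 3 degrees of freedom (closed form valid for df = 3 only).
p_value = (math.erfc(math.sqrt(chi2 / 2))
           + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))

alpha = 0.05
print([round(e, 1) for e in expected])
print(round(chi2, 3), round(p_value, 4))
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Because the code sums unrounded terms, its statistic can differ from the hand computation in the last decimal place; the conclusion is unaffected.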

As you probably noted in the example above, we did not write out all of the claimed proportions in our hypotheses. They were certainly present in the original problem, and we could have included them in our table. Standard practice, however, is to state the hypotheses in words, since a long list of proportions is typically not very informative. For the mathematically inclined, the null hypothesis could have been written as:

\begin{equation} H_0: p_1 = 0.32, p_2 = 0.28, p_3 = 0.26, p_4 = 0.14 \end{equation}

Testing Independence

A car manufacturer wants to know whether customers' preferences for vehicle style and color are independent. They randomly sample their sales from the past year, and observe the following results.

	Silver	Black	White	Red	Totals
Sedan	21	28	16	23	88
Minivan	17	15	19	18	69
Truck	13	22	18	20	73
Totals	51	65	53	61	230

We have not been provided with expected counts, but under the assumption of independence we can estimate them from the totals in the table. The formula is $np_iq_j$, where $n$ is the total number of observations, $p_i$ is the probability of obtaining an observation in row $i$, and $q_j$ is the probability of obtaining an observation in column $j$, each estimated from the corresponding marginal total. Computing the expected count for a silver sedan, we would have:

\begin{equation} np_1 q_1 = 230 \left(\dfrac{88}{230}\right) \left(\dfrac{51}{230}\right) = \dfrac{88 \times 51}{230} = 19.513 \end{equation}

Notice the cancellation that occurred, which allows us to skip the computation of each marginal probability, and compute with only the observed counts. The completed table of expected values follows.

	Silver	Black	White	Red
Sedan	19.513	24.870	20.278	23.339
Minivan	15.3	19.5	15.9	18.3
Truck	16.187	20.630	16.822	19.361
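
The table of expected values can be generated directly from the observed counts. The following is a minimal sketch, with the dictionary layout and variable names chosen only for illustration.

```python
observed = {
    "Sedan":   [21, 28, 16, 23],
    "Minivan": [17, 15, 19, 18],
    "Truck":   [13, 22, 18, 20],
}
colors = ["Silver", "Black", "White", "Red"]

row_totals = {style: sum(counts) for style, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values())
              for j in range(len(colors))]
n = sum(row_totals.values())

# E_ij = (row total) * (column total) / n, after the cancellation noted above.
expected = {
    style: [row_totals[style] * c / n for c in col_totals]
    for style in observed
}

for style, row in expected.items():
    print(style, [round(e, 3) for e in row])
```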

We note that all of the expected values are greater than 5. Using the tables for both the observed and expected values, we compute the quantity $\dfrac{(O_i-E_i)^2}{E_i}$. The results are in the following table.

	Silver	Black	White	Red
Sedan	0.1133	0.3939	0.9025	0.0049
Minivan	0.1889	1.0385	0.6044	0.0049
Truck	0.6275	0.0910	0.0825	0.0211

The hypothesis test proceeds as follows:

The hypotheses are:
$H_0: \text{ vehicle style and color are independent}$
$H_a: \text{ vehicle style and color are related}$
We shall choose $\alpha = 0.05$.
The test statistic is
$\chi^2 = \sum\limits_{i=1}^k \dfrac{(O_i-E_i)^2}{E_i} = 4.073$
We have $(r-1)(c-1) = (3-1)(4-1) = 6$ degrees of freedom.
The p-value is $p = \chi^2 \text{cdf} (4.073,\infty,6) = 0.6668$.
Since $p > \alpha$, we fail to reject $H_0$.
There is insufficient evidence to conclude that vehicle style and color are related.
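
The whole independence test can be sketched in Python as follows. The variable names are assumptions for illustration, and the p-value line uses a closed form for the right-tail area that is valid only for 6 degrees of freedom.

```python
import math

table = [
    [21, 28, 16, 23],   # Sedan
    [17, 15, 19, 18],   # Minivan
    [13, 22, 18, 20],   # Truck
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected count under independence
        chi2 += (o - e) ** 2 / e

df = (len(table) - 1) * (len(table[0]) - 1)

# Right-tail area for 6 degrees of freedom (closed form valid for df = 6 only).
half = chi2 / 2
p_value = math.exp(-half) * (1 + half + half ** 2 / 2)

print(df, round(chi2, 3), round(p_value, 4))
```

The code carries unrounded expected counts throughout, so its statistic and p-value can differ from the hand-rounded figures in the last decimal place.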

Mathematical Justifications

We begin with the binomial case, where there are just two categories. For this situation, $p_1 + p_2 = 1$, and $x_1 + x_2 = n$, so we have the following result.

\begin{align} \sum\limits_{i=1}^2 \dfrac{(O_i - E_i)^2}{E_i} &= \dfrac{(x_1 - np_1)^2}{np_1} + \dfrac{(x_2 - np_2)^2}{np_2} \\ &= \dfrac{(x_1 - np_1)^2}{np_1} + \dfrac{(n - x_1 - n(1 - p_1))^2}{n(1-p_1)} \\ &= \dfrac{(x_1 - np_1)^2}{np_1} \dfrac{(1-p_1)}{(1-p_1)} + \dfrac{(x_1 - np_1)^2}{n(1-p_1)} \dfrac{p_1}{p_1} \\ &= \dfrac{(x_1 - np_1)^2}{np_1(1-p_1)} \\ &= \left( \dfrac{x_1 - \mu}{\sigma} \right)^2 \\ &= z_1^2 \end{align}

Remembering that binomial distributions are approximately normal when $np \ge 5$ and $n(1-p) \ge 5$, and that if $Z$ has a standard normal distribution then $Z^2$ has a chi-square distribution with one degree of freedom, we see that the test statistic of Pearson's Goodness-of-Fit Test will have approximately a chi-square distribution.
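
The algebraic identity in the two-category case is easy to confirm numerically. In the sketch below, the function names and the sample values (20 successes in 50 trials with $p_1 = 0.3$) are hypothetical, chosen only to illustrate that the two expressions agree.

```python
def two_category_statistic(x1, n, p1):
    """Pearson statistic with the two categories 'success' and 'failure'."""
    x2, p2 = n - x1, 1 - p1
    return (x1 - n * p1) ** 2 / (n * p1) + (x2 - n * p2) ** 2 / (n * p2)


def z_squared(x1, n, p1):
    """Square of the standardized binomial count, ((x1 - mu) / sigma)^2."""
    mu = n * p1
    sigma_sq = n * p1 * (1 - p1)
    return (x1 - mu) ** 2 / sigma_sq


# Hypothetical values: 20 successes in 50 trials with p1 = 0.3.
print(two_category_statistic(20, 50, 0.3), z_squared(20, 50, 0.3))
```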

Next, we consider the equiprobable case, where each of the $k$ categories has probability $p = \dfrac{1}{k}$. The argument proceeds as follows:

\begin{align} \sum\limits_{i=1}^k \dfrac{(O_i - E_i)^2}{E_i} &= \sum\limits_{i=1}^k \dfrac{(x_i - np_i)^2}{np_i} \dfrac{(1-p_i)}{(1-p_i)} \\ &= \sum\limits_{i=1}^k (1 - p_i) \left( \dfrac{x_i - np_i}{\sqrt{np_i(1-p_i)}} \right)^2 \\ &= \sum\limits_{i=1}^k \left( 1 - \dfrac{1}{k} \right) \left( \dfrac{x_i - \mu}{\sigma} \right)^2 \\ &= \dfrac{k-1}{k} \sum\limits_{i=1}^k z^2 \end{align}

Now the $k$ terms of $Z^2$ are identically distributed (so no subscript is required), but only $k-1$ of them are independent, since the counts must sum to $n$. The fraction $\dfrac{k-1}{k}$ effectively changes the sum of $k$ terms into a sum of $k-1$ terms. Thanks to the additivity of chi-square distributions, as well as the necessary assumption for approximating a binomial with a normal distribution, the final expression will have an approximately chi-square distribution with $k-1$ degrees of freedom.
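
One way to probe this conclusion is by simulation. The sketch below is an illustration under assumed settings ($k = 4$ categories, $n = 100$ observations, 2000 repetitions, and a fixed seed), none of which come from the argument above. A chi-square distribution with $k-1$ degrees of freedom has mean $k-1$, so the average of the simulated statistics should land near 3.

```python
import random


def simulate_statistic(k, n, rng):
    """Draw n observations uniformly over k categories; return Pearson's statistic."""
    counts = [0] * k
    for _ in range(n):
        counts[rng.randrange(k)] += 1
    expected = n / k                    # the same expected count in every category
    return sum((o - expected) ** 2 / expected for o in counts)


rng = random.Random(12345)              # fixed seed so the run is repeatable
k, n, trials = 4, 100, 2000             # assumed settings for this illustration
values = [simulate_statistic(k, n, rng) for _ in range(trials)]

mean = sum(values) / trials
print(round(mean, 2))                   # should be close to k - 1 = 3
```

Agreement of the mean is only a rough check; comparing a histogram of `values` against the chi-square density would be a more thorough one.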

In the independence test, if each row has equal probability, and each column has equal probability, then the argument is quite similar. Let us assume that $p = \dfrac{1}{r}$ represents the probability of each row, and $q = \dfrac{1}{c}$ is the probability of each column. The expected value for each individual cell is $E_{ij} = np_i q_j$, while the expected value for a row is $E_i = np_i$. We then have:

\begin{align} \sum\limits_{j=1}^c \sum\limits_{i=1}^r \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}} &= \sum\limits_{j=1}^c \sum\limits_{i=1}^r \dfrac{(x_{ij} - np_iq_j)^2}{np_i q_j} \dfrac{(1-p_i)(1-q_j)}{(1-p_i)(1-q_j)} \\ &= \sum\limits_{j=1}^c \sum\limits_{i=1}^r (1-p_i)(1-q_j) \left( \dfrac{x_{ij} - np_iq_j}{\sqrt{np_i(1-p_i)q_j(1-q_j)}} \right)^2 \\ &= \sum\limits_{j=1}^c \sum\limits_{i=1}^r \left(1 - \dfrac{1}{r} \right) \left(1 - \dfrac{1}{c} \right) \left( \dfrac{x_{ij} - \mu}{\sigma} \right)^2 \\ &= \dfrac{(r-1)(c-1)}{rc} \sum\limits_{j=1}^c \sum\limits_{i=1}^r z^2 \end{align}

As with the previous case, the sum involves $rc$ terms of $Z^2$, which are identically distributed but not independent. The fraction $\dfrac{(r-1)(c-1)}{rc}$ effectively gives just $(r-1)(c-1)$ terms, which is the number of independent variables. Thus, assuming expected cell frequencies are sufficient for approximating a binomial with a normal distribution, the final expression will have approximately a chi-square distribution with $(r-1)(c-1)$ degrees of freedom.

The general case is much more difficult. Basically, with the number of categories held fixed, the distribution of the test statistic approaches the $\chi^2$ distribution as the number of observations approaches infinity. As in the previous cases, and with the previous assumptions, the test statistic will have approximately a chi-square distribution (although the approximation is probably rougher than in the previous cases). It is this general case that is most frequently encountered in applications.
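
The quality of the approximation with unequal probabilities can also be explored by simulation. The sketch below reuses the college's claimed proportions and sample size from the first example; the number of repetitions and the seed are assumptions for illustration. It uses 7.815, the 5% critical value of the chi-square distribution with 3 degrees of freedom. If the approximation is good, a true null hypothesis should be rejected in roughly 5% of the simulated samples.

```python
import random

probs = [0.32, 0.28, 0.26, 0.14]    # the college's claimed proportions
n, trials = 385, 2000               # trials is an assumed setting
critical = 7.815                    # chi-square critical value, 3 df, alpha = 0.05
rng = random.Random(2024)           # fixed seed so the run is repeatable

expected = [n * p for p in probs]
rejections = 0
for _ in range(trials):
    # Draw one sample of n observations with the claimed proportions.
    counts = [0] * len(probs)
    for category in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[category] += 1
    chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    if chi2 > critical:
        rejections += 1

rate = rejections / trials
print(rate)                         # should be near 0.05 if the approximation is good
```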