Simpson's Paradox

Percentages can produce surprisingly unintuitive results. Simpson's Paradox is the observation that one conclusion may be reached when data is analyzed in the aggregate, and the opposite conclusion may be reached when the same data is analyzed in smaller groups.

An Example

A civic orchestra is being formed in your community and is holding auditions. Among the wind musicians, 26 of the 40 men are invited to join, as are 4 of the 6 women. Among the string musicians, 30 of the 80 men are invited, as are 19 of the 49 women. Which gender was favored in the selection of musicians?

This looks quite easy. We can obtain totals for the men and the women, and then convert the counts into percentages. We find:

Among the 120 men, 56 were offered a position. That is, $\dfrac{56}{120}=0.4667=46.67\%$ of the men succeeded.
Among the 55 women, 23 were offered a position. That is, $\dfrac{23}{55}=0.4182=41.82\%$ of the women succeeded.

It looks like there was a small bias against the women. So where did this discrimination come from, the strings or the winds? Again, we compute some percentages.

Among the male wind musicians, $\dfrac{26}{40}=0.65 = 65\%$ were invited.
Among the female wind musicians, $\dfrac46=0.6667 = 66.67\%$ were invited.
Among the male string musicians, $\dfrac{30}{80}=0.375 = 37.5\%$ were invited.
Among the female string musicians, $\dfrac{19}{49}=0.3878 = 38.78\%$ were invited.

In each class of instruments, we find that the percentage of women invited was (very slightly) greater than the percentage of men invited. What happened to the bias against the women? This is quite contrary to our original conclusion.
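The two computations above can be reproduced with a short script. This is only a sketch: the counts come from the example, while the dictionary layout and the function names (`rate`, `section_rates`, `aggregate_rates`) are hypothetical choices made for illustration.

```python
# Audition counts from the example: (invited, auditioned) per section and gender.
counts = {
    "winds": {"men": (26, 40), "women": (4, 6)},
    "strings": {"men": (30, 80), "women": (19, 49)},
}

def rate(invited, auditioned):
    """Return the fraction of auditioning musicians who were invited."""
    return invited / auditioned

def section_rates(counts):
    """Invitation rate for each gender within each section."""
    return {
        section: {g: rate(*pair) for g, pair in by_gender.items()}
        for section, by_gender in counts.items()
    }

def aggregate_rates(counts):
    """Invitation rate for each gender with the sections pooled together."""
    totals = {}
    for by_gender in counts.values():
        for g, (invited, auditioned) in by_gender.items():
            inv, aud = totals.get(g, (0, 0))
            totals[g] = (inv + invited, aud + auditioned)
    return {g: rate(*pair) for g, pair in totals.items()}

by_section = section_rates(counts)
overall = aggregate_rates(counts)

# Print each rate as a percentage with two decimal places.
for section, rates in by_section.items():
    print(section, {g: f"{r:.2%}" for g, r in rates.items()})
print("overall", {g: f"{r:.2%}" for g, r in overall.items()})
```

Comparing `by_section` with `overall` shows the reversal directly: the women's rate is higher within each section, while the men's rate is higher once the sections are pooled.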

An Explanation

The two groups (winds and strings) did not accept musicians at the same rate. Among the strings, $\dfrac{49}{129} = 0.3798 = 37.98\%$ of the applicants were accepted, and both the men and the women were accepted at about that rate. Among the winds, $\dfrac{30}{46}=0.6522=65.22\%$ of the applicants were accepted, and again both the men and the women were accepted at about that rate. The acceptance rate for the strings, however, was far lower than for the winds.

So, if you were to join this orchestra and could play every instrument, what position would you choose to maximize your chance of being accepted? Obviously, you would audition to play a wind instrument. In this example, 40 of the 46 applicants for the wind positions were men, so a third of all the men auditioned in the section with the high acceptance rate, and this pulled up the overall acceptance rate for the men. Only 6 of the 55 women auditioned on wind instruments, so the women's overall acceptance rate stayed far closer to the acceptance rate of the strings in general.
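The effect of the applicant mix can be made concrete with a small sketch. The counts and section rates are those of the example; the helper names `wind_share` and `pooled_rate` are hypothetical and introduced only for illustration.

```python
# Number of musicians who auditioned, by gender and section (from the example).
auditioned = {
    "men": {"winds": 40, "strings": 80},
    "women": {"winds": 6, "strings": 49},
}
# Invitation rates within each section (from the example).
rates = {
    "men": {"winds": 26 / 40, "strings": 30 / 80},
    "women": {"winds": 4 / 6, "strings": 19 / 49},
}

def wind_share(gender):
    """Fraction of a gender's applicants who auditioned on a wind instrument."""
    by_section = auditioned[gender]
    return by_section["winds"] / sum(by_section.values())

def pooled_rate(gender):
    """Overall rate as the applicant-weighted average of the section rates."""
    total = sum(auditioned[gender].values())
    weighted = sum(auditioned[gender][s] * rates[gender][s]
                   for s in auditioned[gender])
    return weighted / total

for g in auditioned:
    print(g, f"wind share {wind_share(g):.1%}", f"pooled rate {pooled_rate(g):.2%}")
```

Because each pooled rate is a weighted average of the section rates, the gender that sends a larger share of its applicants to the high-acceptance section can come out ahead overall even with a lower rate in every section.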

In other words, Simpson's Paradox arises when we try to average percentages (or probabilities). An aggregate percentage is an average of the group percentages weighted according to the underlying sample sizes, and the populations being compared may carry very different weights. Therefore, we could state the paradox as follows.

Even though $P(A_1) < P(B_1)$ and $P(A_2) < P(B_2)$,

there may still be values of the group sizes $m_1$, $m_2$, $n_1$, and $n_2$ such that

$\dfrac{m_1 P(A_1) + m_2 P(A_2)}{m_1+m_2} > \dfrac{n_1 P(B_1) + n_2 P(B_2)}{n_1+n_2}$.

In our original example, event $A_i$ would be that a male applicant was selected and event $B_i$ would be that a female applicant was selected, with the subscript denoting the particular class of instrument played. Then $m_i$ is the number of men and $n_i$ is the number of women who auditioned in instrument class $i$.
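As a check, the orchestra numbers can be substituted into this inequality. Assume, for illustration, that subscript 1 denotes the winds and subscript 2 the strings, with $m_i$ counting the men and $n_i$ counting the women who auditioned in each class, so that $m_1 = 40$, $m_2 = 80$, $n_1 = 6$, and $n_2 = 49$. Within each class the men's rate is the lower one:

$P(A_1) = 0.65 < 0.6667 \approx P(B_1)$ and $P(A_2) = 0.375 < 0.3878 \approx P(B_2)$,

yet the weighted averages satisfy

$\dfrac{40(0.65) + 80(0.375)}{40+80} = \dfrac{56}{120} \approx 0.4667 > 0.4182 \approx \dfrac{23}{55} = \dfrac{6\left(\frac{4}{6}\right) + 49\left(\frac{19}{49}\right)}{6+49}$.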

The original question was whether one gender was favored over another in being accepted into the orchestra. In this question, there are two variables, gender and acceptance. When we examined the data by section of the orchestra, we were actually introducing a third variable, often called a lurking variable. The lurking variable had a great deal to do with the final results, although it was not apparent when the data was analyzed in the aggregate.

Simpson's Paradox in Life

When Simpson's Paradox occurs in the analysis of data from two groups, a comparison in the aggregate will find one group excelling, while a comparison by partitioning the data will find the other group excelling. Some situations where the paradox actually arose include:

A comparison of men's and women's admission rates at Berkeley in the 1970s seemed to indicate discrimination against women, but an analysis of admission data at the department level found women were generally favored.
A comparison of two treatments for kidney stones in the 1980s identified one as better overall, yet when the size of the kidney stone was considered as a factor, it was found that the other treatment was more successful in each subgroup.
A comparison of flight delays for two airlines in the early 1990s found one airline delayed more often overall, yet when the data was broken down, the other airline had more delays in a city-by-city comparison.
A comparison of baseball batting averages for Derek Jeter and David Justice found that Justice had a higher average in both 1995 and 1996, but if the years were combined, Jeter's overall batting average was better.
A comparison of the recessions in 1982 and 2009 found that the total unemployment rate in 1982 was higher, but when the data was broken down by education level, each subgroup's unemployment rate was higher in 2009.

Through an internet search, you can find more about any of these situations, and many other situations as well.