Simpson's Paradox in Statistical Data Analysis

Blog Post
March 03, 2020

Can Big Data Lie: A Statistical Paradox

While I have spent most of my career exploring and leveraging data science and its associated rigorous analytics, only in the past decade or so have so many managers accepted the obvious: that data-driven decision-making far surpasses the old seat-of-the-pants approach. Sophisticated data warehouses, the acceptance of cloud computing, and the huge availability of data have all contributed to this new way of thinking. Consider the following observation, which appeared in Forbes magazine:

“The amount of data we produce every day is truly mind-boggling. There are 2.5 quintillion bytes of data created each day at our current pace, but that pace is only accelerating with the growth of the Internet of Things (IoT). Over the last two years alone 90 percent of the data in the world was generated.”

The need to remain competitive, coupled with modern analytics tools and processes, has made data science, data analysis, and interpretation critical across a huge range of disciplines. Subjective investigation has given way to objective evaluation. We use these tools to interpret data and convert it into telling insight.

This is not to say that all our analyses are without fault. Inconsistent results often pop up, and for the novice analyst, and truth be told sometimes even for the more experienced one, an uncomfortable and bewildering situation can occur. Let’s walk through one such well-known statistical paradox.

It is not uncommon in the analytics workspace to arrive at what appears to be a logical conclusion, only to find that a bit more quantitative juggling tells an entirely different story. These apparent contradictions force the analyst to prod the data further and determine what the real issues are. Without appropriate care in both analysis and interpretation, faulty decisions and incorrect insights can easily follow.

What could be more typical than evaluating and interpreting campaign results from two different marketing programs? After all, when deciding to deploy a strategic alternative, metrics are compared among the choices and the winner is selected as the new standard to work with. What could be simpler? The old A/B comparisons have probably been around longer than the term “analytics” has.

So let’s take a quick look at the results of an email program that I recently observed. A new creative was evaluated against the current control package. Four segments of customers, based on income, were identified by the marketer. The results were monitored and tabulated by these groups.

[Table: Click rates by income segment, test vs. control]

The marketing manager was quite excited about these four segments and she immediately scanned each of the four rows in the table. Wow, was she disappointed! In each case, the control outperformed the test program. Below, the results are graphically displayed.

[Chart: Click rate by segment, test vs. control]

Looks like management has not found a good alternative to the control.

But wait a minute. While it is true that the test did not outperform the control in any of the four segments, in total the test program clearly surpassed the standard control package: a 1.21% click rate vs. 1.16%. Appropriate statistical tests were performed, and the results demonstrated that a significant difference exists between the two programs! The results are displayed below.

[Chart: Overall click rate, test vs. control]

What’s going on? How do we explain this? Which program is better? How can results by segment point to a conclusion that is in direct contradiction to results for the total? Seems like a paradox. Indeed, researchers refer to this sort of incongruity as Simpson’s paradox.
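To see how such a reversal can happen mechanically, here is a minimal Python sketch using hypothetical segment counts (assumed for illustration; these are not the campaign figures from the tables above). The control wins within every income segment, yet the test wins in total, because the two programs were sent in very different proportions across the segments:

```python
# Hypothetical counts chosen to illustrate Simpson's paradox;
# these are NOT the actual campaign figures discussed above.
# Each entry: segment -> (clicks, emails sent)
test = {
    "income_1": (80, 4000),
    "income_2": (45, 3000),
    "income_3": (20, 2000),
    "income_4": (5, 1000),
}
control = {
    "income_1": (21, 1000),
    "income_2": (32, 2000),
    "income_3": (33, 3000),
    "income_4": (24, 4000),
}

def rate(clicks, sent):
    return clicks / sent

# Within every segment, the control beats the test...
for seg in test:
    t, c = rate(*test[seg]), rate(*control[seg])
    print(f"{seg}: test {t:.2%} vs control {c:.2%}")
    assert t < c

# ...yet in aggregate, the test beats the control, because most of the
# test volume went to the high-response segments.
t_total = rate(sum(k for k, _ in test.values()), sum(n for _, n in test.values()))
c_total = rate(sum(k for k, _ in control.values()), sum(n for _, n in control.values()))
print(f"total: test {t_total:.2%} vs control {c_total:.2%}")
assert t_total > c_total
```

The reversal is driven entirely by the allocation of volume: the test mailed heavily to the strong segments, the control to the weak ones.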

I first became familiar with this sort of inconsistency as I was reading an article quite a while ago (longer than I care to admit) on health outcomes of smokers vs. nonsmokers. The prevalence of current cigarette smoking among adults has certainly declined. However, more than 42 million Americans still smoke, and mortality rates are still too high. Take a look at a similar analysis that was summarized in a lecture entitled “How to lie with statistics.” A sample of individuals was studied, and the results were recorded as follows:

[Table: Mortality rates, smokers vs. nonsmokers]

[Chart: Mortality rates, smokers vs. nonsmokers]

Wow! What’s going on here? Is it possible? Nonsmokers have higher mortality rates than smokers? Let’s see what happens when we segment by age.

[Table: Mortality rates by age group, smokers vs. nonsmokers]

Viewing the results graphically presents an entirely different picture. Now we conclude that smokers have higher mortality rates in virtually every age group. Strange, isn’t it?

[Chart: Mortality rates by age group, smokers vs. nonsmokers]

Again, we see Simpson’s paradox rear its ugly head. It shows that a pattern present within multiple groups can reverse when the groups are combined. This inconsistency is well known to experienced statisticians but bewildering to many novices.

However, as we take the statistical data analysis one step further and note the distribution by age group, we can begin to understand what is really going on.

[Chart: Age distribution, smokers vs. nonsmokers]

The age distributions vary considerably between smokers and nonsmokers. Essentially, the nonsmoking sample is older: about 1 in 4 nonsmokers falls in the two oldest categories, compared with only about 1 in 12 smokers. The greater percentage of older nonsmokers drives up the average mortality for that segment. Analyzing the results by age thus yields a more reasonable conclusion, one consistent with the accepted finding that long-term smoking reduces life expectancy and thereby shapes the age distributions.

Simpson’s paradox can emerge when we witness two phenomena:

  1. A lurking variable exists, one that should be cross-tabulated with the outcome variable we are trying to explain. This variable is age in the smoker example and income in the marketing illustration.
  2. The samples drawn from each of these categories (the age groupings) differ substantially in size between the groups being compared.

When these circumstances exist, they team up to create paradoxical relationships—Simpson’s paradox.

To lessen the chances of experiencing this problem, the manager should take the necessary precautions to construct the groups to be as comparable as possible. Be extra careful as you decide on sample sizes.
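One standard precaution is direct standardization: re-weight each group's segment-level rates by a single common distribution before comparing. The sketch below uses hypothetical age-specific mortality rates and sample sizes (assumed for illustration, not the data from the lecture above) and shows the crude comparison reversing once both groups are standardized to the combined age distribution:

```python
# Hypothetical age-specific mortality rates and sample sizes
# (illustrative assumptions, not the lecture's actual data).
ages = ["young", "middle", "old"]
rates = {
    "smoker":    {"young": 0.02, "middle": 0.10, "old": 0.40},
    "nonsmoker": {"young": 0.01, "middle": 0.08, "old": 0.35},
}
counts = {
    "smoker":    {"young": 800, "middle": 150, "old": 50},
    "nonsmoker": {"young": 300, "middle": 300, "old": 400},
}

def crude_rate(group):
    # Weighted by the group's own (skewed) age distribution.
    total = sum(counts[group].values())
    return sum(counts[group][a] * rates[group][a] for a in ages) / total

def standardized_rate(group):
    # Weighted by the combined population's age distribution instead.
    std = {a: counts["smoker"][a] + counts["nonsmoker"][a] for a in ages}
    total = sum(std.values())
    return sum(std[a] * rates[group][a] for a in ages) / total

# Crude rates mislead: nonsmokers look worse only because the
# nonsmoking sample is much older.
assert crude_rate("smoker") < crude_rate("nonsmoker")

# After standardizing to a common age mix, smokers are worse off,
# matching the within-age-group comparison.
assert standardized_rate("smoker") > standardized_rate("nonsmoker")
```

Standardization cannot rescue a badly designed study, but it makes comparisons honest when the groups' compositions differ.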

If, however, the test involves only the marketing group—for example, analyzing best customers—then Simpson’s paradox is not an issue. Our concern here is how the group behaves. We have little interest in how the entire sample performs.

For those of you comfortable with a bit of algebraic manipulation, look closely at the following illustration, which is nothing more than an application of Simpson’s paradox.

  A. 1/3 < 3/8 (.333 < .375)
  B. 7/10 < 3/4 (.70 < .75)

Now let’s combine numerators and denominators.

  1. (1+7)/(3+10) ? (3+3)/(8+4)
  2. 8/13 ? 6/12
  3. .615 > .50

This is inconsistent with A and B above: Simpson’s paradox!

While A and B are certainly true, combining them will result in a change in direction (from less than to greater than) for the fractions! A similar treatment of these inequalities can be found in the Stanford Encyclopedia of Philosophy.
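The arithmetic above is easy to verify mechanically. A quick check with the standard library’s `fractions` module confirms that A and B hold while their combination points the other way:

```python
from fractions import Fraction

# A: 1/3 < 3/8 and B: 7/10 < 3/4, as stated above.
assert Fraction(1, 3) < Fraction(3, 8)
assert Fraction(7, 10) < Fraction(3, 4)

# Combine numerators and denominators of the left-hand sides,
# and of the right-hand sides.
combined_left = Fraction(1 + 7, 3 + 10)   # 8/13, about .615
combined_right = Fraction(3 + 3, 8 + 4)   # 6/12, exactly .50

# The inequality flips: Simpson's paradox in miniature.
assert combined_left > combined_right
```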

Why is Simpson’s paradox noteworthy? Obviously, results can be misinterpreted. More importantly, though, it illustrates a rule many analysts disregard: results are not always what they appear to be. It is critical that we understand where the data came from and how the statistical data analysis was designed. Some may consider these concerns outside a data scientist’s job, but they are central to it. Many analysts aren’t taught to think this way. They should be. Remember the cardinal rule in analytics: garbage in, garbage out.