Simpson's paradox: the paradox of aggregation

Zahid Irfan
Apr 03, 2024
2 min read

In chemistry, mixing chemicals often produces better results. The same can be true of combining statistical data. However, just as in chemistry, combining does not always improve the result.

A relationship between variables may appear in the aggregated data but disappear when we divide the data into subgroups, or the other way around. For example, across a whole population, salary may increase with age, yet the relationship can disappear once we look at each age group (young and old) separately.
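The salary example can be sketched in a few lines of Python. The ages and salaries below are hypothetical numbers invented purely for illustration (salaries in thousands), and `pearson` is a small helper written for this sketch, not a library function. The data is built so that salary and age are uncorrelated within each age group, while the pooled data shows a strong positive correlation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data, made up for this illustration only.
young_age, young_salary = [22, 24, 26, 28], [30, 34, 34, 30]
old_age, old_salary = [52, 54, 56, 58], [60, 64, 64, 60]

r_young = pearson(young_age, young_salary)
r_old = pearson(old_age, old_salary)
r_all = pearson(young_age + old_age, young_salary + old_salary)

print(f"young: {r_young:.2f}, old: {r_old:.2f}, pooled: {r_all:.2f}")
```

In this made-up data, the pooled correlation comes entirely from the gap between the two groups, not from any trend inside either group.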

Yule-Simpson effect

Let $A$, $B$, and $C$ be events with $A'$, $B'$, and $C'$ as their respective complements. Consider a situation where event $A$ is more likely to occur when $B$ does not occur:

$P(A \mid B) < P(A \mid B')$

Taking event $C$ into account can lead to a different result. The probability of $A$ given both $B$ and $C$ might be greater than the probability of $A$ given that $C$ occurs and $B$ does not:

$P(A \mid B \cap C) > P(A \mid B' \cap C)$

Likewise, the probability of $A$ given that $B$ occurs and $C$ does not might be greater than the probability of $A$ given that neither $B$ nor $C$ occurs:

$P(A \mid B \cap C') > P(A \mid B' \cap C')$
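To make the effect concrete, here is a minimal Python sketch. The counts are hypothetical, chosen only so that the reversal shows up, and the names `counts` and `rate` are introduced for this illustration. Each entry holds the number of times $A$ occurred and the number of trials for one combination of $B$ (or $B'$) and $C$ (or $C'$).

```python
from fractions import Fraction

# Hypothetical (occurrences of A, trials) per combination; illustration only.
counts = {
    ("B", "C"): (8, 10),
    ("B'", "C"): (70, 100),
    ("B", "C'"): (30, 100),
    ("B'", "C'"): (2, 10),
}

def rate(b, c=None):
    """P(A | b) when c is None, otherwise P(A | b and c)."""
    cells = [v for (bb, cc), v in counts.items()
             if bb == b and (c is None or cc == c)]
    occurrences = sum(s for s, _ in cells)
    trials = sum(n for _, n in cells)
    return Fraction(occurrences, trials)

not_b = "B'"
for c in ("C", "C'"):
    with_b, without_b = float(rate("B", c)), float(rate(not_b, c))
    print(f"given {c}: with B = {with_b:.2f}, without B = {without_b:.2f}")

overall_b, overall_not_b = float(rate("B")), float(rate(not_b))
print(f"overall: with B = {overall_b:.2f}, without B = {overall_not_b:.2f}")
```

With these made-up counts, $A$ is more likely with $B$ inside each subgroup, yet less likely with $B$ once the subgroups are pooled, because the subgroup sizes are very unbalanced.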

Example

Let’s look at an example where there are two ways to treat a particular disease: perform surgery or administer medicine.

Here, the events are as follows: $A$ = success and $A'$ = failure; $B$ = surgery and $B'$ = medicine; $C$ = male and $C'$ = female. We use $A$, $B$, and $C$ to keep the notation consistent with the definitions above.

Suppose a total of 50 patients opted for surgery and 30 for medicine. The following data shows that, overall, neither procedure outperforms the other: each succeeds in exactly half of its cases.

| Procedure | Success (A) | Failure (A') |
| --- | --- | --- |
| Surgery (B) | 25 | 25 |
| Medicine (B') | 15 | 15 |

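We can see that both procedures have the same overall success rate:

$P(A \mid B) = 25/50 = 0.5$ and $P(A \mid B') = 15/30 = 0.5$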

Now, let’s look at the data when broken down by sex:

| Procedure | Male (C): Success (A) | Male (C): Failure (A') | Female (C'): Success (A) | Female (C'): Failure (A') |
| --- | --- | --- | --- | --- |
| Surgery (B) | 15 (65%) | 8 (35%) | 10 (37%) | 17 (63%) |
| Medicine (B') | 8 (53%) | 7 (47%) | 7 (47%) | 8 (53%) |

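Let's look at the following probabilities:

$P(A \mid B \cap C) = 15/23 \approx 0.65$ and $P(A \mid B' \cap C) = 8/15 \approx 0.53$

$P(A \mid B \cap C') = 10/27 \approx 0.37$ and $P(A \mid B' \cap C') = 7/15 \approx 0.47$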

Here, it's clear that the aggregated results differ from the results for each sex. The aggregated result (disregarding sex) shows no difference between the procedures: both succeed half of the time. Broken down by sex, however, surgery has a higher chance of success than medicine for male patients (65% vs. 53%) and a lower chance for female patients (37% vs. 47%). Medicine, by contrast, is almost equally successful for male and female patients (53% vs. 47%).
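The comparison above can be reproduced with a short Python sketch. The counts are copied from the tables in this example; the dictionary layout and the function names are assumptions made for this illustration. The last helper recombines the per-sex success rates, weighting each by that sex's share of the procedure's patients, which is one way to see why equal overall rates can hide different rates within each group.

```python
from fractions import Fraction

# (successes, failures) per procedure and sex, taken from the tables above.
data = {
    "surgery": {"male": (15, 8), "female": (10, 17)},
    "medicine": {"male": (8, 7), "female": (7, 8)},
}

def success_rate(successes, failures):
    return Fraction(successes, successes + failures)

def stratum_rate(procedure, sex):
    """Success rate for one procedure within one sex."""
    return success_rate(*data[procedure][sex])

def aggregate_rate(procedure):
    """Success rate for one procedure, disregarding sex."""
    successes = sum(s for s, _ in data[procedure].values())
    failures = sum(f for _, f in data[procedure].values())
    return success_rate(successes, failures)

def weighted_average(procedure):
    """Per-sex rates weighted by each sex's share of the procedure's patients."""
    total = sum(s + f for s, f in data[procedure].values())
    return sum(
        stratum_rate(procedure, sex) * Fraction(s + f, total)
        for sex, (s, f) in data[procedure].items()
    )

for procedure in data:
    print(f"{procedure}: overall {float(aggregate_rate(procedure)):.2f}")
    for sex in data[procedure]:
        print(f"  {sex}: {float(stratum_rate(procedure, sex)):.2f}")
```

Using `Fraction` keeps the arithmetic exact, so the weighted average of the per-sex rates matches the aggregated rate without rounding error.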

Conclusion

The paradox of aggregation, also commonly known as Simpson’s paradox, is not really a paradox. It’s just that patterns are obfuscated when data features are aggregated or separated.

Why is this important for data scientists? Data scientists base their inferences and design models on data features. Simpson’s paradox provides an important insight that data scientists should look at different dataset features and perform a thorough analysis. This could help reduce bad decisions made by their models.
