It’s often desirable to mix chemicals to get better results in chemistry. The same is true for combining statistical data. However, just like in chemistry, it is not always the case that the results will be better. Sometimes they are, sometimes they aren’t.
A common observation is that a relationship between variables appears (or disappears) in aggregated data, but the relationship may disappear (or appear) when we split the data into subgroups. For example, salary may appear to increase with age overall, yet the relationship can vanish when we split the population into age subgroups (young and old).
Let A and B be two events, and write P(A | B) for the probability of A given B. Suppose that, in aggregate, the two treatments look equivalent:

P(A | B) = P(A | B')

This could lead to a different result when we consider a third event C. The likelihood of A may then satisfy

P(A | B ∩ C) > P(A | B' ∩ C) and P(A | B ∩ C') < P(A | B' ∩ C')

so the comparison between B and B' reverses within each subgroup.
Let’s look at an example where there are two ways to treat a particular disease: perform surgery or administer medicine.
Here, the events are as follows: {A = Success, A' = Failure}, {B = Surgery, B' = Medicine}, and {C = Male, C' = Female}. We use A, B, and C to keep the representation consistent with the above equations.
Suppose a total of 50 cases opted for surgery and 30 for medicine. Looking at the following data, neither procedure appears better than the other: each succeeds in exactly half of its cases.
| Procedure | Success (A) | Failure (A') |
| --- | --- | --- |
| Surgery (B) | 25 | 25 |
| Medicine (B') | 15 | 15 |
We can see that:

P(A | B) = 25/50 = 0.5 and P(A | B') = 15/30 = 0.5

so the overall success rate is identical for both procedures.
Now, let’s look at the data when broken down by sex:
| Procedure | Male (C): Success (A) | Male (C): Failure (A') | Female (C'): Success (A) | Female (C'): Failure (A') |
| --- | --- | --- | --- | --- |
| Surgery (B) | 15 (65%) | 8 (35%) | 10 (37%) | 17 (63%) |
| Medicine (B') | 8 (53%) | 7 (47%) | 7 (47%) | 8 (53%) |
Let's look at the following probabilities:

- P(A | B ∩ C) = 15/23 ≈ 0.65 versus P(A | B' ∩ C) = 8/15 ≈ 0.53, so surgery succeeds more often for male patients.
- P(A | B ∩ C') = 10/27 ≈ 0.37 versus P(A | B' ∩ C') = 7/15 ≈ 0.47, so medicine succeeds more often for female patients.
Here, it's clear that the aggregated results differ altogether from the subgroup results. Aggregated over sex, the data show no preference for either procedure: both succeed half the time. Broken down by sex, surgery has a higher chance of success for male patients and a higher chance of failure for female patients, while medicine is almost equally successful for male and female patients.
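The reversal above is easy to verify programmatically. The following sketch (the `data` dictionary and `success_rate` helper are illustrative names, not from the original article) recomputes the aggregated and per-sex success rates from the counts in the tables:

```python
# Counts of (successes, failures) per (procedure, sex), taken from the tables above.
data = {
    ("surgery", "male"): (15, 8),
    ("surgery", "female"): (10, 17),
    ("medicine", "male"): (8, 7),
    ("medicine", "female"): (7, 8),
}

def success_rate(procedure, sex=None):
    """Success rate for a procedure, optionally restricted to one sex subgroup."""
    cells = [counts for (proc, s), counts in data.items()
             if proc == procedure and (sex is None or s == sex)]
    successes = sum(s for s, _ in cells)
    total = sum(s + f for s, f in cells)
    return successes / total

# Aggregated over sex: the two procedures look identical.
print(success_rate("surgery"))                       # 0.5
print(success_rate("medicine"))                      # 0.5

# Within subgroups: surgery wins for males, medicine wins for females.
print(round(success_rate("surgery", "male"), 2))     # 0.65
print(round(success_rate("medicine", "male"), 2))    # 0.53
print(round(success_rate("surgery", "female"), 2))   # 0.37
print(round(success_rate("medicine", "female"), 2))  # 0.47
```

Note that the aggregated rate is a weighted average of the subgroup rates, with weights given by subgroup sizes; that weighting is exactly what lets the subgroup pattern vanish in the aggregate.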
The paradox of aggregation, also commonly known as Simpson’s paradox, is not really a paradox. It’s just that patterns are obfuscated when data features are aggregated or separated.
Why is this important for data scientists? Data scientists base their inferences and model designs on data features. Simpson's paradox offers an important lesson: examine a dataset at multiple levels of aggregation and across its subgroups before drawing conclusions. Doing so can help prevent bad decisions driven by models trained on misleading aggregates.