It’s often desirable to mix chemicals to get better results in chemistry. The same is true for combining statistical data. However, just like in chemistry, it is not always the case that the results will be better. Sometimes they are, sometimes they aren’t.
A common observation is that a relationship between variables appears (or disappears) in aggregated data, but the relationship may disappear (or appear) when we split the data into subgroups. For example, salary may appear to increase with age overall, yet the relationship can vanish when we split the population into age subgroups (young and old).
Let A and B be two events, and write P(A | B) for the probability of A given B. Suppose that, in aggregate, the two treatments look equivalent:

P(A | B) = P(A | B')

This could lead to a different result when we consider a third event C. The likelihood of A may then satisfy

P(A | B ∩ C) > P(A | B' ∩ C) and P(A | B ∩ C') < P(A | B' ∩ C')

so the comparison between B and B' reverses within each subgroup.
Let’s look at an example where there are two ways to treat a particular disease: perform surgery or administer medicine.
Here, the events are as follows: {A = Success, A' = Failure}, {B = Surgery, B' = Medicine}, and {C = Male, C' = Female}. We use A, B, and C to keep the representation consistent with the above equations.
Suppose a total of 50 cases opted for surgery and 30 for medicine. Looking at the following data, neither procedure appears better than the other: each succeeds in exactly half of its cases.
| Procedure | Success (A) | Failure (A') |
| --- | --- | --- |
| Surgery (B) | 25 | 25 |
| Medicine (B') | 15 | 15 |
We can see that:

P(A | B) = 25/50 = 0.5 and P(A | B') = 15/30 = 0.5

so the overall success rate is identical for both procedures.
Now, let’s look at the data when broken down by sex:
| Procedure | Male (C): Success (A) | Male (C): Failure (A') | Female (C'): Success (A) | Female (C'): Failure (A') |
| --- | --- | --- | --- | --- |
| Surgery (B) | 15 (65%) | 8 (35%) | 10 (37%) | 17 (63%) |
| Medicine (B') | 8 (53%) | 7 (47%) | 7 (47%) | 8 (53%) |
Let's look at the following probabilities:

- P(A | B ∩ C) = 15/23 ≈ 0.65 versus P(A | B' ∩ C) = 8/15 ≈ 0.53, so surgery succeeds more often for male patients.
- P(A | B ∩ C') = 10/27 ≈ 0.37 versus P(A | B' ∩ C') = 7/15 ≈ 0.47, so medicine succeeds more often for female patients.
Here, it's clear that the aggregated results differ altogether from the subgroup results. Aggregated over sex, the data show no preference for either procedure: both succeed half the time. Broken down by sex, surgery has a higher chance of success for male patients and a higher chance of failure for female patients, while medicine is almost equally successful for male and female patients.
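The reversal above is easy to verify programmatically. The following sketch (the `data` dictionary and `success_rate` helper are illustrative names, not from the original article) recomputes the aggregated and per-sex success rates from the counts in the tables:

```python
# Counts of (successes, failures) per (procedure, sex), taken from the tables above.
data = {
    ("surgery", "male"): (15, 8),
    ("surgery", "female"): (10, 17),
    ("medicine", "male"): (8, 7),
    ("medicine", "female"): (7, 8),
}

def success_rate(procedure, sex=None):
    """Success rate for a procedure, optionally restricted to one sex subgroup."""
    cells = [counts for (proc, s), counts in data.items()
             if proc == procedure and (sex is None or s == sex)]
    successes = sum(s for s, _ in cells)
    total = sum(s + f for s, f in cells)
    return successes / total

# Aggregated over sex: the two procedures look identical.
print(success_rate("surgery"))                       # 0.5
print(success_rate("medicine"))                      # 0.5

# Within subgroups: surgery wins for males, medicine wins for females.
print(round(success_rate("surgery", "male"), 2))     # 0.65
print(round(success_rate("medicine", "male"), 2))    # 0.53
print(round(success_rate("surgery", "female"), 2))   # 0.37
print(round(success_rate("medicine", "female"), 2))  # 0.47
```

Note that the aggregated rate is a weighted average of the subgroup rates, with weights given by subgroup sizes; that weighting is exactly what lets the subgroup pattern vanish in the aggregate.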
The paradox of aggregation, also commonly known as Simpson’s paradox, is not really a paradox. It’s just that patterns are obfuscated when data features are aggregated or separated.
Why is this important for data scientists? Data scientists base their inferences and model designs on data features. Simpson's paradox offers an important lesson: examine a dataset at multiple levels of aggregation and across its subgroups before drawing conclusions. Doing so can help prevent bad decisions driven by models trained on misleading aggregates.