Causation vs. Correlation

5 min read
Mar 22, 2024

Humans are curious and therefore tend to observe. Most scientific discoveries were the result of curiosity coupled with observations. The next step was to relate factors with each other. It would have taken an immense amount of ingenuity for the Greeks to connect the appearance of tides with the moon’s phases. 

Causation#

Causation simply means that one event causes another. Causality can be expressed in a deterministic or a probabilistic context. In the deterministic context, a cause always leads to its effect. In the probabilistic context, a cause increases the probability of the effect, so there's still a chance (however small) that the effect does not occur even when the cause is present. We'll look at the probabilistic context. Mathematically, it can be expressed as follows:

$$P(\text{effect} \mid \text{cause}) > P(\text{effect})$$

An alternate viewpoint, expressed by polymath Judea Pearl, is the idea of the counterfactual: what would have happened had the cause not occurred? This idea is expressed in the following equation. Here, $\neg\text{cause}$ means that the cause has not occurred.

$$P(\text{effect} \mid \text{cause}) > P(\text{effect} \mid \neg\text{cause})$$

Essentially, this means that the probability of the effect should be higher when the cause is present than when it is absent. It might be hard to wrap our heads around, but it simply ties the effect to the cause: take the cause away, and the effect becomes less likely.
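To make the comparison of the two probabilities concrete, here is a minimal simulation sketch. The probabilities 0.8 and 0.2 and all variable names are assumptions made up for illustration; they do not come from any real data. Because the cause is switched on at random in the simulation, comparing the two relative frequencies is a fair stand-in for the inequality above.

```python
import numpy as np

# Hypothetical probabilities, chosen only for illustration.
P_EFFECT_GIVEN_CAUSE = 0.8
P_EFFECT_GIVEN_NO_CAUSE = 0.2

rng = np.random.default_rng(seed=0)
n = 100_000

# The cause is switched on at random for about half of the trials.
cause = rng.random(n) < 0.5

# The effect occurs with a probability that depends on whether the cause occurred.
p_effect = np.where(cause, P_EFFECT_GIVEN_CAUSE, P_EFFECT_GIVEN_NO_CAUSE)
effect = rng.random(n) < p_effect

# Estimate the two conditional probabilities as relative frequencies.
p_with_cause = effect[cause].mean()
p_without_cause = effect[~cause].mean()

print(f"P(effect | cause)     ~ {p_with_cause:.3f}")
print(f"P(effect | not cause) ~ {p_without_cause:.3f}")
```

The first estimate should come out clearly higher than the second. With observational data, where the cause is not assigned at random, the same comparison can be distorted by other factors.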

Example of causation#

Let’s look at an example of an hourly job: the more hours we put in, the more we’re paid. So more hours cause the salary to increase. 

Figure: Causal graph between Hours and Salary (Hours → Salary)

This is called a causal graph. The arrow represents a causal relationship that runs from hours to salary. Because hours cause salary, we should also be able to observe a correlation between the two.

import numpy as np
hours = [1, 2, 4, 8]
salary = [20, 40, 80, 160]  # 20 per hour worked
cor = np.corrcoef(hours, salary)  # 2×2 correlation matrix
print(cor)

As we can see, the correlation is nonzero and positive, so the two are correlated. In addition, the fact that more hours cause an increase in salary allows us to connect them as cause and effect. Can we say the same about the ice cream and T-shirt sales example from the Correlation section? Let’s look at the following graph.

Figure: Causal graph between ice cream sales and T-shirt sales (Ice Cream Sales → T-shirt Sales)

Ice cream sales and T-shirt sales are correlated, but common sense tells us that the relationship is not causal: selling more ice cream does not make people buy more T-shirts. In reality, both sales pick up during the second and third quarters because of another factor: the temperature increases.

Figure: Causal graph with temperature as the common cause (Temperature → Ice Cream Sales, Temperature → T-shirt Sales)

Temperature is called a confounding variable (or confounder) because it affects both variables but was not considered in the original scenario.
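The effect of a confounder can be sketched with a small simulation. Everything here is assumed for illustration: the temperature range, the coefficients, the noise levels, and the helper `residuals` are made up and are not estimates from real sales. In the simulation, each sales figure depends only on temperature and its own noise, yet the two end up strongly correlated. Removing a straight-line fit on temperature from both makes the correlation nearly vanish.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 5_000

# Hypothetical daily temperatures (degrees Celsius).
temperature = rng.uniform(5, 35, size=n)

# Each sales figure depends on temperature plus its own independent noise.
# Neither sales figure appears in the other's formula.
ice_creams = 50 * temperature + rng.normal(0, 200, size=n)
t_shirts = 100 * temperature + rng.normal(0, 400, size=n)

raw_corr = np.corrcoef(ice_creams, t_shirts)[0, 1]


def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


# Correlation between the parts of the sales that temperature does not explain.
adjusted_corr = np.corrcoef(
    residuals(ice_creams, temperature),
    residuals(t_shirts, temperature),
)[0, 1]

print(f"Correlation ignoring temperature:       {raw_corr:.3f}")
print(f"Correlation after removing temperature: {adjusted_corr:.3f}")
```

This residual-based adjustment is only one simple way to account for a known confounder, and it assumes the confounder acts linearly.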

Correlation#

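Correlation measures the strength of the linear relationship between two random variables, X and Y. The most common measure is the Pearson correlation coefficient, which is defined through expectations as the covariance of X and Y divided by the product of their standard deviations. It always lies between −1 and 1.

$$\rho_{X,Y} = \frac{\operatorname{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$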
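As a sanity check on the definition, the sketch below computes the Pearson coefficient by hand (covariance divided by the product of the standard deviations) and compares it with NumPy's built-in `np.corrcoef`. The function name `pearson` and the sample numbers are made up for illustration.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient computed directly from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()              # deviations from the mean of x
    dy = y - y.mean()              # deviations from the mean of y
    covariance = (dx * dy).mean()  # average product of the deviations
    return covariance / (x.std() * y.std())


# Made-up numbers, only to exercise the function.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(pearson(x, y))
print(np.corrcoef(x, y)[0, 1])
```

Both lines should print the same value up to floating-point rounding.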

Example of correlation#

For example, let’s look at the sales in a shop during the four quarters of a year. The quarterly sales of ice creams were 250, 2000, 1000, and 300, while the corresponding sales of T-shirts were 1500, 5000, 3000, and 1700. Now, let’s find the correlation coefficient between the two sales.

Please note that np.corrcoef outputs the correlation matrix, which holds the correlation coefficient for every pair of input variables. In this blog, we’re using two variables, so the matrix is 2×2. The following shows what each matrix element represents, with the matching NumPy index (which counts from zero) in parentheses.

  • Index 1,1 (cor[0, 0]) shows the correlation of the first variable with itself, and index 2,2 (cor[1, 1]) shows the correlation of the second variable with itself. These two values are equal to 1.

  • Index 1,2 (cor[0, 1]) shows the correlation of the first variable with the second, and index 2,1 (cor[1, 0]) shows the correlation of the second variable with the first. These are the values of interest to us. They’re both equal.

import numpy as np
ice_creams = [250, 2000, 1000, 300]
t_shirts = [1500, 5000, 3000, 1700]
cor = np.corrcoef(ice_creams, t_shirts)
print(cor)

  • Line 1: We import the NumPy library and give it the alias np. Remember that NumPy is a powerful library for numerical and mathematical operations in Python.

  • Lines 2–3: We define two lists, ice_creams and t_shirts, which contain the quantities of ice creams and T-shirts sold in each of the four quarters, respectively.

  • Line 4: This line calculates the correlation coefficient between the two data sets using NumPy’s corrcoef function. It takes ice_creams and t_shirts as input and returns a 2×2 correlation matrix.

  • Line 5: This prints the output matrix from the previous line.
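When only the single coefficient is needed rather than the whole matrix, it can be read from an off-diagonal entry. The following is a small sketch that reuses the same sales figures; the variable name `r` is just a choice made here.

```python
import numpy as np

ice_creams = [250, 2000, 1000, 300]
t_shirts = [1500, 5000, 3000, 1700]

cor = np.corrcoef(ice_creams, t_shirts)

# NumPy indexes from zero, so the off-diagonal entry is cor[0, 1]
# (cor[1, 0] holds the same value because the matrix is symmetric).
r = cor[0, 1]

print(f"Correlation coefficient: {r:.4f}")
```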

The off-diagonal value is very close to 1, which shows that the correlation between the two sales is very strong. The correlation coefficient can also show a negative or no correlation, and we’ll see examples of both shortly. First, execute the following code and look at the graph.

import matplotlib.pyplot as plt

quarters = [1, 2, 3, 4]
ice_creams = [250, 2000, 1000, 300]
t_shirts = [1500, 5000, 3000, 1700]

fig, ax = plt.subplots(figsize=(7, 3.5), dpi=300)
ax.plot(quarters, ice_creams, label='Ice creams')
ax.plot(quarters, t_shirts, label='T-shirts')
ax.set_xlabel('Quarters')
ax.set_ylabel('Sales')
ax.set_xticks(quarters)
ax.legend()  # tell the two lines apart
fig.savefig('output/to.png')  # the output directory must already exist
plt.close(fig)

The graph also shows the correlation: both lines rise in the second quarter and fall again toward the fourth.

Now, let’s look at another example. In the same store, we look at the respective sales of pullovers and T-shirts.

import numpy as np
pullovers = [2500, 100, 100, 3000]
t_shirts = [1500, 5000, 3000, 1700]
cor = np.corrcoef(pullovers, t_shirts)
print(cor)

As expected, the correlation is negative. This means that pullover and T-shirt sales move in opposite directions: quarters with more pullover sales tend to have fewer T-shirt sales. Note that this alone doesn’t tell us that one affects the other.

A convincing example of zero correlation needs a lot of data, so we resort to random numbers drawn from a distribution instead of actual sales data.

import numpy as np
rng = np.random.default_rng(0)  # fixed seed so the result is reproducible
rand_x = rng.standard_normal(10_000)  # independent standard normal samples
rand_y = rng.standard_normal(10_000)
cor = np.corrcoef(rand_x, rand_y)
print(cor)

As we can see, the correlation is close to zero. It should be, because we don’t expect numbers drawn independently at random to be correlated. Now that we’re clear about near-zero correlation, we can look at an actual example and see how it shows zero correlation. We look at the sales of wallets against T-shirts.

import numpy as np
wallets = [1, 3, 1, 5]
t_shirts = [1500, 5000, 3000, 1700]
cor = np.corrcoef(wallets, t_shirts)
print(cor)

As seen above, the correlation is zero. This means there’s no linear relationship between the sales of wallets and T-shirts in this data.
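With only four quarters of data, a coefficient near zero (or near one) can also arise by coincidence. The sketch below illustrates this under assumed conditions: it draws many pairs of independent standard normal samples and measures how much the coefficient varies for each sample size. The function name `spread_of_r`, the sample sizes, and the number of repeats are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(seed=2)


def spread_of_r(sample_size, repeats=2_000):
    """Standard deviation of r across many pairs of independent samples."""
    rs = [
        np.corrcoef(
            rng.standard_normal(sample_size),
            rng.standard_normal(sample_size),
        )[0, 1]
        for _ in range(repeats)
    ]
    return float(np.std(rs))


for n in (4, 10, 100, 1_000):
    print(f"n = {n:>5}: typical size of r ~ {spread_of_r(n):.3f}")
```

The spread should shrink as the sample size grows, which is why small samples call for extra caution when reading a correlation coefficient.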

The examples above make it clear how correlation and causation are related and how they differ. A causal relationship generally shows up as a correlation, but a correlation on its own does not mean that there’s causation.

Written By:
Zahid Irfan