Humans are curious and therefore tend to observe. Most scientific discoveries were the result of curiosity coupled with observation. The next step was to relate factors to one another. It must have taken an immense amount of ingenuity for the Greeks to connect the appearance of tides with the moon’s phases.
Causation is simply saying that one event causes another event. Causation or causality can be expressed in a deterministic or probabilistic context. In the deterministic context, a cause always leads to its effect. In the probabilistic context, a cause increases the likelihood or probability of the effect. Of course, this means that there’s a chance (no matter how low) that the effect might not occur, but it would be unlikely compared to the likelihood of it occurring, depending on the value of the probability. We’ll look at the probabilistic context. Mathematically, writing C for the cause and E for the effect, it can be expressed as P(E | C) > P(E | ¬C): the effect is more probable when the cause occurs than when it does not.
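A minimal sketch of this comparison, using made-up observational records (both the data and the 0/1 event encoding here are illustrative assumptions, not real measurements):

```python
import numpy as np

# Hypothetical observations: each row is (cause_occurred, effect_occurred).
data = np.array([
    [1, 1], [1, 1], [1, 1], [1, 0],   # cause present: effect usually follows
    [0, 0], [0, 0], [0, 1], [0, 0],   # cause absent: effect is rare
])

cause = data[:, 0] == 1
effect = data[:, 1] == 1

p_effect_given_cause = effect[cause].mean()        # estimates P(E | C)
p_effect_given_no_cause = effect[~cause].mean()    # estimates P(E | not C)

print(p_effect_given_cause, p_effect_given_no_cause)  # 0.75 vs. 0.25
```

In this toy data set, the effect is three times more likely when the cause is present, which is the probabilistic signature of causation described above.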
An alternate viewpoint, expressed by the polymath Judea Pearl, is the idea of the counterfactual: had the cause not occurred, the effect would not have occurred. In this framing, we compare the probability of the effect in the world where the cause occurs against the counterfactual world where it does not.
So essentially, this means that the probability of the effect should be higher when the cause is present than when the cause does not occur. It might be hard to wrap our heads around, but it simply ties the cause to the effect very strongly.
Let’s look at an example of an hourly job: the more hours we put in, the more we’re paid. So more hours cause the salary to increase.
This is called a causal graph, representing that hours and salary have a causal relationship. We can also say that hours and salary are correlated, because a correlation between the two can be observed in the data.
import numpy as np

hours = [1, 2, 4, 8]
salary = [20, 40, 80, 160]

cor = np.corrcoef(hours, salary)
print(cor)
As we can see, the correlation is nonzero and positive, so the two are correlated. The fact that working more hours causes the salary to increase also allows us to connect them as cause and effect. Can we say the same about the previous example of ice cream and T-shirt sales? Let’s look at the following graph.
Common sense dictates that ice cream sales and T-shirt sales are correlated, but the relationship isn’t causal. Higher ice cream sales do not make T-shirt sales increase. The reality is that during the second and third quarters, both sales pick up due to another factor: the temperature increases.
Temperature is called a confounding variable because it affects both variables but was not considered in the scenario.
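A small simulation can make the confounder visible. Here, temperature drives both sales in a toy model (the coefficients and noise levels are made-up assumptions, not real sales data); the two sales end up strongly correlated even though neither causes the other:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: temperature is the common cause (confounder) of both sales.
temperature = rng.uniform(10, 35, size=200)               # degrees Celsius
ice_creams = 50 * temperature + rng.normal(0, 100, 200)   # made-up coefficients
t_shirts = 80 * temperature + rng.normal(0, 150, 200)

# Strong correlation appears, yet neither variable causes the other.
print(np.corrcoef(ice_creams, t_shirts)[0, 1])
```

The coefficient comes out high because both series track temperature, which is exactly how a confounder produces correlation without causation.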
Correlation is a measure of the link between two random variables, X and Y. Mathematically, the Pearson correlation coefficient is defined in terms of the covariance and standard deviations of X and Y: ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y), where Cov(X, Y) = E[(X − E[X])(Y − E[Y])] and σ_X, σ_Y are the standard deviations of X and Y.
For example, let’s look at the sales in a shop during different quarters of a year. The quarterly sales of ice creams were 250, 2000, 1000, and 300 during the four quarters, while the corresponding sales of T-shirts were 1500, 5000, 3000, and 1700. Now, let’s find the correlation coefficient between the two sales.
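As a sanity check, the coefficient can be computed directly from the definition (covariance divided by the product of the standard deviations) and compared against NumPy’s built-in function:

```python
import numpy as np

# Quarterly sales from the example
x = np.array([250, 2000, 1000, 300])      # ice creams
y = np.array([1500, 5000, 3000, 1700])    # T-shirts

# Pearson correlation straight from the definition
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov_xy / (x.std() * y.std())

print(rho)                        # manual computation
print(np.corrcoef(x, y)[0, 1])    # NumPy's built-in, same value
```

Both computations agree, which confirms that corrcoef implements the same Pearson formula.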
Please note that np.corrcoef will output the correlation matrix, which shows the correlation coefficients of all possible pairings. In this blog, we’re using two variables, so the following shows what each matrix element represents.

Index 1,1 shows the correlation of the first variable with itself.
Index 2,2 shows the correlation of the second variable with itself.
These two values are equal to 1.
Index 1,2 shows the correlation of the first variable with the second.
Index 2,1 shows the correlation of the second variable with the first.
These are the values of interest to us. They’re both equal.
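Note that NumPy arrays are 0-indexed, so in code these four entries are read as follows (a small sketch using the same sales figures):

```python
import numpy as np

ice_creams = [250, 2000, 1000, 300]
t_shirts = [1500, 5000, 3000, 1700]

cor = np.corrcoef(ice_creams, t_shirts)

print(cor[0, 0], cor[1, 1])  # self-correlations, both 1
print(cor[0, 1], cor[1, 0])  # cross-correlations, equal by symmetry
```

The off-diagonal entries are the ones we care about; either of them is the correlation coefficient between the two sales.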
import numpy as np

ice_creams = [250, 2000, 1000, 300]
t_shirts = [1500, 5000, 3000, 1700]

cor = np.corrcoef(ice_creams, t_shirts)
print(cor)
Line 1: We import the NumPy library and give it the alias np. Remember that NumPy is a powerful library for numerical and mathematical operations in Python.

Lines 3–4: We define two lists, ice_creams and t_shirts, which contain numerical values representing quantities of ice creams sold and quantities of T-shirts sold on different occasions, respectively.

Line 6: This line calculates the correlation coefficient between the two data sets using NumPy’s corrcoef function. It takes ice_creams and t_shirts as input and returns a 2×2 correlation matrix.

Line 7: This prints the output matrix from the previous line.
The high value shows that the correlation between the two sales is very strong. The correlation coefficient can also show a negative correlation or no correlation. We’ll focus on the case where there’s a correlation. Now execute the following code and look at the graph.
import numpy as np
import matplotlib.pyplot as plt

quarters = [1, 2, 3, 4]
ice_creams = [250, 2000, 1000, 300]
t_shirts = [1500, 5000, 3000, 1700]

fig, axe = plt.subplots(figsize=(7, 3.5), dpi=300)
plt.xlabel('Quarters')
plt.ylabel('Sales')
plt.xticks(range(1, 5))
axe.plot(quarters, ice_creams)
axe.plot(quarters, t_shirts)
fig.savefig('output/to.png')
plt.close(fig)
The graph also shows the correlation: both sales rise and fall together across the quarters.
Now, let’s look at another example. In the same store, we look at the respective sales of pullovers and T-shirts.
import numpy as np

pullovers = [2500, 100, 100, 3000]
t_shirts = [1500, 5000, 3000, 1700]

cor = np.corrcoef(pullovers, t_shirts)
print(cor)
As expected, the correlation is negative. This means that pullover and T-shirt sales move in opposite directions: more pullover sales mean fewer T-shirt sales.
An example of zero correlation might need a lot of data, so we would resort to random variables from distributions instead of actual sales data.
import numpy as np

rand_x = np.random.randn(10)
rand_y = np.random.randn(10)

cor = np.corrcoef(rand_x, rand_y)
print(cor)
As we can see, the correlation is close to zero. That’s what we should expect, because numbers drawn independently at random from two distributions have no reason to be correlated. Now that we’re clear about near-zero correlation, we can look at an actual example and see how it shows almost zero correlation. We look at the sales of wallets against T-shirts.
import numpy as np

wallets = [1, 3, 1, 5]
t_shirts = [1500, 5000, 3000, 1700]

cor = np.corrcoef(wallets, t_shirts)
print(cor)
As seen above, the correlation coefficient is zero, which means there’s no linear correlation between the sales of wallets and T-shirts.
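One caveat on the random-draw example above: with only 10 samples, the estimated coefficient is noisy and can land noticeably away from zero purely by chance. A quick sketch (using a seeded generator so the run is reproducible) shows the estimate settling toward zero as the sample grows:

```python
import numpy as np

rng = np.random.default_rng(42)

# Independent draws should be uncorrelated; larger samples estimate this better.
for n in (10, 100, 10_000):
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    print(n, np.corrcoef(x, y)[0, 1])
```

The small-sample estimates wander, while the 10,000-sample estimate sits very close to zero, which is why a single small sample of random numbers shouldn’t be over-interpreted.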
Now that we have a clear concept of correlation, let’s look at another important concept.
The above examples make it clear how correlation and causation are similar to and different from each other. Causation implies correlation, but correlation does not imply causation.