Humans are curious and therefore tend to observe. Most scientific discoveries were the result of curiosity coupled with observation. The next step was to relate factors to one another. It must have taken an immense amount of ingenuity for the Greeks to connect the appearance of tides with the moon’s phases.
Causation is simply saying that one event causes another event. Causation or causality can be expressed in a deterministic or probabilistic context. In the deterministic context, a cause always leads to its effect. In the probabilistic context, a cause increases the likelihood or probability of the effect. Of course, this means that there’s a chance (no matter how low) that the effect might not occur, but it would be unlikely compared to the likelihood of it occurring, depending on the value of the probability. We’ll look at the probabilistic context. Mathematically, writing C for the cause and E for the effect, it can be expressed as P(E | C) > P(E | ¬C): the effect is more probable when the cause occurs than when it does not.
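A minimal sketch of this comparison, using made-up observational records (both the data and the 0/1 event encoding here are illustrative assumptions, not real measurements):

```python
import numpy as np

# Hypothetical observations: each row is (cause_occurred, effect_occurred).
data = np.array([
    [1, 1], [1, 1], [1, 1], [1, 0],   # cause present: effect usually follows
    [0, 0], [0, 0], [0, 1], [0, 0],   # cause absent: effect is rare
])

cause = data[:, 0] == 1
effect = data[:, 1] == 1

p_effect_given_cause = effect[cause].mean()        # estimates P(E | C)
p_effect_given_no_cause = effect[~cause].mean()    # estimates P(E | not C)

print(p_effect_given_cause, p_effect_given_no_cause)  # 0.75 vs. 0.25
```

In this toy data set, the effect is three times more likely when the cause is present, which is the probabilistic signature of causation described above.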
An alternate viewpoint, expressed by the polymath Judea Pearl, is the idea of the counterfactual: had the cause not occurred, the effect would not have occurred. In this framing, we compare the probability of the effect in the world where the cause occurs against the counterfactual world where it does not.
So essentially, this means that the probability of the effect should be higher when the cause is present than when the cause does not occur. It might be hard to wrap our heads around, but it simply ties the cause to the effect very strongly.
Let’s look at an example of an hourly job: the more hours we put in, the more we’re paid. So more hours cause the salary to increase.
This is called a causal graph, representing that hours and salary have a causal relationship. We can also say that hours and salary are correlated, because a correlation between the two can be observed in the data.
import numpy as np

hours = [1, 2, 4, 8]
salary = [20, 40, 80, 160]

cor = np.corrcoef(hours, salary)
print(cor)
As we can see, the correlation is nonzero and positive, so the two are correlated. The fact that working more hours causes the salary to increase also allows us to connect them as cause and effect. Can we say the same about the previous example of ice cream and T-shirt sales? Let’s look at the following graph.
Common sense dictates that ice cream sales and T-shirt sales are correlated, but the relationship isn’t causal. Higher ice cream sales do not make T-shirt sales increase. The reality is that during the second and third quarters, both sales pick up due to another factor: the temperature increases.
Temperature is called a confounding variable because it affects both variables but was not considered in the scenario.
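A small simulation can make the confounder visible. Here, temperature drives both sales in a toy model (the coefficients and noise levels are made-up assumptions, not real sales data); the two sales end up strongly correlated even though neither causes the other:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: temperature is the common cause (confounder) of both sales.
temperature = rng.uniform(10, 35, size=200)               # degrees Celsius
ice_creams = 50 * temperature + rng.normal(0, 100, 200)   # made-up coefficients
t_shirts = 80 * temperature + rng.normal(0, 150, 200)

# Strong correlation appears, yet neither variable causes the other.
print(np.corrcoef(ice_creams, t_shirts)[0, 1])
```

The coefficient comes out high because both series track temperature, which is exactly how a confounder produces correlation without causation.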
Correlation is a measure of the link between two random variables, X and Y. Mathematically, the Pearson correlation coefficient is defined in terms of the covariance and standard deviations of X and Y: ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y), where Cov(X, Y) = E[(X − E[X])(Y − E[Y])] and σ_X, σ_Y are the standard deviations of X and Y.
For example, let’s look at the sales in a shop during different quarters of a year. The quarterly sales of ice creams were 250, 2000, 1000, and 300 during the four quarters, while the corresponding sales of T-shirts were 1500, 5000, 3000, and 1700. Now, let’s find the correlation coefficient between the two sales.
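As a sanity check, the coefficient can be computed directly from the definition (covariance divided by the product of the standard deviations) and compared against NumPy’s built-in function:

```python
import numpy as np

# Quarterly sales from the example
x = np.array([250, 2000, 1000, 300])      # ice creams
y = np.array([1500, 5000, 3000, 1700])    # T-shirts

# Pearson correlation straight from the definition
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov_xy / (x.std() * y.std())

print(rho)                        # manual computation
print(np.corrcoef(x, y)[0, 1])    # NumPy's built-in, same value
```

Both computations agree, which confirms that corrcoef implements the same Pearson formula.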
Please note that np.corrcoef will output the correlation matrix, which shows the correlation coefficients of all possible pairings. In this blog, we’re using two variables, so the following shows what each matrix element represents.

Index 1,1 shows the correlation of the first variable with itself.
Index 2,2 shows the correlation of the second variable with itself.
These two values are equal to 1.
Index 1,2 shows the correlation of the first variable with the second.
Index 2,1 shows the correlation of the second variable with the first.
These are the values of interest to us. They’re both equal.
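Note that NumPy arrays are 0-indexed, so in code these four entries are read as follows (a small sketch using the same sales figures):

```python
import numpy as np

ice_creams = [250, 2000, 1000, 300]
t_shirts = [1500, 5000, 3000, 1700]

cor = np.corrcoef(ice_creams, t_shirts)

print(cor[0, 0], cor[1, 1])  # self-correlations, both 1
print(cor[0, 1], cor[1, 0])  # cross-correlations, equal by symmetry
```

The off-diagonal entries are the ones we care about; either of them is the correlation coefficient between the two sales.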
import numpy as np

ice_creams = [250, 2000, 1000, 300]
t_shirts = [1500, 5000, 3000, 1700]

cor = np.corrcoef(ice_creams, t_shirts)
print(cor)
Line 1: We import the NumPy library and give it the alias np. Remember that NumPy is a powerful library for numerical and mathematical operations in Python.

Lines 3–4: We define two lists, ice_creams and t_shirts, which contain numerical values representing quantities of ice creams sold and quantities of T-shirts sold on different occasions, respectively.

Line 6: This line calculates the correlation coefficient between the two data sets using NumPy’s corrcoef function. It takes ice_creams and t_shirts as input and returns a 2×2 correlation matrix.

Line 7: This prints the output matrix from the previous line.
The high value shows that the correlation between the two sales is very strong. The correlation coefficient can also show a negative correlation or no correlation. We’ll focus on the case where there’s a correlation. Now execute the following code and look at the graph.
import numpy as np
import matplotlib.pyplot as plt

quarters = [1, 2, 3, 4]
ice_creams = [250, 2000, 1000, 300]
t_shirts = [1500, 5000, 3000, 1700]

fig, axe = plt.subplots(figsize=(7, 3.5), dpi=300)
plt.xlabel('Quarters')
plt.ylabel('Sales')
plt.xticks(range(1, 5))
axe.plot(quarters, ice_creams)
axe.plot(quarters, t_shirts)
fig.savefig('output/to.png')
plt.close(fig)
The graph also shows the correlation: both sales rise and fall together across the quarters.
Now, let’s look at another example. In the same store, we look at the respective sales of pullovers and T-shirts.
import numpy as np

pullovers = [2500, 100, 100, 3000]
t_shirts = [1500, 5000, 3000, 1700]

cor = np.corrcoef(pullovers, t_shirts)
print(cor)
As expected, the correlation is negative. This means that pullover and T-shirt sales move in opposite directions: more pullover sales mean fewer T-shirt sales.
An example of zero correlation might need a lot of data, so we would resort to random variables from distributions instead of actual sales data.
import numpy as np

rand_x = np.random.randn(10)
rand_y = np.random.randn(10)

cor = np.corrcoef(rand_x, rand_y)
print(cor)
As we can see, the correlation is close to zero. That’s what we should expect, because numbers drawn independently at random from two distributions have no reason to be correlated. Now that we’re clear about near-zero correlation, we can look at an actual example and see how it shows almost zero correlation. We look at the sales of wallets against T-shirts.
import numpy as np

wallets = [1, 3, 1, 5]
t_shirts = [1500, 5000, 3000, 1700]

cor = np.corrcoef(wallets, t_shirts)
print(cor)
As seen above, the correlation coefficient is zero, which means there’s no linear correlation between the sales of wallets and T-shirts.
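One caveat on the random-draw example above: with only 10 samples, the estimated coefficient is noisy and can land noticeably away from zero purely by chance. A quick sketch (using a seeded generator so the run is reproducible) shows the estimate settling toward zero as the sample grows:

```python
import numpy as np

rng = np.random.default_rng(42)

# Independent draws should be uncorrelated; larger samples estimate this better.
for n in (10, 100, 10_000):
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    print(n, np.corrcoef(x, y)[0, 1])
```

The small-sample estimates wander, while the 10,000-sample estimate sits very close to zero, which is why a single small sample of random numbers shouldn’t be over-interpreted.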
Now that we have a clear concept of correlation, let’s look at another important concept.
The above examples make it clear how correlation and causation are similar to and different from each other. Causation implies correlation, but correlation does not imply causation.