Normality, the Q-Q Plot and the Jarque-Bera Test

Learn two ways to check whether data is normally distributed: the Q-Q plot and the Jarque-Bera test.

We'll cover the following...

Motivation
The Q-Q plot
- Q-Q plot in statsmodels
The Jarque-Bera test
- The Jarque-Bera test on statsmodels

Motivation

Normality is a common assumption in linear models. The white noise component of many models, such as those in the ARIMA family, is generally assumed to be normally distributed. When this assumption holds, parameter estimates are well-behaved: with a large enough sample, we can be confident that they converge to the actual parameters of the data-generating process, and the usual confidence intervals and hypothesis tests built on them are reliable.

Since normality is so important, it is natural to check for it in our data. We can do this using visual methods or with statistical tests. Let’s explore both options.

The Q-Q plot

In theory, we could use any visual method to assess whether the distribution of a series follows a Gaussian distribution or not: We just need to overlay what it would ideally look like and what it actually looks like. However, some types of charts are better than others for this purpose. One of the best charts to compare the distribution of a series against the normal distribution is the Q-Q plot.

The Q-Q plot (quantile-quantile plot) is a scatterplot that makes the tails of our data easy to inspect. On one axis (usually the x-axis), we plot the theoretical quantiles of the distribution that we think our data follows. On the other axis, we plot the actual quantiles of our data. If our data is, in fact, distributed as we assumed, and both sets of quantiles are on the same scale, the points should roughly form a 45° line. The extreme quantiles (the tails) are usually the ones that deviate from the 45° line, so they tell us whether the theoretical and empirical distributions are far apart.
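
To make the construction concrete, here is a minimal sketch of how the two sets of coordinates behind a Q-Q plot could be computed by hand. It uses only the Python standard library rather than statsmodels, the sample is synthetic, and the seed, the sample size, and the plotting positions (i - 0.5) / n are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # arbitrary seed so the sketch is reproducible
sample = [random.gauss(0.0, 1.0) for _ in range(500)]  # synthetic, truly normal data

# Standardize the sample so that the reference line is the 45° line y = x.
mean = statistics.fmean(sample)
sd = statistics.stdev(sample)
empirical = sorted((x - mean) / sd for x in sample)  # y-axis: empirical quantiles

# x-axis: standard normal quantiles at the plotting positions (i - 0.5) / n.
n = len(empirical)
std_normal = statistics.NormalDist()
theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Vertical distance of each point from the 45° line.
gaps = [abs(e - t) for e, t in zip(empirical, theoretical)]
central = gaps[n // 10 : n - n // 10]  # drop the lowest and highest 10% of points

print(f"largest gap in the central 80%: {max(central):.3f}")
print(f"largest gap overall:            {max(gaps):.3f}")
```

Plotting `theoretical` against `empirical` as a scatterplot would give the Q-Q plot itself. Because this sample really is normal, the points should stay close to the line, with the largest gaps, if any, appearing in the tails.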

Think of the quantiles in the Q-Q plot as the values that divide a distribution into intervals of equal probability. For instance, the median is the quantile that divides the sample into two halves. We typically use percentiles in a Q-Q plot, but any type of quantile would do. To interpret the Q-Q plot, consider each point a one-to-one comparison between the theoretical and the empirical quantile. For instance, imagine that the 90th percentile of our sample is the value 7, whereas, if our data followed a normal distribution, the 90th percentile would be the value 5. The point (5, 7) in the plot then lies above the 45° line, which suggests that the right tail of our empirical distribution is longer than that of the normal distribution.
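
The point-by-point reading described above can be sketched in a few lines. The example below is an illustration under stated assumptions, not data from this lesson: it draws a hypothetical right-skewed sample from an exponential distribution, fits a normal distribution with the same mean and standard deviation, and compares a few percentiles of the two.

```python
import random
import statistics

random.seed(1)  # arbitrary seed so the sketch is reproducible
# Hypothetical right-skewed sample (exponential draws), used only for illustration.
sample = [random.expovariate(1.0) for _ in range(5000)]

# Normal distribution with the same mean and standard deviation as the sample.
fitted_normal = statistics.NormalDist(statistics.fmean(sample), statistics.stdev(sample))

# statistics.quantiles(..., n=100) returns the 99 percentile cut points,
# so index p - 1 holds the p-th percentile of the sample.
percentiles = statistics.quantiles(sample, n=100)

points = {}
for p in (50, 95, 99):
    theoretical = fitted_normal.inv_cdf(p / 100)  # x-coordinate of the point
    empirical = percentiles[p - 1]                # y-coordinate of the point
    points[p] = (theoretical, empirical)
    side = "above" if empirical > theoretical else "below"
    print(f"percentile {p}: ({theoretical:.2f}, {empirical:.2f}) lies {side} the 45° line")
```

For a sample like this one, the point for the 99th percentile should land above the line, which is the same pattern as the point (5, 7) in the text: the right tail of the data stretches further than a normal distribution would allow.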
