Splitting Time Series Data

Explore how to correctly split time series data for training, validation, and testing without causing data leakage. Understand the limitations of random splitting, learn how to use cut-off points, and apply time series-specific cross-validation techniques with scikit-learn's TimeSeriesSplit to effectively evaluate forecasting models.

We'll cover the following...

Motivation
Sequential split with cut-off points
Time series cross-validation
- Time series split with the temperatures data

Motivation

When we develop machine learning models, we split our data into train, validation, and test sets to evaluate how well the models generalize to unseen data. Most techniques for dividing the data into these sets have one thing in common: the split is random. In other words, each data point is randomly assigned to one of the three sets. With time series, we cannot follow this standard practice, because the observations are ordered in time and that order carries information.

Time series forecasting rests on the principle that the future will resemble the past. A random split breaks this principle: we could end up training a model on data recorded after the data we test it on. This is a form of data leakage. Put differently, we might be using tomorrow’s temperature to predict yesterday’s temperature, which is impossible in real life. Our model would never encounter this situation in production, so the evaluation would overstate how well it performs.
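To make the leakage concrete, here is a minimal sketch using only the standard library. The series is hypothetical (the integers 0 to 99 stand in for consecutive days), and the 80/20 ratio and the seed are arbitrary choices for illustration. It shuffles the days, splits them randomly, and counts how many training days fall after the earliest test day.

```python
import random

# Hypothetical series: index i stands for day i, in chronological order.
days = list(range(100))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
shuffled = days[:]
rng.shuffle(shuffled)

train_days = shuffled[:80]  # random 80% of the days for training
test_days = shuffled[80:]   # remaining 20% for testing

# Training days that occur after the earliest test day: the model would
# have seen "the future" relative to some of the points it is tested on.
first_test_day = min(test_days)
leaked = [d for d in train_days if d > first_test_day]

print(f"{len(leaked)} of {len(train_days)} training days come after the first test day")
```

Any training day counted here is information from the future relative to at least one test point, which is exactly what a production model never has access to.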

Sequential split with cut-off points

We can avoid the dangers of data leakage and still apply a rigorous split strategy. The easiest way to split the data is to select a cut-off point: observations before the cut-off go to the train set, and observations after it are held out for testing. The same logic applies if we also want a validation set, this time with two cut-off points, as shown in the diagram below.
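In code, a split with two cut-off points might look like the following sketch. The function name `split_by_cutoffs`, the example dates, and the ten-day temperature series are all hypothetical and chosen only for illustration. The convention that each cut-off date belongs to the later set is also an assumption.

```python
from datetime import date, timedelta


def split_by_cutoffs(rows, val_start, test_start):
    """Split (timestamp, value) rows into train/validation/test sets.

    Rows before val_start go to train, rows from val_start up to (but not
    including) test_start go to validation, and the rest go to test.
    """
    rows = sorted(rows, key=lambda r: r[0])  # ensure chronological order
    train = [r for r in rows if r[0] < val_start]
    val = [r for r in rows if val_start <= r[0] < test_start]
    test = [r for r in rows if r[0] >= test_start]
    return train, val, test


# Hypothetical daily temperature series: 10 days starting 2024-01-01.
start = date(2024, 1, 1)
series = [(start + timedelta(days=i), 10.0 + i) for i in range(10)]

train, val, test = split_by_cutoffs(
    series,
    val_start=date(2024, 1, 7),   # first cut-off: validation begins here
    test_start=date(2024, 1, 9),  # second cut-off: testing begins here
)

print(len(train), len(val), len(test))
```

Because the sets are built from date comparisons rather than random draws, every training observation precedes every validation observation, which in turn precedes every test observation.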
