Data leakage in machine learning

Data leakage occurs when a model learns from data that shouldn’t be part of the training data set, or from data that wouldn’t be available in a real-life scenario at prediction time. It is most common when the data set already contains the information that you’re trying to predict.
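As a rough illustration of a data set that already contains the answer, the sketch below flags features whose values equal the target in every row. The function name `find_target_copies` and the columns `was_refunded` and `cancelled` are hypothetical, chosen only for this example. The check catches only the most blatant form of leakage; a feature can leak the target without being an exact copy of it.

```python
def find_target_copies(rows, target):
    """Return the names of features that equal the target in every row."""
    # Every column except the target is treated as a candidate feature.
    feature_names = [name for name in rows[0] if name != target]
    return [
        name
        for name in feature_names
        if all(row[name] == row[target] for row in rows)
    ]


# Hypothetical data: "was_refunded" is only known after a cancellation,
# so it would not be available when predicting "cancelled" in real life.
rows = [
    {"age": 34, "was_refunded": True, "cancelled": True},
    {"age": 51, "was_refunded": False, "cancelled": False},
    {"age": 27, "was_refunded": True, "cancelled": True},
]

leaky = find_target_copies(rows, target="cancelled")
print(leaky)
```

A feature flagged this way is worth inspecting by hand before training, since a model given such a column can score well on the test set and still fail in production.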

Time series forecasting

Data leakage is common in time series forecasting, that is, in problems where the data points follow a chronological order.

Depending on the nature of the data set, the target variable may have a very similar distribution in both the training and the test sets. However, this may not hold true in real-life scenarios. The model can learn how the probability of each target value changes with the moment in time. Thus, any time-related feature in the data set is a potential source of data leakage.

Therefore, a first approach to countering data leakage in time series forecasting is to remove all the features that relate to time.
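Removing time-related features could look something like the minimal sketch below. It assumes the data set is a list of dictionaries; the column names in `TIME_FEATURES` and the helper `drop_time_features` are hypothetical and would need to be adapted to the actual data set.

```python
from datetime import datetime

# Hypothetical names of time-related columns; adjust to your data set.
TIME_FEATURES = {"timestamp", "year", "month", "day_of_week"}


def drop_time_features(rows, time_features=TIME_FEATURES):
    """Return copies of the rows without any time-related columns."""
    return [
        {name: value for name, value in row.items() if name not in time_features}
        for row in rows
    ]


rows = [
    {"timestamp": datetime(2024, 1, 1), "month": 1, "price": 10.0, "units_sold": 3},
    {"timestamp": datetime(2024, 2, 1), "month": 2, "price": 12.5, "units_sold": 5},
]

cleaned = drop_time_features(rows)
print(cleaned)
# [{'price': 10.0, 'units_sold': 3}, {'price': 12.5, 'units_sold': 5}]
```

The helper builds new dictionaries instead of modifying the originals, so the time columns are still available for ordering or splitting the data even though the model never sees them.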
