Information Leakage

In this lesson, learn how information leakage can produce machine learning models that overfit.

What is information leakage?

Information leakage occurs when a machine learning algorithm has access to information about future data during the training process. A model trained with leaked information appears to predict better during evaluation than it will on genuinely new data, so metrics (e.g., accuracy) overestimate the model’s usefulness.

Holdout test sets and cross-validation simulate future data by withholding part of the data (e.g., the validation folds in cross-validation) from model training. In practice, there are two common sources of information leakage:

  • Column-based feature engineering is performed on all data before splitting it into train / test datasets.

  • Column-based feature engineering is performed on the complete training dataset before cross-validation.

Essential ideas to remember to prevent information leakage are:

  • No data / information from the test dataset may be used to train the model.

  • With cross-validation, no data / information that will ultimately be included in a validation fold may be used to train the model for that fold.

  • Row-based feature engineering doesn’t cause information leakage, because each new value is computed only from other values in the same row.

  • Column-based feature engineering is a common source of information leakage, especially in the case of cross-validation.
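The difference between row-based and column-based features can be sketched in a few lines of Python. This is only an illustration: the rows are made-up values, and the feature names FamilySize and FareVsAvg are hypothetical choices, not features defined in this lesson.

```python
from statistics import mean

# Toy rows with made-up values; the column names mirror the Titanic data.
rows = [
    {"SibSp": 1, "Parch": 0, "Fare": 10.0},
    {"SibSp": 0, "Parch": 2, "Fare": 30.0},
    {"SibSp": 3, "Parch": 1, "Fare": 50.0},
]

# Row-based: each new value depends only on that row's own columns,
# so no other row (train, test, or validation) can influence it.
for row in rows:
    row["FamilySize"] = row["SibSp"] + row["Parch"] + 1

# Column-based: each new value depends on an aggregate over every row,
# so any row included in the aggregate influences all the others.
avg_fare = mean(row["Fare"] for row in rows)
for row in rows:
    row["FareVsAvg"] = row["Fare"] - avg_fare

family_sizes = [row["FamilySize"] for row in rows]
fare_vs_avg = [row["FareVsAvg"] for row in rows]
```

Removing any one row leaves the FamilySize of the remaining rows unchanged, but it changes avg_fare and therefore every FareVsAvg value. That dependence between rows is what allows a column-based feature to leak information.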

Avoiding information leakage from the test dataset is simple: always split the data into training and test sets before performing any feature engineering. However, avoiding information leakage when using cross-validation is more complicated.
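As a minimal sketch of the split-first rule, the Python below sets the test rows aside before computing a column statistic. The fares are made-up values, and the 70/30 split, the seed, and the mean-centering feature are assumptions chosen only for illustration.

```python
import random
from statistics import mean

# Made-up fares standing in for a real dataset.
fares = [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07]

# 1. Split first, so the test rows are set aside before any feature engineering.
rng = random.Random(42)  # fixed seed so the split is repeatable
indices = list(range(len(fares)))
rng.shuffle(indices)
test = [fares[i] for i in indices[:3]]
train = [fares[i] for i in indices[3:]]

# 2. Learn the column statistic from the training rows only.
train_mean = mean(train)

# 3. Apply that same training statistic to both sets.
train_feature = [fare - train_mean for fare in train]
test_feature = [fare - train_mean for fare in test]

# Leaky alternative, shown only for contrast: this mean also sees the test rows.
leaky_mean = mean(fares)
```

The test rows are transformed with train_mean rather than with a statistic of their own, which mirrors how genuinely new data would have to be handled once the model is in use.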

Cross-validation information leakage

To illustrate how information leakage can happen with cross-validation, take the example of the following feature engineering code using the Titanic training dataset. The code creates an AvgFareByPclass feature and then joins it to the original training dataset.
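The snippet below is not the lesson's original code. It is a rough Python sketch of the same pattern, assuming a handful of made-up rows that borrow only the Pclass and Fare column names from the Titanic data. The helper avg_fare_by_pclass and the three position-based folds are hypothetical, introduced here for illustration.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows; only the column names come from the Titanic data.
train = [
    {"Pclass": 1, "Fare": 80.0},
    {"Pclass": 1, "Fare": 60.0},
    {"Pclass": 3, "Fare": 8.0},
    {"Pclass": 3, "Fare": 10.0},
    {"Pclass": 1, "Fare": 100.0},
    {"Pclass": 3, "Fare": 6.0},
]

def avg_fare_by_pclass(rows):
    """Return {Pclass: mean Fare} computed from the given rows."""
    fares = defaultdict(list)
    for row in rows:
        fares[row["Pclass"]].append(row["Fare"])
    return {pclass: mean(values) for pclass, values in fares.items()}

# Leaky pattern: aggregate over the whole training set, then join back.
# Each row's feature now depends on rows that will later be in validation folds.
lookup = avg_fare_by_pclass(train)
leaky = [{**row, "AvgFareByPclass": lookup[row["Pclass"]]} for row in train]

# Fold-aware pattern: rebuild the lookup from the training folds only,
# then apply it to the held-out fold.
folds = [train[0:2], train[2:4], train[4:6]]
fold_safe = []
for i, held_out in enumerate(folds):
    training_rows = [row for j, fold in enumerate(folds) if j != i for row in fold]
    fold_lookup = avg_fare_by_pclass(training_rows)
    fold_safe.append([fold_lookup[row["Pclass"]] for row in held_out])
```

In the leaky version, every first-class row receives 80.0, a value that includes its own fare and the fares of rows in other folds. In the fold-aware version, the held-out rows of the first fold receive 100.0, computed without them. The sketch assumes every Pclass in a held-out fold also appears in the training folds; a real pipeline would need a fallback for classes that do not.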
