Information Leakage
Explore the concept of information leakage in machine learning, focusing on how improper feature engineering can lead to overly optimistic estimates of model performance. Understand why splitting data before feature engineering is critical, the challenges posed by cross-validation, and how tools like tidymodels help prevent leakage and produce more reliable models.
What is information leakage?
Information leakage occurs when a machine learning algorithm has access to information about future data during the training process. Leakage produces models that appear to predict better than they actually will on new data, leading to metrics (e.g., accuracy) that overestimate a model's usefulness.
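To make the correct ordering concrete, here is a minimal sketch of a leak-free workflow in tidymodels. It assumes the ames housing data from the modeldata package and a simple normalization step; both are illustrative choices rather than part of this lesson. The key point is that the split happens first, and every preprocessing statistic is estimated from the training set alone.

```r
# A minimal sketch of a leak-free preprocessing workflow. The "ames" data
# from the modeldata package and the normalization step are illustrative
# assumptions, not requirements of this lesson.
library(tidymodels)

data(ames, package = "modeldata")

# Split BEFORE any feature engineering so the test set stays untouched
set.seed(123)
ames_split <- initial_split(ames, prop = 0.8)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# The recipe's normalization statistics (means, SDs) are estimated from
# the training set only
rec <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built, data = ames_train) %>%
  step_normalize(all_numeric_predictors())

prepped <- prep(rec, training = ames_train)

# Applying the prepared recipe to the test set reuses the training-set
# statistics, so no information flows from the test set into preprocessing
test_processed <- bake(prepped, new_data = ames_test)
```

If the normalization had instead been computed on the full dataset before splitting, the test set's means and standard deviations would have influenced the training features, which is exactly the kind of leakage discussed here.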
Holdout test sets and cross-validation simulate future data by withholding part of the data during model training (e.g., the validation fold in each round of cross-validation). In practice, there are two common sources of information leakage: