Data Leakage Alert
Data leakage occurs when the model’s training data contains information about the outcome you’re trying to predict that would not be available at prediction time. Leakage produces models that score unrealistically well in training but perform worse on live data. It misleads the model into learning patterns that are unavailable in practice, which compromises its predictions.
Actions to Consider
Investigate the flagged variable for data leakage. If you confirm leakage, exclude the variable from the model.
Detection Methodology
Model Builder displays an alert when it detects an explanatory variable whose values always correspond to the same outcome, which means the variable predicts the target perfectly.
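The check described above can be approximated outside Model Builder. The following is a minimal sketch, not Model Builder's actual implementation: the function name `find_deterministic_columns`, the `default` target column, and the sample rows are all hypothetical. It flags any input column in which each distinct value is only ever seen with a single outcome.

```python
from collections import defaultdict


def find_deterministic_columns(rows, target):
    """Return input columns in which every value maps to exactly one outcome."""
    # column -> value -> set of target outcomes observed with that value
    outcomes = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for column, value in row.items():
            if column != target:
                outcomes[column][value].add(row[target])

    flagged = []
    for column, by_value in outcomes.items():
        # Assumption: skip columns where every row has a distinct value
        # (IDs, exact amounts), because they pass the check trivially.
        has_repeats = len(by_value) < len(rows)
        if has_repeats and all(len(seen) == 1 for seen in by_value.values()):
            flagged.append(column)
    return flagged


# Made-up sample rows; "default" is the hypothetical target column.
rows = [
    {"credit score": 700, "loan status": "paid", "default": 0},
    {"credit score": 700, "loan status": "charged off", "default": 1},
    {"credit score": 640, "loan status": "paid", "default": 0},
    {"credit score": 640, "loan status": "charged off", "default": 1},
]

flagged = find_deterministic_columns(rows, "default")
print(flagged)  # ['loan status']
```

A flagged column is only a candidate for leakage. A legitimate predictor can also pass this check on a small sample, so the variable still needs to be investigated before it is excluded.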
Example
A lender wants to predict if a customer will default on a loan. To achieve this, the lender builds a binary model with these input variables:
- age
- income
- credit score
- loan amount
- loan status
- payment history
- loan final payment date
After training, Model Builder displays an alert because it detected data leakage. The values for the “loan status”, “payment history”, and “loan final payment date” variables are determined only after the loan is approved and partially repaid. Including such information leaks future data that wouldn’t realistically be available when the model makes predictions. To resolve the issue, consider these actions:
- Exclude the “loan status”, “payment history”, and “loan final payment date” input variables from the training set.
- Use only variables known at the time of loan approval, such as “age”, “income”, “credit score”, and “loan amount”.
- Retrain the model with an updated dataset.
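As an illustration of the first two actions, the leaking variables can be removed from the dataset before it is used for retraining. This is a minimal sketch that assumes the training data is held as a list of dictionaries. The `drop_columns` helper and the sample values are hypothetical, and the retraining step is not shown because it depends on how the dataset is loaded into Model Builder.

```python
# Variables from the example that are only known after loan approval.
LEAKY_COLUMNS = {"loan status", "payment history", "loan final payment date"}


def drop_columns(rows, excluded):
    """Return a copy of each row without the excluded columns."""
    return [
        {name: value for name, value in row.items() if name not in excluded}
        for row in rows
    ]


# Made-up sample record using the input variables from the example.
training_rows = [
    {
        "age": 41,
        "income": 58000,
        "credit score": 690,
        "loan amount": 12000,
        "loan status": "paid",
        "payment history": "on time",
        "loan final payment date": "2023-06-01",
    },
]

cleaned_rows = drop_columns(training_rows, LEAKY_COLUMNS)
print(sorted(cleaned_rows[0]))  # ['age', 'credit score', 'income', 'loan amount']
```

The original rows are left unchanged, so the full dataset remains available if a variable turns out not to be leaking.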

