Multicollinearity Alert
Multicollinearity means that two or more input variables in a dataset are highly correlated (for example, “city” and “postal code”). Because these variables carry largely the same information and can have a duplicate impact on the outcome, high collinearity can lead to overfitting: the model performs well on training data but could perform poorly on data that it hasn't been exposed to yet.
Actions to Consider
To improve results, keep just one variable from each group of highly correlated variables. Use the most descriptive field (for example, “city”) to make insights more easily interpretable.
Detection Methodology
Model Builder displays a data alert when the Cramér’s V algorithm, which tests for multicollinearity, returns a value of 0.5 or higher for two variables.
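To make the threshold concrete, here is a minimal sketch of the textbook Cramér’s V statistic for two categorical variables, written in Python. It is an illustration only: the function name `cramers_v`, the constant `ALERT_THRESHOLD`, and the sample values are assumptions, and Model Builder's own implementation (for example, any bias correction or binning of numeric variables) may differ.

```python
from collections import Counter
from math import sqrt

ALERT_THRESHOLD = 0.5  # alert threshold stated in this article


def cramers_v(xs, ys):
    """Textbook Cramér's V (no bias correction) for two categorical sequences."""
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of rows")
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))  # observed counts per (x, y) pair
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    k = min(len(x_counts), len(y_counts))
    if n == 0 or k < 2:
        # The statistic is undefined for a constant column; this sketch returns 0.
        return 0.0
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n  # count expected if the variables were independent
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


# Made-up rows for illustration: each city maps to exactly one postal code.
city = ["Austin", "Austin", "Boston", "Boston", "Denver", "Denver"]
postal_code = ["73301", "73301", "02108", "02108", "80202", "80202"]

v = cramers_v(city, postal_code)
print(round(v, 3), v >= ALERT_THRESHOLD)
```

In this sample, each city determines its postal code, so the statistic is at its maximum of 1 and the pair would exceed the 0.5 threshold. Two unrelated variables would score close to 0.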
Example
A real estate agency wants to predict house prices for new listings. To achieve this, the agency builds a regression model with these input variables:
- house size (sq. ft)
- number of bedrooms
- number of bathrooms
- age of house
- neighborhood median income
- renovation status
After model training, an alert displays because the “house size,” “number of bedrooms,” and “number of bathrooms” variables are highly correlated. As a result, the model can’t reliably separate the importance of each variable, leading to unreliable coefficients. To resolve the issue, consider these actions:
- Exclude one or more of the correlated variables to improve generalization on data that the model hasn’t been trained on. For example, include “house size” and exclude “number of bedrooms” from the dataset.
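- Use a technique that reduces redundancy, such as principal component analysis (PCA), which combines correlated variables into uncorrelated components, or lasso regression, which can shrink the coefficients of redundant variables to zero.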
- Retrain the model with an updated dataset.
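
The first action can be sketched in Python as follows. This is an illustration under stated assumptions, not Model Builder's method: it uses Pearson correlation (suited to numeric columns) rather than Cramér’s V, the 0.8 cutoff is an arbitrary choice for the sketch, the helper names `pearson` and `drop_correlated` are hypothetical, and the listing values are made up.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / sqrt(var_a * var_b)


def drop_correlated(columns, preference, threshold=0.8):
    """Walk the columns in order of preference and keep a column only if it is
    not highly correlated with any column that was already kept."""
    kept = []
    for name in preference:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up listings for illustration only.
columns = {
    "house_size": [850, 1200, 1500, 2100, 2600, 3200],
    "bedrooms": [1, 2, 3, 3, 4, 5],
    "bathrooms": [1, 1, 2, 2, 3, 4],
    "age": [30, 5, 42, 12, 25, 8],
}

# List the most descriptive field first so that it is the one retained.
kept = drop_correlated(columns, ["house_size", "bedrooms", "bathrooms", "age"])
print(kept)  # ['house_size', 'age']
```

With these sample values, “bedrooms” and “bathrooms” both track “house_size” closely and are dropped, while “age” is kept. The model would then be retrained on the reduced set of columns.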

