Address Common Data Issues
Handle common issues that you encounter during data prep for Einstein Discovery.
The following table describes some approaches for handling common data issues when preparing data for analysis.
| Issue | Approach | See Also |
|---|---|---|
| Extreme Values and Outliers | Einstein Discovery algorithms are sensitive to outliers because those values affect averages (means) and standard deviations in statistical significance calculations. If you find unusual values or outliers, confirm whether these data points are relevant and real. Often, unusual values are errors. If the extreme data points are accurate, predictable, and recurring, do not remove them unless those points are unimportant. You can reduce the influence of outliers by applying a transformation or by binning the numeric variable into a categorical variable. | |
| Missing Values | The most common repair for missing values is imputing a likely or expected value, using either the mean or a value computed from the variable’s distribution. Imputing the mean for every missing value can artificially reduce the standard deviation of the variable, so imputing from the distribution is generally more reliable. Another approach is to remove records with missing values, but don’t filter too aggressively. Sometimes the pattern is in the missing data. Also, if you delete too many records, your analysis becomes less representative of real-world conditions. Einstein Discovery orients its analysis around a particular outcome. If the value of the targeted outcome is missing from a particular row, then Einstein Discovery excludes that row from analysis. | |
| Incorrect Values | Predictive algorithms assume that the input information is correct. If only a few rows have incorrect values, either remove those rows from the analysis or replace the incorrect values with corrected or average values. If there are numerous inaccurate values, determine why the inaccuracies happened and whether it’s possible to repair them. Sometimes it’s better to remove a highly error-prone variable than to include it in the analysis. | |
| Standardize Categorical Values | For category values, ensure consistent category names. Remove spelling variations (such as plurals or abbreviations). Fix typos and other errors. Use labels that are meaningful, recognizable, and easy to interpret. | |
| Skewed Data | For continuous variables, review the distribution, central tendency, and spread of each variable, using statistical visualizations such as histograms. Check whether continuous variables are approximately normally distributed. If not, try to reduce skewness for optimal prediction. For categorical variables, use a frequency table, along with a bar chart, to understand the distribution of each category. If variable values are skewed, Einstein Discovery could produce biased models. To correct a skewed distribution, transform the variable using a function, such as the Box-Cox transformation, which applies only to positive values. After the transformation, the distribution of the variable is typically closer to normal, and the transformed variable often performs better for predictive modeling purposes. | |
| High-Cardinality Fields | High-cardinality fields are categorical attributes that contain many distinct values. Examples include names, ZIP codes, or account numbers. Although these variables can be highly informative, high-cardinality attributes are rarely used in predictive modeling. Including these attributes vastly increases the dimensionality of the dataset, which can make it difficult for most algorithms to build accurate prediction models. | |
| Binary Outcomes and Boolean Variables | If a variable has a binary outcome (only two possible values), and those values are represented by numbers (for example, 1 and 0), then convert those numeric values to text values (for example, "TRUE" and "FALSE" or "NOTCHURNED" and "CHURNED"). For solutions that implement the classification use case (binary outcomes), Einstein Discovery requires the outcome values to be represented as text values. For other variables in the dataset, converting these values to text can improve the interpretability of the charts and explanations in the resulting insights. | |
| Ordinal Variables | Ordinal variables are problematic for predictive models. Ordinal data consists of scores on an arbitrary scale that is designed to show ranking in a set of data points. For example, Low, Medium, and High are ordinal. When ordinal values are stored as numbers, predictive algorithms can assume that the variable is an interval or ratio variable and can therefore be misled by the scale. Ordinal variables are treated as categorical. If you have ordinal values, transform them into continuous or categorical values. | |
| Duplicate, Redundant, or Highly Correlated Variables | Minimize duplicate, redundant, or other highly correlated variables that carry the same information. Einstein Discovery algorithms perform better without these kinds of collinear variables. Collinearity occurs when two or more predictor variables are highly correlated. As a result, one can be linearly predicted from the others with a substantial degree of accuracy. To avoid collinearity, do not include multiple variables that are highly correlated or data that is from the same reporting hierarchy. For example, customers who live in the city of Tampa also live in the state of Florida. To identify high correlation between two continuous variables, review scatter plots. The pattern of a scatter plot indicates the relationship between variables. The relationship can be linear or nonlinear. To find the strength of a linear relationship, compute the correlation. Correlation varies between –1 and +1, and values close to either extreme indicate a strong linear relationship. | |
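
The correlation check described in the last row of the table can be sketched outside of Einstein Discovery before you load the data. The following Python example is a minimal illustration, not a Salesforce API. The function names, the 0.9 threshold, and the sample columns are assumptions made for this sketch. It computes the Pearson correlation for each pair of numeric columns and flags pairs that are likely to be collinear.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def collinear_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for each column pair with |r| >= threshold.

    The threshold of 0.9 is an illustrative assumption, not a product setting.
    """
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Hypothetical sample data: price_usd and price_eur carry the same information.
columns = {
    "price_usd": [10.0, 20.0, 30.0, 40.0, 50.0],
    "price_eur": [9.0, 18.0, 27.0, 36.0, 45.0],
    "units_sold": [3.0, 1.0, 4.0, 1.0, 5.0],
}
flagged = collinear_pairs(columns)
for a, b, r in flagged:
    print(f"{a} and {b} are highly correlated (r = {r:.2f})")
```

For each flagged pair, consider keeping only one of the two variables. Pearson correlation measures linear relationships only, so a scatter plot is still useful for spotting nonlinear patterns.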

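The trade-off described in the Missing Values row can be illustrated with a short Python sketch. The helper names and sample values here are hypothetical, and sampling from the observed values is only one possible way to impute from a distribution. The example shows that filling every gap with the mean shrinks the standard deviation, and it also mirrors the rule that rows without an outcome value are excluded.

```python
import random
import statistics


def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def impute_from_distribution(values, seed=0):
    """Replace each missing value with a random draw from the observed values.

    Sampling the empirical distribution is an assumption of this sketch.
    The seed makes the draws repeatable.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


def drop_missing_outcome(rows, outcome):
    """Keep only rows that have a value for the outcome column."""
    return [row for row in rows if row.get(outcome) is not None]


# Hypothetical column with two missing values.
ages = [20.0, None, 30.0, None, 40.0, 50.0]
observed_ages = [v for v in ages if v is not None]

mean_filled = impute_mean(ages)
sampled = impute_from_distribution(ages)

print("std dev of observed values:", statistics.pstdev(observed_ages))
print("std dev after mean imputation:", statistics.pstdev(mean_filled))

# Hypothetical rows: the second row has no outcome and is dropped.
rows = [
    {"age": 20.0, "churned": "CHURNED"},
    {"age": 30.0, "churned": None},
    {"age": 40.0, "churned": "NOTCHURNED"},
]
usable_rows = drop_missing_outcome(rows, "churned")
```

Compare the two printed standard deviations to see how much spread mean imputation removes from the variable.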

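To make the Skewed Data row more concrete, the following Python sketch measures skewness before and after a Box-Cox transformation. It is an illustration only. The function names and sample amounts are hypothetical, and the lambda value is fixed at 0 (the natural log case) rather than estimated from the data. Libraries such as SciPy can estimate lambda for you, but that step is not shown here.

```python
import math


def skewness(values):
    """Moment-based (population) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def box_cox(values, lam):
    """Box-Cox transform for a fixed lambda.

    Uses log(x) when lambda is 0, and (x ** lambda - 1) / lambda otherwise.
    The transformation is defined only for strictly positive values.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


# Hypothetical right-skewed sample, for example deal amounts.
amounts = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 8.0, 13.0, 40.0, 120.0]

before = skewness(amounts)
after = skewness(box_cox(amounts, 0))

print(f"skewness before: {before:.2f}")
print(f"skewness after:  {after:.2f}")
```

A skewness value closer to 0 indicates a more symmetric distribution. If the transformed variable is still strongly skewed, try a different lambda or a different transformation.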