Address Data Issues

Handle common issues that you can encounter when you prepare data for model building and after you train a model.

Issue	Approach
Extreme Values and Outliers	Predictive algorithms are sensitive to outliers because extreme values shift the averages (means) and standard deviations used in statistical significance calculations. If you find unusual values or outliers, confirm whether these data points are relevant and real. Unusual values are often errors. If the extreme data points are accurate, predictable, and recurring, keep them unless they’re unimportant to the analysis.
Incorrect Values	Predictive algorithms assume that the input data is correct. If only a few rows have incorrect values, either remove those rows from the analysis or replace the incorrect values with corrected or average values. If there are numerous inaccurate values, determine why the inaccuracies happened and whether it’s possible to repair them. Sometimes it’s better to remove a highly error-prone variable than to include it in the analysis.
Standardize Categorical Values	For category values, ensure consistent category names. Remove spelling variations (such as plurals or abbreviations). Fix typos and other errors. Use labels that are meaningful, recognizable, and easy to interpret.
High-Cardinality Fields	High-cardinality fields are categorical attributes that contain many distinct values. Examples include names, ZIP codes, or account numbers. Although these variables can be highly informative, high-cardinality attributes are rarely used in predictive modeling. Including these attributes vastly increases the dimensionality of the data, which can make it difficult for most algorithms to build accurate prediction models.
Ordinal Variables	Ordinal variables can be problematic for predictive models. Ordinal data is a type of categorical data that represents the ranking of items within a set, but the intervals between the values aren’t uniform or meaningful. For example, consider the ranking of sales representatives based on the amount sold: 1st, 2nd, 3rd, and so on. The order is clear, but the difference in amount between the 1st and 2nd ranked salespeople likely differs from the difference between the 2nd and 3rd ranked salespeople. Other examples include education levels, satisfaction ratings, and scale responses. If you have ordinal values, consider whether they can be handled as text (categorical) or numeric (continuous). If the ordinal data is numeric, consider grouping the values into meaningful bins. For example, ratings of 1 and 2 are bad, 3 is neutral, and 4 and 5 are good. If the ordinal data is text, each value is analyzed and modeled separately.
Duplicate, Redundant, or Highly Correlated Variables	Minimize duplicate, redundant, or other highly correlated variables that carry the same information. Predictive algorithms perform better without these kinds of collinear variables. Collinearity occurs when two or more predictor variables are highly correlated, so that one can be linearly predicted from the others with a substantial degree of accuracy. To avoid collinearity, don’t include multiple variables that are highly correlated or that come from the same reporting hierarchy. For example, customers who live in the city of Tampa also live in the state of Florida, so the state adds no information beyond the city. To identify high correlation between two continuous variables, review scatter plots. The pattern of a scatter plot indicates whether the relationship between the variables is linear or nonlinear. To measure the strength of a linear relationship, compute a correlation coefficient. Correlation varies between –1 and +1. Values near –1 or +1 indicate a strong linear relationship, and values near 0 indicate a weak one.
Missing Values	The most common repair for missing values is to impute a likely or expected value, using either the mean or a value computed from a distribution. Filling every gap with the mean can artificially reduce the standard deviation of the variable, so imputing from a distribution is more reliable. Another approach is to remove records with missing values. Don’t filter out missing values too aggressively. Sometimes the pattern is in the missing data, because the fact that a value is missing can itself be informative. Also, if you remove too many records, your data no longer reflects real-world conditions.
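
To illustrate the Extreme Values and Outliers guidance, here is a minimal Python sketch that flags candidate outliers for review by using the interquartile range (IQR) rule. The IQR rule, the 1.5 multiplier, the function name flag_outliers_iqr, and the sample deal sizes are assumptions chosen for this example and aren't part of the product. The sketch only flags values. You still decide whether a flagged value is an error or a real data point.

```python
import statistics


def flag_outliers_iqr(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule is one common heuristic, used here as an assumption.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical deal sizes with one suspicious entry.
deal_sizes = [120, 135, 128, 142, 131, 126, 139, 5000]
suspects = flag_outliers_iqr(deal_sizes)

# Compare the mean with and without the flagged values.
kept = [v for v in deal_sizes if v not in suspects]
print("flagged for review:", suspects)
print("mean with flagged values:", statistics.mean(deal_sizes))
print("mean without flagged values:", statistics.mean(kept))
```

Comparing the two means shows how far a single extreme value can move an average, which is why it's worth confirming whether the value is real before you train a model.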
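
To illustrate the Standardize Categorical Values guidance, here is a small Python sketch that maps spelling variations, plurals, and abbreviations to one consistent label. The alias table, the function name standardize_category, and the sample values are hypothetical. In practice, you build the alias table from the variations that you find in your own data.

```python
# Hypothetical alias table: maps known variants (lowercase) to one label.
ALIASES = {
    "acct": "Account",
    "account": "Account",
    "accounts": "Account",
    "opp": "Opportunity",
    "opportunity": "Opportunity",
    "opportunities": "Opportunity",
}


def standardize_category(value, aliases=ALIASES):
    """Trim whitespace, ignore case and trailing periods, then look up the label.

    Values with no known alias are returned trimmed but otherwise unchanged,
    so they can be reviewed instead of being silently rewritten.
    """
    cleaned = value.strip()
    key = cleaned.lower().rstrip(".")
    return aliases.get(key, cleaned)


raw = ["Account", "accounts", " Acct. ", "Opp", "Opportunities", "Lead"]
standardized = [standardize_category(v) for v in raw]
print(standardized)
```

Passing unknown values through unchanged is a deliberate choice in this sketch. It keeps typos visible so that you can add them to the alias table rather than guess at the intended category.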
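
The binning example in the Ordinal Variables guidance can be sketched in Python as follows. The function name bin_rating and the sample ratings are assumptions for illustration. The bin boundaries come from the example in the table: 1 and 2 are bad, 3 is neutral, and 4 and 5 are good.

```python
def bin_rating(rating):
    """Group a numeric 1-5 rating into a coarser, meaningful bin."""
    if rating in (1, 2):
        return "bad"
    if rating == 3:
        return "neutral"
    if rating in (4, 5):
        return "good"
    # Anything else is outside the expected scale and needs review.
    raise ValueError(f"rating outside the 1-5 scale: {rating!r}")


# Hypothetical survey responses.
ratings = [5, 3, 1, 4, 2]
binned = [bin_rating(r) for r in ratings]
print(binned)
```

Raising an error for an out-of-range rating, instead of assigning it to a default bin, surfaces incorrect values early.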
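
To illustrate the correlation step in the Duplicate, Redundant, or Highly Correlated Variables guidance, here is a minimal Python sketch of the Pearson correlation coefficient, which is one common way to compute a correlation between two continuous variables. The choice of Pearson, the function name pearson, and the sample data are assumptions for this example.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (spread_x * spread_y)


# Hypothetical variables: the second is an exact multiple of the first.
units_sold = [1, 2, 3, 4]
revenue = [2, 4, 6, 8]
print(pearson(units_sold, revenue))

# A clear but nonlinear (U-shaped) relationship.
x = [-2, -1, 0, 1, 2]
y = [4, 1, 0, 1, 4]
print(pearson(x, y))
```

The second pair shows why reviewing a scatter plot still matters. The variables are clearly related, but because the relationship isn't linear, the coefficient doesn't reflect it.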
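
To illustrate the Missing Values guidance, here is a Python sketch that compares two imputation approaches. The first fills every gap with the mean. The second draws each replacement at random from the observed values, which is one simple way to impute from a distribution. The function names, the fixed random seed, and the sample ages are assumptions for this example.

```python
import random
import statistics


def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


def impute_from_observed(values, seed=0):
    """Replace each missing value (None) with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [rng.choice(observed) if v is None else v for v in values]


# Hypothetical ages with three missing entries.
ages = [23, None, 35, 41, None, 29, 52, None, 38, 47]
observed = [v for v in ages if v is not None]

mean_filled = impute_mean(ages)
sampled = impute_from_observed(ages)

print("stdev of observed values:", statistics.stdev(observed))
print("stdev after mean imputation:", statistics.stdev(mean_filled))
print("stdev after sampling from observed values:", statistics.stdev(sampled))
```

Mean imputation adds values with zero deviation from the mean, so the standard deviation after mean imputation is always lower than the standard deviation of the observed values. Sampling from the observed values doesn't shrink the spread in this systematic way, although the result varies with the random draws.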


Salesforce Help | Article