Considerations for Preparing Data
Consider these guidelines when preparing your data for Einstein Discovery.
Optimize Your Data for Analysis
To get the best results from Einstein Discovery's comprehensive analytical capabilities, your dataset must contain the highest-quality data possible. To optimize data for machine learning, modeling, and AI:
- Your data must be accurate.
- Your data must be complete.
- Your data must sufficiently represent your real-world business operations in terms of quantity (volume) and variation (diversity).
- Your data must be relevant to the business outcome you want to analyze or predict.
In reality, data is imperfect, especially when you first begin working with it. You run across missing and incorrect values, spelling variations and other inconsistencies, outliers that may or may not be valid, duplicate and redundant information, and other issues that can obscure the operational reality that your data represents.
With data prep, you can remedy these faults in your data so that Einstein Discovery consumes the best version of the truth. The result? More relevant insights and higher-quality models.
Assess Your Source Data and Correct Issues at the Source
Assess the condition of your source data. As you collect the data into variables, profile the values. Look for data problems, such as extremes, outliers, missing values, incorrect values, skew, and high cardinality. Common data preparation issues are identified during the data-loading process.
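As a rough illustration of profiling, the following Python sketch summarizes a single column outside of CRM Analytics. The function name `profile_column`, the sample values, and the two-standard-deviation outlier rule are assumptions made for this example; they are not part of Einstein Discovery.

```python
from collections import Counter
from statistics import mean, stdev


def profile_column(values):
    """Summarize one column: missing values, cardinality, extremes, outliers."""
    present = [v for v in values if v is not None and v != ""]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        summary["min"], summary["max"] = min(present), max(present)
        if len(present) >= 3:
            mu, sigma = mean(present), stdev(present)
            # Assumed rule of thumb: flag values more than 2 standard
            # deviations from the mean. Flagged values still need review.
            summary["possible_outliers"] = [
                v for v in present if sigma > 0 and abs(v - mu) > 2 * sigma
            ]
    else:
        # For text columns, the most frequent values expose spelling variants.
        summary["top_values"] = Counter(present).most_common(3)
    return summary


# Hypothetical sample columns.
amounts = [100, 120, 110, None, 105, 115, 108, 112, 118, 102, 5000]
stages = ["Won", "won", "Lost", "", "Won"]

amount_profile = profile_column(amounts)
stage_profile = profile_column(stages)
```

Here the amount column reports one missing value and flags 5000 for review, and the stage column shows "Won" and "won" as separate values, which points to inconsistent spelling.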
We recommend that you address data quality issues as early as possible. You can repair them in CRM Analytics, in the source system, or in your data preparation process. If you are seeing errors from source applications, a best practice is to resolve the issue at the source system instead of during data preparation.
Consolidate Data from Multiple Sources
With CRM Analytics tools like Data Prep, you can pull data that is stored in a dimensional data warehouse or in a transactional database format. When your data is spread across multiple tables, use record identifiers or primary keys to join fields from those tables into a single, unified, flattened view. This view contains an outcome variable, along with input predictor variables, collected at a level of analytical granularity on which you can make actionable decisions.
For many outcome variables, data is captured at various business process steps in multiple data sources. For example, a sales process can have data in a CRM, an email marketing program, an Excel spreadsheet, and an accounting system. If that is the case, identify the fields in those systems that can link the different data sources together.
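The idea of a flattened view can be sketched in plain Python as a left join on a shared key. This is only a conceptual illustration, not how Data Prep performs joins; the function `flatten_sources`, the field names, and the sample records are all hypothetical.

```python
def flatten_sources(base_rows, key, *lookups):
    """Left-join each lookup table onto base_rows using a shared key field."""
    prepared = []
    for rows in lookups:
        index, fields = {}, []
        for row in rows:
            if row[key] in index:
                # A repeated key would fan out into multiple rows per record.
                raise ValueError(f"duplicate {key} {row[key]!r} in lookup table")
            index[row[key]] = row
            fields.extend(f for f in row if f != key and f not in fields)
        prepared.append((index, fields))

    flattened = []
    for base in base_rows:
        merged = dict(base)
        for index, fields in prepared:
            match = index.get(base[key], {})
            for field in fields:
                merged[field] = match.get(field)  # None when no match exists
        flattened.append(merged)
    return flattened


# Hypothetical extracts from three systems, linked by opportunity_id.
crm = [
    {"opportunity_id": "006A", "is_won": True, "amount": 12000},
    {"opportunity_id": "006B", "is_won": False, "amount": 8000},
]
email = [{"opportunity_id": "006A", "emails_opened": 7}]
accounting = [
    {"opportunity_id": "006A", "days_to_pay": 30},
    {"opportunity_id": "006B", "days_to_pay": 45},
]

flat_view = flatten_sources(crm, "opportunity_id", email, accounting)
```

Each output row keeps the outcome variable (`is_won` in this sketch) alongside the predictors gathered from the other systems. Records without a match get `None`, which surfaces as a missing value to handle during data prep.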
Ensure That Observations Are Independent
Einstein Discovery algorithms assume that each observation is independent and is not related to other observations. If relationships exist between observations, create a variable within the row of data to capture that behavior. For example, if the same Opportunity has multiple competitors, don’t prepare multiple rows of data with the same Opportunity ID. Instead, keep one row per Opportunity ID and add fields that indicate whether each of the top-10 competitors was present in the deal.
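The competitor example can be sketched in Python as follows. The function `competitor_flags`, the `competitor_` field prefix, and the sample data are assumptions for illustration, and the tracked list is shortened to two competitors to keep the example small.

```python
def competitor_flags(rows, tracked):
    """Collapse (opportunity, competitor) rows into one row per opportunity."""
    by_opportunity = {}
    for row in rows:
        # Start every opportunity with all tracked competitors set to False.
        flags = by_opportunity.setdefault(
            row["opportunity_id"],
            {f"competitor_{name}": False for name in tracked},
        )
        field = f"competitor_{row['competitor']}"
        if field in flags:  # Competitors outside the tracked list are ignored.
            flags[field] = True
    return [
        {"opportunity_id": opp_id, **flags}
        for opp_id, flags in by_opportunity.items()
    ]


# Hypothetical input with repeated Opportunity IDs.
competitor_rows = [
    {"opportunity_id": "006A", "competitor": "Acme"},
    {"opportunity_id": "006A", "competitor": "Globex"},
    {"opportunity_id": "006B", "competitor": "Acme"},
    {"opportunity_id": "006B", "competitor": "Initech"},
]

one_row_per_opportunity = competitor_flags(competitor_rows, ["Acme", "Globex"])
```

The four input rows become two observations, one per Opportunity ID, each with a true/false field per tracked competitor.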
Calculate Durations for Date Values
Dates can be rolled up to duration values and used as input for analysis. If a business process has multiple key dates, use the Data Manager to create multiple variables in which to store numeric durations (for example, Days between Lead to Last Contact and Days between Demo to Trial). Common date variable rollups include the earliest date and the most recent date. Time durations can also be represented in either absolute or relative form.
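For illustration only, here is how such duration fields and date rollups could be computed in Python before loading the data. The field names and dates are hypothetical, and the helper `days_between` is not a CRM Analytics function.

```python
from datetime import date


def days_between(start, end):
    """Return whole days from start to end, or None if either date is missing."""
    if start is None or end is None:
        return None
    return (end - start).days


# Hypothetical key dates for one record.
record = {
    "lead_created": date(2024, 1, 3),
    "contact_dates": [date(2024, 1, 10), date(2024, 2, 14), date(2024, 1, 25)],
    "demo_date": date(2024, 1, 20),
    "trial_start": date(2024, 2, 1),
}

# Roll up repeated dates to the earliest and the most recent value.
first_contact = min(record["contact_dates"])
last_contact = max(record["contact_dates"])

durations = {
    "days_lead_to_last_contact": days_between(record["lead_created"], last_contact),
    "days_demo_to_trial": days_between(record["demo_date"], record["trial_start"]),
}
```

Returning `None` for a missing date keeps the gap visible as a missing value instead of silently producing a misleading duration.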
Maximize Interpretability for Insights
When preparing your data, consider the downstream effects of your decisions. Keep in mind the importance of making charts and explanations easier for users to review and interpret. For example, for categories, keep the number of unique values low (too many categories create cluttered bar charts) and use consistent spelling.
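One possible way to act on this guidance is to normalize spelling variants and then fold rare categories into a catch-all value. The sketch below assumes a hand-maintained alias map, a hypothetical `tidy_categories` helper, and an "Other" bucket; choose the limit and the bucket name to suit your own data.

```python
from collections import Counter


def tidy_categories(values, aliases, max_categories, other="Other"):
    """Normalize spelling variants, then fold rare categories into `other`."""
    # Look up each value by its trimmed, lowercased form; keep it if unknown.
    cleaned = [aliases.get(v.strip().lower(), v.strip()) for v in values]
    keep = {name for name, _ in Counter(cleaned).most_common(max_categories)}
    return [v if v in keep else other for v in cleaned]


# Hypothetical lead source values with inconsistent spelling.
lead_sources = ["Web", "web ", "WEB", "Referral", "referal", "Trade Show", "Cold Call"]
aliases = {"web": "Web", "referral": "Referral", "referal": "Referral"}

tidy = tidy_categories(lead_sources, aliases, max_categories=2)
```

In this example, seven raw values with six distinct spellings reduce to three categories: the two most frequent values plus the catch-all bucket.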
Apply CRM Analytics Considerations to Your Data
Review the articles under Considerations Before Integrating Data into Datasets.

