Determine Data Requirements
Data preparation is a process of iterative refinement. As you dig deeper into your data, new clues emerge. Discoveries can cause you to reassess previous assumptions and adjust your data prep implementation accordingly.
Einstein Discovery Data Requirements
Einstein Discovery requires a CRM Analytics dataset with at least 3 columns: one outcome variable plus two explanatory or predictor variables. Einstein Discovery supports datasets with up to 50 variables, but solutions typically succeed with far fewer.
To build models, Einstein Discovery requires a CRM Analytics dataset with at least 400 observations that have a known outcome, that is, at least 400 rows in which the outcome variable has a value. Einstein Discovery supports datasets with up to 20 million observations. To learn more, see Einstein Discovery Capacities and Requirements.
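The limits above can be checked before loading data. The following is a minimal sketch in plain Python, not a Salesforce API: the function name `check_dataset_requirements` and the representation of a dataset as a list of dictionaries are assumptions made for illustration.

```python
from typing import Mapping, Optional, Sequence

# Limits quoted in the section above.
MIN_COLUMNS = 3            # one outcome plus two predictor variables
MAX_COLUMNS = 50
MIN_KNOWN_OUTCOMES = 400
MAX_ROWS = 20_000_000


def check_dataset_requirements(
    rows: Sequence[Mapping[str, Optional[object]]],
    outcome: str,
) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems: list[str] = []

    # Collect every column name that appears in any row.
    columns: set[str] = set()
    for row in rows:
        columns.update(row.keys())

    if outcome not in columns:
        problems.append(f"outcome column {outcome!r} is missing")
    if len(columns) < MIN_COLUMNS:
        problems.append(f"only {len(columns)} columns; need at least {MIN_COLUMNS}")
    if len(columns) > MAX_COLUMNS:
        problems.append(f"{len(columns)} columns; at most {MAX_COLUMNS} are supported")

    # A row counts as a known outcome only if the outcome field has a value.
    known = sum(1 for row in rows if row.get(outcome) not in (None, ""))
    if known < MIN_KNOWN_OUTCOMES:
        problems.append(f"only {known} rows have a known outcome; need {MIN_KNOWN_OUTCOMES}")
    if len(rows) > MAX_ROWS:
        problems.append(f"{len(rows)} rows; at most {MAX_ROWS} are supported")

    return problems
```

Calling `check_dataset_requirements(rows, "won")` on a hypothetical opportunity extract returns an empty list when the data meets the documented minimums and maximums.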
Consider Prior Record States
If you want to capture changes in data over a time period, determine whether your data source keeps only the current state of a record. Transactional application data sources (Salesforce, for example) contain only the most recent values for a record. Other data sources capture transactional data in a chronological log: each new version of the record is appended to the log, and previous versions are retained in earlier log entries. When the source keeps only the current state, getting a prior value requires storing a snapshot of the historical data or keeping the prior value in custom fields on the current record.
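As an illustration of the snapshot idea, here is a minimal sketch of an append-only history kept outside the source system. The helper names `take_snapshot` and `prior_value`, the `as_of` field, and the sample record are all hypothetical.

```python
from datetime import date
from typing import Any, Optional


def take_snapshot(history: list[dict], record: dict, as_of: date) -> None:
    """Append a dated copy of the record; earlier entries are never overwritten."""
    history.append({**record, "as_of": as_of})


def prior_value(history: list[dict], record_id: str, field: str) -> Optional[Any]:
    """Return the field value from the second-most-recent snapshot, if one exists."""
    versions = sorted(
        (entry for entry in history if entry["id"] == record_id),
        key=lambda entry: entry["as_of"],
    )
    return versions[-2][field] if len(versions) >= 2 else None


history: list[dict] = []
take_snapshot(history, {"id": "opp-1", "stage": "Quote Provided"}, date(2024, 3, 1))
take_snapshot(history, {"id": "opp-1", "stage": "Deal Closed"}, date(2024, 3, 15))
print(prior_value(history, "opp-1", "stage"))  # Quote Provided
```

The alternative mentioned above, a custom field on the current record, trades this full history for a single stored prior value.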
Determine the Appropriate Level of Granularity for the Insights You Want
What level of insight do you need to achieve the objective? For example, customer-level insights are of interest when looking at customer revenue. In the Data Manager, use grouping to adjust the granularity of the data. Choose a granularity that is actionable, understandable, and useful so that you can incorporate the results into your business process or application.
A common mistake is to over-aggregate data. Keep the desired outcome in mind, and use data collected in rows at that level of granularity. Data analyzed in Excel can be at a different level from what you want for Einstein Discovery. For example, to understand the effects of day of the week, provide data at the day level. You cannot predict a day-level outcome from an aggregated, monthly-level dataset.
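The day-of-week example can be made concrete with a small sketch. It uses plain Python rather than the Data Manager, and the function names and sample figures are invented for illustration. Day-level rows support a weekday comparison; once they are rolled up to months, that information is gone.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def average_by_weekday(daily_rows):
    """Average the value per weekday; requires one row per day."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, value in daily_rows:
        name = WEEKDAYS[day.weekday()]  # date.weekday() returns 0 for Monday
        sums[name] += value
        counts[name] += 1
    return {name: sums[name] / counts[name] for name in sums}


def total_by_month(daily_rows):
    """Roll the same rows up to one total per (year, month)."""
    totals = defaultdict(float)
    for day, value in daily_rows:
        totals[(day.year, day.month)] += value
    return dict(totals)


daily_rows = [
    (date(2024, 1, 1), 10.0),  # Monday
    (date(2024, 1, 2), 5.0),   # Tuesday
    (date(2024, 1, 8), 20.0),  # Monday
]
print(average_by_weekday(daily_rows))  # {'Monday': 15.0, 'Tuesday': 5.0}
print(total_by_month(daily_rows))      # {(2024, 1): 35.0}
```

The monthly total cannot be turned back into the weekday averages, which is why the dataset should be kept at the level of the outcome you want to predict.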
Determine Relevant Time Frames
Statistical analysis datasets can summarize a lifetime of values in a single row, with many columns that describe different points in time. Don’t collect a lifetime of fields if a specific window of time more accurately reflects the outcome variable you want to analyze and predict. Events closest to the outcome are usually stronger predictors than events that happened a long time ago. Choose a reasonable cut-off time to ensure that your data is recent enough to be relevant.
To use Einstein Discovery for predictions, your variables must reflect what was known at the point in time on which the prediction is based. For example, suppose that your objective is to decrease defaults on loans by not pre-approving loans that are likely to default. In this case, you capture variables such as the credit score at the time of loan application and earlier. If the person was late on two payments after loan origination, that information isn’t used in the pre-approval analysis because it occurred after the loan was approved.
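A point-in-time filter for the loan example might look like the following sketch. The event structure, field names, and values are hypothetical. The only rule it encodes is the one above: keep what was known on or before the date of the prediction.

```python
from datetime import date


def known_at_prediction_time(events, prediction_date):
    """Keep only events dated on or before the prediction date."""
    return [event for event in events if event["date"] <= prediction_date]


loan_events = [
    {"date": date(2023, 1, 10), "name": "credit_score", "value": 690},
    {"date": date(2023, 6, 1), "name": "credit_score_at_application", "value": 710},
    {"date": date(2023, 9, 1), "name": "late_payment", "value": 1},  # after origination
]
application_date = date(2023, 6, 1)

usable = known_at_prediction_time(loan_events, application_date)
print([event["name"] for event in usable])
# ['credit_score', 'credit_score_at_application']
```

The late payment is dropped because it happened after the application date, so it could not have informed a pre-approval decision.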
Decide How Much Data to Get
To build reliable predictive models, provide Einstein Discovery with as much data as possible so that the dataset resembles the real-world distributions of your variables. The number of records you need is not always easy to determine because it depends on the patterns found in your data. If you have more noise in your data, you need more data to overcome it. Noise in this context means unobserved relationships in the data that the input predictor variables do not capture. In general, more rows of data improve analysis accuracy. Columns with more possible values result in finer segmentation of the data, but they can require more rows of data for a statistically sound analysis. For example, 10,000 rows with a binary variable such as gender (either male or female) yield roughly 5,000 observations per value if the values are evenly distributed. But 10,000 rows with a variable indicating 50 states yield only about 200 observations per state.
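The arithmetic in this example assumes that values are evenly distributed, which real data rarely is. The sketch below shows both the even-split estimate and a count of the smallest actual group. The function names are illustrative only.

```python
from collections import Counter


def even_split(total_rows: int, distinct_values: int) -> int:
    """Rows per value if every value occurred equally often."""
    return total_rows // distinct_values


def smallest_group(values) -> int:
    """Size of the least-represented value in an actual column."""
    return min(Counter(values).values())


print(even_split(10_000, 2))   # 5000
print(even_split(10_000, 50))  # 200

# A skewed column: the rarest value has far fewer rows than the even split suggests.
states = ["CA"] * 9_000 + ["NY"] * 900 + ["WY"] * 100
print(smallest_group(states))  # 100
```

Checking the smallest group is a quick way to see whether a high-cardinality column leaves some segments with too few rows to analyze.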
Consider the Time Series
Data that changes over time must be reflected in your model and in the associated dataset. When time sequences (Lead Received > Quote Provided > Deal Closed) are important in predictions, collect data proportionally from those different time periods. The key principle is to provide data that reflects what actually happens in the real world, at the level of granularity of the outcome metric.
Think Proportionally
When collecting data, think about the balance of values for each variable in your raw data. For example, how many records are there for each vertical industry by time period? When extracting a subset of data, keep approximately the same proportion of values in your input dataset as in the raw data. If you provide disproportionately more records for one value (one vertical, in our example), you can unintentionally introduce bias into your analysis. With datasets of millions of rows, accidental bias is less likely.
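One way to keep proportions when extracting a subset is to sample the same fraction from each group. The following is a minimal sketch of that idea in plain Python. `proportional_sample` and the `vertical` field are hypothetical names, and the fixed seed is there only to make the example repeatable.

```python
import random
from collections import defaultdict


def proportional_sample(rows, key, fraction, seed=0):
    """Sample the same fraction of rows from every group defined by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    sample = []
    for members in groups.values():
        # Keep at least one row so that small groups are not dropped entirely;
        # this slightly over-represents very small groups.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


rows = [{"vertical": "tech"}] * 800 + [{"vertical": "retail"}] * 200
subset = proportional_sample(rows, "vertical", 0.1)
print(len(subset))  # 100
```

In this example the subset keeps the 80/20 split between the two verticals that the raw data has.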
Provide Known Outcomes in Your Data
Einstein Discovery orients its analysis around a particular outcome, typically a key performance indicator (KPI), such as sales margin or opportunity win. Providing data with known outcome values gives Einstein Discovery something to work with. For example, if you're targeting deal win rates, then your data must reflect deals that are definitively won or lost. If a deal is not complete (it is neither won nor lost), then Einstein Discovery omits the deal from analysis because the outcome value is missing.
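Before loading data, it can help to see how many rows actually carry a known outcome. The sketch below separates completed deals from open ones. The `outcome` field and its `Won` and `Lost` values are assumptions for illustration, not Salesforce field names.

```python
KNOWN_OUTCOMES = {"Won", "Lost"}  # hypothetical outcome values


def split_by_known_outcome(deals):
    """Separate deals with a definitive outcome from deals that are still open."""
    known = [deal for deal in deals if deal.get("outcome") in KNOWN_OUTCOMES]
    unknown = [deal for deal in deals if deal.get("outcome") not in KNOWN_OUTCOMES]
    return known, unknown


deals = [
    {"id": 1, "outcome": "Won"},
    {"id": 2, "outcome": "Lost"},
    {"id": 3, "outcome": None},  # still open, so the outcome is missing
]
known, unknown = split_by_known_outcome(deals)
print(len(known), len(unknown))  # 2 1
```

Only the rows in `known` count toward the number of observations with a known outcome.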
Consider Bias and Fairness in Your Data
Does the data that you want to use reflect business practices that are possibly biased or unfair? To help you produce ethical and accountable insights and models, Einstein Discovery detects proxy variables and disparate impacts in your dataset. You can also flag and filter sensitive variables (such as gender or age) to see where they show up in your insights. If Einstein Discovery exposes bias in your data, you can simply exclude the biased data from your story. To learn more, see Detect and Remove Bias from a Model. In addition, consider excluding biased data during data prep. For an overview of ethical and accountable AI, take the Responsible Creation of Artificial Intelligence Trailhead module.
Analyze Without Overfitting or Underfitting
Einstein Discovery figures out which variables and combinations of variables best explain the behavior of your chosen metric without overfitting or underfitting:
| Issue | Approach |
|---|---|
| Overfitting | Occurs when a predictive model uses too many variables. Overfitting captures the noise in your data in an overly complex, unreliable way, so that the model memorizes unnecessary details. When new data comes in, the model fails. To avoid overfitting, exclude variables that are too detailed. |
| Underfitting | Often the result of an excessively simple model. The statistical algorithm cannot capture the underlying patterns in the data. |
Thus, there is a delicate balance between being too specific, with too many variables, and too vague, with too few.

