Glossary for Predictive AI

Familiarize yourself with terminology that is commonly associated with predictive AI in AI Models (formerly Einstein Studio).

Actionable Variable: An actionable variable is an explanatory variable that people can control or influence, such as deciding which marketing campaign to use for a particular customer. Contrast these variables with explanatory variables that can’t be controlled, such as a customer’s street address or a person’s age. If a variable is designated as actionable, the model uses prescriptive analytics to suggest actions the user can take to improve the predicted outcome.
Actual Outcome: An actual outcome is the real-world value of an observation's outcome variable after the outcome has occurred. Model performance is calculated by comparing how closely predicted outcomes come to actual outcomes. An actual outcome is sometimes called an observed outcome.
Algorithm: An algorithm is what predictive modeling uses to create a model. Models created in AI Models use one of several algorithms: generalized linear model (GLM) is a regression-based algorithm, while gradient boosting machine (GBM) and XGBoost are decision tree-based machine learning algorithms.
Attribute: See variable.
Auth Header: An authentication header is part of the HTTP request sent by a client to the server when making API requests. It provides information that the server uses to verify the client identity and permissions to access the requested resource.
Binary Classification: The binary classification use case applies to business outcomes that are binary: categorical (text) fields with only two possible values, such as win-lose, pass-fail, public-private, retain-churn, and so on. These outcomes separate data into two distinct groups. For analysis purposes, Einstein converts the two values into Boolean true and false. Logistic regression is used to analyze binary outcomes. Binary classification is one of the main use cases that created models in AI Models support. Compare with Regression.
Cardinality: Cardinality is the number of distinct values (categories) in a categorical variable. Variables with high cardinality (too many distinct values) can result in complexity that’s difficult to interpret. Created models in AI Models support up to 100 categories per variable. You can optionally consolidate the remaining categories (categories with fewer than 25 observations) into a category called Other. Null values are put into a category called Unspecified.
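The consolidation described in the Cardinality entry can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the sample data, and the choice to leave Unspecified unmerged are hypothetical, and the sketch doesn't show how AI Models performs the step internally.

```python
from collections import Counter

MIN_OBSERVATIONS = 25  # categories with fewer observations are merged into "Other"

def consolidate_categories(values, min_observations=MIN_OBSERVATIONS):
    """Map nulls to "Unspecified" and rare categories to "Other" (illustrative only)."""
    labeled = ["Unspecified" if v is None else v for v in values]
    counts = Counter(labeled)
    return [
        v if v == "Unspecified" or counts[v] >= min_observations else "Other"
        for v in labeled
    ]

# Hypothetical lead-source column: two common values, one rare value, one null.
lead_sources = ["Web"] * 40 + ["Referral"] * 30 + ["Trade Show"] * 3 + [None]
consolidated = consolidate_categories(lead_sources)
cardinality = len(set(consolidated))  # distinct categories after consolidation
```

In this sample, the three Trade Show observations fall under the 25-observation minimum and are relabeled Other, and the single null becomes Unspecified.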
Categorical Variable: A categorical variable is a type of variable that represents qualitative values (categories). A model that represents a binary classification use case has a categorical variable as its outcome. See category.
Category: A category is a qualitative value that usually contains categorical (text) data, such as Product Category, Lead Status, and Case Subject. Categories are handy for grouping and filtering your data. Unlike numeric variables, you can’t perform math on categories.
Causation: Causation describes a cause-and-effect relationship between things. In predictive modeling, causality refers to the degree to which variables influence each other (or not), such as between variables and an outcome variable. Some variables can have an obvious, direct effect on each other (for example, how price and discount affect the sales margin). Other variables can have a weaker, less obvious effect (for example, how weather can affect on-time delivery). Many variables have no effect on each other: they are independent (for example, win-loss records of soccer teams and currency exchange rates). It's important to remember that you can’t presume a causal relationship between variables based simply on a statistical correlation between them. Instead, a correlation is a hint that the association between those variables warrants further investigation. Only with more exploration can you determine whether a causal link between them really exists and, if so, how significant that effect is.
Coefficient: A coefficient is a numeric value that represents the impact that a variable (or a pair of variables) has on the outcome variable. The coefficient quantifies the change in the mean of the outcome variable when there’s a one-unit shift in the variable, assuming all other variables in the model remain constant.
Correlation: A correlation is simply the association—or “co-relationship”—between two or more things. In predictive modeling, correlation describes the statistical association between variables, typically between variables and an outcome variable. The strength of the correlation is quantified as a percentage. The higher the percentage, the stronger the correlation. However, keep in mind that correlation is not causation. Correlation merely describes the strength of association between variables, not whether they causally affect each other.
Count: A count is the number of observations (records) associated with model training. The count can represent all observations in the data, or the subset of observations that meet associated filter criteria.
Date Variable: A date variable is a type of variable that contains date/time (temporal) data.
Dependent Variable: See outcome variable.
Drift: Over time, a deployed model's performance can drift, becoming less accurate in predicting outcomes. Drift can occur due to changing factors in the data or in your business environment. Drift also results from now-obsolete assumptions built into the data on which the model is based. To remedy a model that has drifted, you can refresh it by adjusting settings or retrain it on newer data.
Endpoint URL: The URL that represents the location where a particular API or service can be accessed. It’s where you can make requests to interact with the endpoint of an API.
Ethical Use: Ethical use reflects the application of AI and machine learning for fair and unbiased purposes. With AI Models, it's the practice of producing ethical and accountable models, insights, and predictions. For an overview, take the Responsible Creation of Artificial Intelligence Trailhead module.
Feature: See predictor variable.
Feature Selection: Feature selection involves choosing the optimum set of features (predictor variables) in a model. Ideally, a model contains the number of variables that best explain variations in the outcome variable. A model with too few variables can be too vague to detect underlying patterns in the data, resulting in an underfitting model. A model with too many variables can be overly specific and too complex to filter out noise in the data, resulting in an overfitting model. Successful feature selection includes the most influential variables with no significant lurking variables (important variables that are missing from the model).
Generalized Linear Model (GLM): Generalized linear model is an equation-based modeling algorithm used to build a model.
Goal: A goal specifies the desired outcome for your model. A model’s goal includes its outcome variable plus your preferred direction (minimize or maximize) for the outcome. For example, your goal could be to maximize margin or to minimize costs. Predictive modeling uses the model goal to orient its analysis and training. See outcome variable.
Gradient Boosting: Gradient Boosting is a decision tree-based ensemble machine learning algorithm used to build a model. Also called Gradient Boosting Machine (or GBM).
Importance: Importance is the relative influence of a variable on the model's predicted outcome. For models created in AI Models, importance indicates how much the model chooses to use a variable when predicting the outcome. The level of importance is quantified as a percentage. The higher the percentage, the greater the impact. Importance is an advanced metric that considers interactions between variables. If two variables are highly correlated and contain similar information, the model chooses the better variable to use. For example, when predicting energy usage the variables of temperature and number of air conditioner hours are both highly correlated, but only one of the variables receives a high importance score.
Independent Variable: See predictor variable.
Inference: An inference is any type of data output produced by an AI model in AI Models.
k-fold Cross-Validation: Model validation process that randomly divides all the observations in the data into four separate partitions of equal size. Next, it completes four test passes (folds) in which three of the partitions serve as the training set and one partition serves as the test set. Metrics are averaged across all four folds.
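The four-fold procedure in the k-fold Cross-Validation entry can be sketched as follows. The function names, the fixed random seed, and the `evaluate` callback are hypothetical, introduced only for illustration; the sketch doesn't claim to match the product's implementation.

```python
import random

def four_fold_splits(observations, seed=0):
    """Yield (training_set, test_set) pairs, one pair per fold."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)            # randomly order the observations
    partitions = [shuffled[i::4] for i in range(4)]  # four near-equal partitions
    for test_index in range(4):
        test_set = partitions[test_index]
        training_set = [
            obs
            for i, part in enumerate(partitions)
            if i != test_index
            for obs in part
        ]
        yield training_set, test_set

def average_metric(observations, evaluate):
    """Average a metric over all four folds; evaluate(train, test) returns a number."""
    scores = [evaluate(train, test) for train, test in four_fold_splits(observations)]
    return sum(scores) / len(scores)

folds = list(four_fold_splits(range(20)))
```

Each observation lands in the test set exactly once and in the training set for the other three folds.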
Key Based Authentication: Authentication that uses an API key to authorize access to an API endpoint. API keys are a common method of securing APIs and controlling access to specific resources.
Leakage: Leakage occurs when the data used to train your model includes one or more variables that contain the information that you’re trying to predict. This can result in models that appear extremely accurate during training but are, in actuality, problematic because that information isn't available when real predictions are made. To remedy data leakage, remove any variables from your model that are causing the leakage.
Linear Regression: Linear regression is an analytical technique used for numeric use cases.
Logistic Regression: Logistic regression is an analytical technique used for the binary classification use case.
Lurking Variable: A lurking variable is a variable that is missing from your model but that significantly explains variations in the outcome variable.
Mean: A mean is the statistical average: the sum of all items divided by the number of items.
Model: A model is the sophisticated, custom representation based on a comprehensive, statistical understanding of past outcomes used to predict future outcomes. A model accepts the values of one or more predictor variables as input and produces a predicted outcome as output, along with top factors and improvements (if requested).
Model Builder: The model builder is the tool in AI Models used to create, connect, and edit models.
Modeling Algorithm: A modeling algorithm is what AI Models uses to create a model. Einstein uses one of several algorithms: generalized linear model (GLM) is a linear algorithm, while gradient boosting machine (GBM) and XGBoost are decision tree-based machine learning algorithms.
Model Performance: Model performance refers to the metrics used to describe how well the predictive model performs. These metrics (quality indicators, which are sometimes called fit statistics) show how well the model's predictions fit the training data in the dataset. For definitions of quality indicators shown in Model Performance, see Evaluate Model Quality.
Noise: Noise is any data that doesn’t meaningfully explain variations in your outcome variable. See signal.
Numeric Variable: A numeric variable is a type of variable that represents quantitative values (numbers), such as revenue or price. You can do math on numeric variables, such as calculating the total revenue or the average price. A numeric value always has an associated unit of measure, such as currency, volume, or weight. A model that represents a numeric use case has a numeric outcome variable.
Observation: An observation represents an instance of the data. An observation is analogous to a row of data in a table, or a record in Salesforce. For example, if your model’s goal is to maximize opportunity wins, then each observation represents an opportunity.
Outcome: An outcome is the business result. An outcome is typically a key performance indicator (KPI), such as sales margin or opportunity wins.
Outcome Variable: In predictive modeling, the outcome variable is the field used as the single, primary focus for analysis and predictions. The goal of a model is to maximize or minimize its outcome variable. An outcome variable is sometimes referred to as the response, the target variable, or the dependent variable. See goal.
Outlier: If Einstein detects outliers in your data, it means that a variable contains data points that are unusually distant from the average value (more than five times the standard deviation from the mean for that variable). Uncommonly large or small numbers, potentially from data entry errors or rare events, affect averages (means) and standard deviations, which can reduce the accuracy of insights or predictions. Outliers can be selectively excluded from a model.
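The five-standard-deviation rule in the Outlier entry can be illustrated with a short sketch. The function name and sample values are hypothetical, and using the population standard deviation is an assumption; the entry doesn't say which variant is applied.

```python
from statistics import mean, pstdev

def find_outliers(values, max_deviations=5):
    """Return values more than max_deviations standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation (an assumption)
    if spread == 0:
        return []            # all values identical, so nothing is distant
    return [v for v in values if abs(v - center) > max_deviations * spread]

# Hypothetical order amounts with one likely data entry error.
amounts = [100] * 50 + [102] * 49 + [10_000]
outliers = find_outliers(amounts)
```

Because the extreme value itself inflates both the mean and the standard deviation, a rule like this flags only very distant points, and only when there are enough observations.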
Overfitting: In predictive modeling, overfitting occurs when a model performs well in predicting outcomes on the training data in the dataset, but less well when predicting outcomes for novel (unseen) data, such as production data. Using too many variables can result in an overly complex predictive model that captures the noise in your data. To mitigate overfitting, created models in AI Models use ridge regression and regularization. See also underfitting.
Performance: For predictive models, performance is a qualitative measure of how accurately a model predicts outcomes. For created models in AI Models, use the model’s training metrics to evaluate performance.
Predicted Outcome: A predicted outcome is the result that a predictive model estimates or forecasts based on input data and learned patterns. It represents the model’s best guess for the target variable.
Prediction: In AI Models, a prediction is a derived value (produced by a model) that represents a predicted outcome. You can think of a prediction as the output of a predictive model.
Predictive Model: See model.
Predictive Modeling: Predictive modeling is the practice of analyzing historical and current data, using AI, machine learning, and statistical techniques, to identify patterns and predict probabilistic future outcomes. Predictive modeling is sometimes called predictive analytics or predictive AI.
Predictor or Predictor Variable: A variable that a model expects as input. A prediction request passes values for each predictor variable that the model requires. Based on the provided input values, the model's equation produces a prediction as an output. Predictors are also known as features and independent variables.
Prescription: A prescription is a suggested action that can improve the likelihood of a desired outcome. Prescriptions are associated with actionable variables, which are explanatory variables that people can control. Taking a suggested action can improve the predicted outcome.
R²: R² measures a regression model's ability to explain variation in the outcome. It represents the proportion of the variance in the outcome variable that is predictable from one or more variables. In general, the higher the R², the better the model predicts outcomes. R² is a commonly used metric for regression use cases that predict numeric values.
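As an illustration of the R² entry, one common way to compute the metric is 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name and sample values below are hypothetical, and whether AI Models computes R² exactly this way is an assumption.

```python
def r_squared(actual, predicted):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total

actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
score = r_squared(actual, predicted)  # approximately 0.948
```

Predictions that match the actual outcomes exactly give an R² of 1, and a model that always predicts the mean of the actual outcomes gives 0.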
Ranked Data: Ranked data is used to distribute data by probability. Also known as cumulative fraction, ranked data is often presented as deciles or quantiles. In training metrics for created models in AI Models, gain and lift charts are plotted on an x-axis of percentage of ranked data. For example, a ranked data of 0.1 equates to the top decile, or the 10% of records with the highest scores.
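To make the Ranked Data example concrete, the sketch below sorts scores from highest to lowest and keeps the top fraction. The function name and the evenly spaced sample scores are hypothetical.

```python
def top_fraction(scores, fraction):
    """Return the highest-scoring share of records, for example 0.1 for the top decile."""
    ranked = sorted(scores, reverse=True)
    cutoff = round(len(ranked) * fraction)
    return ranked[:cutoff]

# Hypothetical scores for 100 records: 0.00, 0.01, ..., 0.99.
scores = [i / 100 for i in range(100)]
top_decile = top_fraction(scores, 0.1)
```

Here a ranked data value of 0.1 selects the 10 records with the highest scores.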
Regression: A regression applies to model outcome variables that are numeric. Predicting a number field is a regression problem with its own set of metrics to measure model quality. In predictive modeling, linear regression is used for numeric outcomes. The numeric use case is one of the main use cases that created models in AI Models support.
Request: An endpoint request is a client's request to a specific URL or endpoint on a server. This request includes the HTTP method, headers, parameters, and, if applicable, a request body.
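The parts of an endpoint request (URL, HTTP method, headers, and body) can be seen in a small sketch that builds a request without sending it. The endpoint URL, the API key, the Bearer scheme, and the payload shape are all hypothetical placeholders, not the actual AI Models API.

```python
import json
from urllib.request import Request

ENDPOINT_URL = "https://example.com/predict"  # hypothetical endpoint URL
API_KEY = "my-secret-key"                     # hypothetical placeholder key

def build_prediction_request(predictors):
    """Assemble (but don't send) a POST request carrying an auth header and a JSON body."""
    body = json.dumps({"predictors": predictors}).encode("utf-8")
    return Request(
        ENDPOINT_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {API_KEY}",  # auth header (scheme is an assumption)
            "Content-Type": "application/json",
        },
    )

request = build_prediction_request({"amount": 5000, "lead_source": "Web"})
```

Passing the request to `urllib.request.urlopen` would send it; that call is left out so the sketch runs without network access.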
Residual: A residual is the mathematical difference between the observed (or actual) value and the predicted value. It's calculated as residual=observed value-predicted value. Residuals are also known as errors, and are used to assess model quality and accuracy. A positive residual means the prediction was too low, while a negative residual means the prediction was too high.
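The residual formula in the Residual entry translates directly into code. The function name and sample values are hypothetical.

```python
def residuals(observed, predicted):
    """residual = observed value - predicted value, computed per observation."""
    return [o - p for o, p in zip(observed, predicted)]

observed = [100, 250, 80]
predicted = [90, 260, 80]
errors = residuals(observed, predicted)  # [10, -10, 0]
```

The first prediction was too low (positive residual), the second was too high (negative residual), and the third was exact.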
Response: An endpoint response is a server's reply to a client’s request. The response includes an HTTP status code indicating the success or failure of the request, headers providing metadata about the response, and, if applicable, a response body containing the requested data or additional information.
Retriever: A retriever returns relevant facts from textual data, which is indexed in the vector database, to augment a Large Language Model (LLM) prompt. By augmenting prompts with accurate, current, and pertinent information, retrievers improve the value and relevance of LLM responses for the user.
Ridge Regression: Ridge regression is a regularization approach that created models in AI Models use to mitigate model overfitting by preventing coefficients from getting too large.
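To show how a ridge penalty keeps coefficients from getting too large, the sketch below uses the simplest possible case: one predictor and no intercept, where the coefficient has the closed form sum(x*y) / (sum(x*x) + penalty). The function name, data, and penalty values are hypothetical, and this is not a description of how AI Models fits its models.

```python
def ridge_coefficient(x, y, penalty):
    """One-predictor, no-intercept ridge estimate: sum(x*y) / (sum(x*x) + penalty)."""
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = sum(a * a for a in x) + penalty
    return numerator / denominator

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
unpenalized = ridge_coefficient(x, y, penalty=0.0)   # 2.0, ordinary least squares
penalized = ridge_coefficient(x, y, penalty=14.0)    # 1.0, shrunk toward zero
```

A larger penalty shrinks the coefficient further toward zero, which trades a slightly worse fit on the training data for a simpler model.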
Sampling: Technique of randomly selecting a subset of observations to analyze for the purpose of reducing the time needed to analyze the data. The sample should be large enough to be sufficiently representative of the variability in the data.
Score: (noun) A prediction associated with an observation. (verb) The process of predicting outcomes for a set of observations.
Secret Key: A secret cryptographic key that’s associated with an API endpoint. The key authenticates and authorizes endpoint requests.
Segment: A segment is a subset of observations (rows) that meet the criteria specified in the segment filter. See segmentation.
Segmentation: Segmentation involves filtering data to focus predictions on a particular group, such as a customer type or region.
Sensitive Variable: A sensitive variable contains data that could potentially be associated with unfair treatment. Some examples are variables associated with race, gender, religion, national origin, sexual orientation, disability, or age. Less obvious examples include proxy variables, such as street address or ZIP code, which can reflect discriminatory practices.
Signal: Signal is an indication of a statistically significant and potentially meaningful pattern in your data. For example, an insight can describe a high correlation between a variable and the outcome variable. By investigating the relationship further, you can learn whether the correlation helps explain variations in the outcome (possible signal) or not (possible noise). Sometimes referred to as a hint.
Sync Type: The method used to synchronize data between different systems, devices, or applications. Synchronization is the process of maintaining consistency between two or more datasets.
Terminal State: Data that is finalized and not expected to change. An example of finalized data is the date on which an order shipped. A record that has reached its terminal state represents an actual outcome (also called observed outcome). Define the conditions under which your model’s outcome variable has attained its terminal state.
Text Variable: See categorical variable, binary outcome.
Threshold: In a binary classification model, the threshold value tells your model how to classify a binary outcome. For created models in AI Models, if the calculated probability is above the threshold value, Einstein classifies the outcome one way (such as True or Positive). If the calculated probability is below the threshold value, Einstein classifies the outcome the other way (such as False or Negative). The default threshold is 0.5, but you can tune this value up or down to accommodate your use case. The threshold is sometimes called the Classification Threshold or Decision Threshold.
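The effect of tuning the threshold can be sketched as follows. The function name and probabilities are hypothetical, and treating a probability exactly equal to the threshold as False is an assumption, because the entry covers only values above and below it.

```python
def classify(probability, threshold=0.5):
    """Return True when the calculated probability is above the threshold."""
    return probability > threshold

probabilities = [0.2, 0.55, 0.9]
default_labels = [classify(p) for p in probabilities]                # [False, True, True]
strict_labels = [classify(p, threshold=0.7) for p in probabilities]  # [False, False, True]
```

Raising the threshold from 0.5 to 0.7 flips the middle record from True to False, so fewer records are classified as positive.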
Tokens: The units of text that the large language models (LLMs) process and generate. Tokens act as a bridge between the raw text data and numerical representations that LLMs can work with. Examples of units include individual characters, words, subwords, or larger linguistic units.
Top Predictors: Top predictors are the conditions that most significantly drive the predicted outcome, in decreasing order of magnitude. A condition is a data value associated with a column. In created models in AI Models, a predictor consists of one or two conditions. See predictor variables.
Training Set: In predictive modeling, the training set is the portion of the data that is used to train the model to make predictions. See also: validation set.
Underfitting: In predictive modeling, underfitting occurs when a model performs poorly in predicting outcomes on the training data in the dataset. Underfitting is often a result of an excessively simple model in which there aren't enough variables for a statistical algorithm to capture the underlying patterns in the data. See also overfitting.
Unstructured Text: Free-form text that varies in content and length. Examples include customer comments, survey feedback, social media postings, text messages, and emails. Contrast with categorical variable.
Validation Set: In predictive modeling, the validation set is the portion of the data used to validate the predictions generated by the trained model. See also: training set.
Variable: A variable represents a characteristic of the data you’re analyzing. A variable is analogous to a column in a table or a field in a Salesforce object. For example, an opportunity has variables (such as the opportunity type, lead source, fiscal year, and expected amount) that describe properties associated with each opportunity. Each variable has one data type (number, text, or date). In predictive modeling, relationships among two types of variables are analyzed: outcome variables and predictor variables. Data scientists sometimes refer to variables as attributes or features.
XGBoost: XGBoost, or extreme gradient boosting, is an extension of GBM that’s optimized for efficiency. XGBoost is a decision tree-based, ensemble machine learning algorithm where groups of decision trees are built sequentially to better fit the data, while avoiding overfitting.
