Binary Classification Metrics

Metrics for binary classification help evaluate the performance of a model that categorizes data into two classes.

Accuracy

In addition to the overall accuracy score, which is based on AUC, two further views help you understand the accuracy of a binary classification model.

Model accuracy tile

View how often the model makes correct and incorrect predictions (1). See the threshold cutoff point for how predictions are classified between classes (2).

Confusion Matrix

Use the confusion matrix to evaluate the tradeoffs between different error types based on the threshold value. The chart displays how many times the model correctly and incorrectly classifies observations at the associated threshold.

Confusion matrix chart
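To make the threshold tradeoff concrete, the following minimal sketch counts the four confusion matrix cells at two different thresholds. The function name `confusion_counts`, the sample labels and scores, and the rule that a score at or above the threshold is predicted positive are all assumptions made for illustration; they are not taken from the product.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Count TP, FP, TN, FN when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical labels (1 = positive) and model scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.7, 0.1]

low = confusion_counts(y_true, scores, threshold=0.35)
high = confusion_counts(y_true, scores, threshold=0.65)
print(low)   # {'TP': 4, 'FP': 1, 'TN': 3, 'FN': 0}
print(high)  # {'TP': 3, 'FP': 0, 'TN': 4, 'FN': 1}
```

In this sample, raising the threshold removes the one false positive but introduces one false negative, which is the kind of tradeoff the chart is meant to show.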

ROC Curve

The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate at various threshold settings. AUC (Area Under the Curve) summarizes the curve in a single number that quantifies how well the model separates the two classes. Use the chart to see how effectively the model can differentiate between classes.

ROC curve chart
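As a rough illustration of how an ROC curve and its AUC can be derived from scored data, the sketch below sweeps the threshold from high to low and integrates the resulting curve with the trapezoid rule. The helper names `roc_points` and `auc_trapezoid` and the sample data are hypothetical, and the product may compute these values differently. The sketch assumes both classes are present in the data.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs from (0, 0) to (1, 1), one per distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Rank records from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, actual) in enumerate(ranked):
        if actual == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last record of a group of tied scores.
        last_of_group = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if last_of_group:
            points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = positive) and model scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.7, 0.1]

points = roc_points(y_true, scores)
auc = auc_trapezoid(points)
print(auc)  # 0.9375
```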

Gain and Lift

Gain and Lift charts show the benefit of the model. Using a portion of the data that’s scored and ranked for analysis, the charts measure results obtained with the model compared to random guessing without a model. The greater the gain and the higher the lift, the more effective the model.

Gain and Lift charts

Chart	Description
Gain	The gain chart plots the cumulative share of the positive class captured, or gain, against the percentage of the scored and ranked data. The closer the model line is to the perfect-model line and the further it is from the random-guessing line (no model), the greater the gain. Gain can be used to prioritize your organization’s resources. For example, if a model has 80% gain at 20% of the data, then the top-ranked 20% of the data contains 80% of the target class.
Lift	The lift chart plots the improvement ratio, or lift, against the percentage of the scored and ranked data. Better models have higher lift. For example, if a model has 2.5 lift at 20% of the data, then the top-ranked 20% of the data contains 2.5 times as many positives as a randomly selected 20% would.
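The following sketch shows one common way to compute gain and lift at a given percentage of the data: rank the records by score, take the top portion, and compare the share of positives captured with the share of data used. The function name `gain_and_lift` and the sample data are hypothetical, and the exact calculation in the product is not documented here.

```python
def gain_and_lift(y_true, scores, fraction):
    """Gain and lift for the top `fraction` of records ranked by score."""
    ranked = [actual for _, actual in
              sorted(zip(scores, y_true), key=lambda pair: -pair[0])]
    top_n = max(1, round(fraction * len(ranked)))
    captured = sum(ranked[:top_n])           # positives found in the top portion
    gain = captured / sum(ranked)            # share of all positives captured
    lift = gain / (top_n / len(ranked))      # improvement over random selection
    return gain, lift


# Hypothetical data: 10 records, 4 of them positive.
scores = [0.95, 0.90, 0.70, 0.60, 0.55, 0.40, 0.35, 0.30, 0.20, 0.10]
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

gain_20, lift_20 = gain_and_lift(y_true, scores, 0.20)
gain_50, lift_50 = gain_and_lift(y_true, scores, 0.50)
print(gain_20, lift_20)  # 0.5 2.5
print(gain_50, lift_50)  # 0.75 1.5
```

Here the top 20% of the ranked records holds half of all positives, so the lift at 20% is 2.5. Lift falls toward 1 as the selected portion approaches the full data.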

4-Fold Cross Validation Results

The 4-fold cross-validation approach mitigates sampling bias during the model validation process. In this method, the data is randomly divided into four separate partitions of equal size, and the model undergoes four test passes (folds). During each pass, three partitions serve as the training data, while the remaining one serves as the test data. By completing four test passes, each partition is used once as the validation data and three times as part of the training data, ensuring a comprehensive evaluation. Refer to the table of validation results to examine metrics corresponding to each fold of the data.
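The partitioning scheme described above can be sketched as follows. This is an illustration only: the function name `four_fold_splits`, the fixed seed, and the use of record indices are assumptions, and the product's own partitioning logic may differ. When the number of records is not divisible by four, the fold sizes in this sketch differ by at most one.

```python
import random


def four_fold_splits(n_records, seed=0):
    """Return four (train_indices, test_indices) pairs over n_records."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)         # random assignment to folds
    folds = [indices[i::4] for i in range(4)]    # four near-equal partitions
    splits = []
    for k in range(4):
        test_idx = folds[k]                      # one partition held out
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train_idx, test_idx))
    return splits


splits = four_fold_splits(20)
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold #{fold_number}: train={len(train_idx)} test={len(test_idx)}")
```

With 20 records, each fold tests on 5 records and trains on the other 15, and every record appears in exactly one test partition.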

Cross validation metrics table

Metric	Description
Number of records	Total number of observations. The meaning of the value varies by column. For the Training Data and Validation Data columns, the numbers are the same. This value represents the total number of observations in the entire dataset used to create the model. For the Fold #1 through Fold #4 columns, this value represents how many observations fell in that fold (approximately 25% of the entire dataset).
AUC	The Area Under the Curve (AUC) summarizes how well a logistic model separates the two classes across all thresholds. It equals the probability that the model scores a randomly chosen positive higher than a randomly chosen negative. 0.5 means that the model performs no better than random guessing. 1.0 means that the model separates the classes perfectly, which can indicate data leakage.
GINI	The Gini Index quantifies how closely this logistic model performs to a theoretically best possible model.
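The sketch below illustrates how AUC and Gini relate under one widely used convention. It computes AUC as the fraction of positive-negative pairs in which the positive record receives the higher score (ties count as half), and then derives Gini as 2 × AUC − 1. This article does not state which formula the product uses for Gini, so treat the relationship as an assumption. The function names and sample data are hypothetical.

```python
def auc_pairwise(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    """Assumed convention: Gini = 2 * AUC - 1 (0 = random, 1 = perfect)."""
    return 2 * auc - 1


# Hypothetical labels (1 = positive) and model scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.7, 0.1]

auc_value = auc_pairwise(y_true, scores)
gini_value = gini_from_auc(auc_value)
print(auc_value, gini_value)  # 0.9375 0.875
```

Under this convention, an AUC of 0.5 maps to a Gini of 0 and an AUC of 1.0 maps to a Gini of 1.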

Other Metrics

Consider other metrics that are commonly used to evaluate model quality.

Metric	Description
Accuracy	Accuracy measures the proportion of outcomes that the model predicted correctly (true positives and true negatives). Use to evaluate the overall classification performance of a model. The range is from 0 to 1, with a higher value indicating better performance. It's calculated as `(True Negative+True Positive)/(True Negative+False Negative+True Positive+False Positive)`.
F1 Score	F1 score is the harmonic mean of the positive predictive value (precision) and the true positive rate (recall). Use to evaluate the overall performance of a binary classification model, particularly when it's equally important to minimize false positives and false negatives. The range is from 0 to 1, with a higher value indicating better performance. It’s calculated as `2*(Positive Predictive Value*True Positive Rate)/(Positive Predictive Value+True Positive Rate)`.
False Negatives	The number of predicted negatives that are actually positive.
False Negative Rate	False Negative Rate (FNR, also called type II error rate or miss rate) is the proportion of false negatives among all the actual positives. Use to evaluate how often a classification model incorrectly classifies positives as negatives, or when it's important to minimize false negative errors. The range is from 0 to 1, with a lower value indicating better performance. It's calculated as `False Negative/(False Negative+True Positive)`.
False Positives	The number of predicted positives that are actually negative.
False Positive Rate	False Positive Rate (FPR, also called type I error rate, false alarm ratio, or fallout) is the proportion of false positives among all the actual negatives. Use to evaluate how often a classification model incorrectly classifies negatives as positives, or when it's important to minimize false positive errors. The range is from 0 to 1, with a lower value indicating better performance. It's calculated as `False Positive/(False Positive+True Negative)`.
Informedness	Informedness (also called Youden's J statistic) measures how well the model predicts both positives and negatives. Use to evaluate the overall performance of a binary classification model, particularly when it's equally important to classify true positives and true negatives. The range is from -1 to 1, with 1 indicating perfect performance, 0 indicating random performance, and -1 indicating perfect inverse performance. It's calculated as `True Positive Rate+True Negative Rate-1`.
Markedness	Markedness measures the trustworthiness of positive and negative predictions by the model. Use to evaluate the overall performance of a binary classification model, particularly when it's important to assess how reliable both the positive and the negative predictions are. The range is from -1 to 1, with 1 indicating perfect performance, 0 indicating random performance, and -1 indicating perfect inverse performance. It's calculated as `Positive Predictive Value+Negative Predictive Value-1`.
MCC	The Matthews Correlation Coefficient (MCC) provides a more even representation of the four parts of the confusion matrix than other metrics. Use to evaluate overall performance, particularly when there's imbalanced data. The range is from -1 to 1, with 1 indicating perfect performance, 0 indicating random performance, and -1 indicating perfect inverse performance. It's calculated as `(True Positive*True Negative-False Positive*False Negative)/square root((True Positive+False Positive)*(True Positive+False Negative)*(True Negative+False Positive)*(True Negative+False Negative))`.
Negative Predictive Value	Negative Predictive Value (NPV) is the proportion of actual negatives among all the predicted negatives. Use to evaluate how well a classification model predicts negative instances, or when it's important to minimize false negatives. The range is from 0 to 1, with a higher value indicating better performance. It's calculated as `True Negative/(True Negative+False Negative)`.
Positive Predictive Value (Precision)	Positive Predictive Value (PPV, also called precision) is the proportion of actual positives among all the predicted positives. Use to evaluate how well a classification model predicts positive instances, or when it's important to minimize false positives. The range is from 0 to 1, with a higher value indicating better performance. It's calculated as `True Positive/(True Positive+False Positive)`.
True Negatives	The number of predicted negatives that are actually negative.
True Negative Rate (Specificity)	True Negative Rate (TNR, also called specificity) is the proportion of correctly predicted negatives among all the actual negatives. Use to evaluate how often a classification model correctly classifies negatives, or when it's important to correctly identify negative instances. The range is from 0 to 1, with a higher value indicating better performance. It's calculated as `True Negative/(True Negative+False Positive)`.
True Positives	The number of predicted positives that are actually positive.
True Positive Rate (Sensitivity, Recall)	True Positive Rate (TPR, also called sensitivity or recall) is the proportion of correctly predicted positives among all the actual positives. Use to evaluate how often a classification model correctly classifies positives, or when it's important to correctly identify positive instances. The range is from 0 to 1, with a higher value indicating better performance. It's calculated as `True Positive/(True Positive+False Negative)`.
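The formulas in the table can be checked against a set of confusion matrix counts with a short sketch like the one below. The function name `binary_metrics`, the dictionary keys, and the sample counts are hypothetical. The sketch assumes every denominator is nonzero, so it does not handle cases such as a model that never predicts the positive class.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Compute the table's metrics from the four confusion matrix counts."""
    tpr = tp / (tp + fn)   # true positive rate (sensitivity, recall)
    tnr = tn / (tn + fp)   # true negative rate (specificity)
    ppv = tp / (tp + fp)   # positive predictive value (precision)
    npv = tn / (tn + fn)   # negative predictive value
    return {
        "accuracy": (tn + tp) / (tn + fn + tp + fp),
        "f1": 2 * (ppv * tpr) / (ppv + tpr),
        "fnr": fn / (fn + tp),
        "fpr": fp / (fp + tn),
        "informedness": tpr + tnr - 1,
        "markedness": ppv + npv - 1,
        "mcc": (tp * tn - fp * fn)
               / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "npv": npv,
        "ppv": ppv,
        "tnr": tnr,
        "tpr": tpr,
    }


# Hypothetical counts from a confusion matrix with 100 observations.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these counts, accuracy is 0.85, precision is 0.8, and markedness is 0.7. The rates in each complementary pair sum to 1: FNR with TPR, and FPR with TNR.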
