Binary classification is one of the most common machine learning tasks, encountered in numerous practical applications.
However, in practice, the goal of such tasks often extends beyond simply predicting a class. What becomes much more important is the model's ability to estimate the probability of an object belonging to one class or another. In other words, we are interested not only in which class to choose but also in how confident the model is in its decision.
Such tasks are quite frequent. For example, in credit scoring, the task is to estimate the probability of client default, that is, whether a client will stop paying their loan. Banks use such models to make decisions based on the calculated default probabilities: whether to issue a loan and, if so, under what terms. Here the accuracy of the probability estimates directly affects financial outcomes.
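But how can we determine the quality of the model's probability predictions? Traditional metrics such as accuracy, recall, or F-measure are not suitable for such tasks, because they evaluate hard class labels obtained at a fixed threshold and ignore the probability values themselves. Specialized tools are needed to assess the quality of probability predictions.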
In this article, I will share practical experience in evaluating probabilistic predictions, discuss the key metrics used in practice, and explain how to interpret them and what purposes they are best suited for.
Let us consider a dataset with l observations:
$$\{(x_i, y_i)\}_{i=1}^{l}, \quad y_i \in \{0, 1\}$$
And let's assume we have trained a binary classification model that predicts p_i, the probability that y_i = 1 for the object x_i.
Let’s try to evaluate the quality of probability predictions for such a classifier. What properties should an ideally predicted probability possess?
First, probabilities should effectively rank objects by their likelihood of belonging to a specific class. This means that an object with characteristics of class "1" should have a higher probability of belonging to this class than an object that lacks those characteristics.
Second, probabilities should be calibrated, meaning they should align with the true frequency of events. Calibration implies that the model’s predictions reflect the actual likelihood of an event. For instance, if the model predicts a probability of 0.8 for a group of objects, then 80% of those objects should indeed belong to the positive class. A calibrated model, therefore, not only ranks objects effectively but also provides meaningful and interpretable probability predictions.
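The calibration property can be illustrated with a toy group of objects. The sketch below is only an illustration under assumed data: the labels and the helper name `calibration_gap` are hypothetical and not part of the original article.

```python
# Hypothetical toy group: ten objects that all received the same prediction.
labels = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # 8 of the 10 objects are positive


def calibration_gap(predicted_probability, labels):
    """Absolute difference between a prediction and the observed positive rate."""
    observed_frequency = sum(labels) / len(labels)
    return abs(predicted_probability - observed_frequency)


# A prediction of 0.8 matches the observed frequency of positives in the group
gap_calibrated = calibration_gap(0.8, labels)
# A prediction of 0.99 for the same group overstates that frequency
gap_overconfident = calibration_gap(0.99, labels)

print(f"Gap for prediction 0.80: {gap_calibrated:.2f}")
print(f"Gap for prediction 0.99: {gap_overconfident:.2f}")
```

Both predictions would assign the group to class "1" at a threshold of 0.5, yet only the first one carries a probability that matches the observed frequency.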
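Log Loss evaluates how much the predicted probabilities p_i deviate from the true labels y_i. The metric is calculated as follows:
$$\text{LogLoss} = -\frac{1}{l}\sum_{i=1}^{l}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$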
The lower the Log Loss, the better the model predicts probabilities: it ranges from 0 (perfect predictions) to infinity (confident but incorrect predictions). The metric reaches its minimum when, for unambiguous objects, the model predicts probabilities close to 1 for the correct class and close to 0 for the others. For objects with characteristics of both classes, the probabilities should reflect their uncertainty, such as being closer to 0.5.
Log Loss correlates well with other probability evaluation metrics but is sensitive to outliers and can be challenging to interpret. For instance, a Log Loss value of 0.8 cannot always be definitively classified as "good" or "bad."
Application: Log Loss is suitable for comparing models (the one with the lower value is preferred) but is less helpful for assessing the absolute quality of predictions made by a single model.
from sklearn.metrics import log_loss
import numpy as np
# Example of true class labels (0 or 1)
y_true = [0, 1, 1, 0, 1]
# Example of predicted probabilities of belonging to class 1
y_pred_proba = [0.1, 0.9, 0.8, 0.3, 0.6]
# Calculation of Log Loss
logloss = log_loss(y_true, y_pred_proba)
print(f"Log Loss: {logloss}")
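To connect the library call with the definition of Log Loss, the metric can also be computed by hand. This is a minimal sketch using only the standard library; the function name `manual_log_loss` is an illustrative assumption, and the sketch assumes every probability lies strictly between 0 and 1 so that the logarithm is defined.

```python
import math


def manual_log_loss(y_true, y_prob):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # A positive object contributes -log(p), a negative one -log(1 - p)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Same toy data as in the example above
y_true = [0, 1, 1, 0, 1]
y_pred_proba = [0.1, 0.9, 0.8, 0.3, 0.6]
manual_value = manual_log_loss(y_true, y_pred_proba)
print(f"Manual Log Loss: {manual_value:.4f}")

# Penalty for a single positive object as the predicted probability gets worse
penalties = {p: -math.log(p) for p in (0.9, 0.5, 0.1, 0.001)}
for p, penalty in penalties.items():
    print(f"p = {p}: penalty = {penalty:.3f}")
```

The second loop shows why the metric has no upper bound: the penalty for a positive object grows without limit as its predicted probability approaches 0.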
One of the most popular metrics is ROC-AUC.
The ROC Curve (Receiver Operating Characteristic curve) is a graph where the X-axis corresponds to the False Positive Rate (FPR), and the Y-axis corresponds to the True Positive Rate (TPR).
The ROC curve always passes through the points (0,0) and (1,1). It starts at (0,0) when all objects are classified as class "0" and reaches (1,1) when all objects are classified as class "1."
If the curve is closer to the top-left corner, it indicates that the model performs well, correctly separating the classes with minimal errors. If the model's predictions are random, the curve will follow the diagonal from (0,0) to (1,1). Conversely, if the model frequently confuses the classes (e.g., labeling class "0" as "1" and vice versa), the curve will approach the bottom-right corner. To derive a numerical metric from this, the area under the ROC curve is calculated.
ROC-AUC (Area Under the ROC Curve) is the area under the ROC curve.
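The larger this area, the better the model. ROC-AUC can take values between 0 and 1: a value of 1 corresponds to perfect ranking, 0.5 to random predictions, and values below 0.5 to a model that systematically confuses the classes.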
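The construction of the ROC curve and the area under it can be sketched without any library. The code below is one possible illustration, not the implementation used by scikit-learn; the helper names `roc_points` and `trapezoid_area` are assumptions, and the data is a small made-up sample.

```python
def roc_points(y_true, scores):
    """Return the (FPR, TPR) points of the ROC curve, from (0, 0) to (1, 1)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Walk through the objects from the highest score to the lowest
    ranked = sorted(zip(scores, y_true), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: everything is class "0"
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Objects with the same score cross the threshold together
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3, 0.9, 0.5]

points = roc_points(y_true, y_scores)
area = trapezoid_area(points)
print(f"ROC points: {points}")
print(f"ROC-AUC: {area:.2f}")
```

Lowering the threshold step by step moves the curve up for every positive object and to the right for every negative one, which is why a good ranking hugs the top-left corner.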
ROC-AUC evaluates the model's ability to correctly rank objects based on their likelihood of belonging to a class, but it does not assess the calibration of probabilities — i.e., how well the predicted probabilities match the true event frequencies. For example, multiplying all probabilities by 1000 will result in values that are no longer valid probabilities, but the ROC-AUC will remain unchanged because scaling probabilities by a constant does not alter the ranking of objects.
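The invariance to scaling can be checked directly. The sketch below uses the interpretation of ROC-AUC as the share of (positive, negative) pairs in which the positive object receives the higher score, counting ties as one half; the function name `pairwise_auc` and the data are illustrative assumptions.

```python
def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3, 0.9, 0.5]
# Multiplying by 1000 produces values that are no longer probabilities
scaled_scores = [1000 * s for s in y_scores]

auc_original = pairwise_auc(y_true, y_scores)
auc_scaled = pairwise_auc(y_true, scaled_scores)
print(f"AUC on probabilities: {auc_original:.2f}")
print(f"AUC on scaled scores: {auc_scaled:.2f}")
```

Only the order of the scores enters the computation, so any transformation that preserves this order leaves the result unchanged.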
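In cases of severe class imbalance (e.g., when class "1" constitutes only a small percentage of the dataset), ROC-AUC may overestimate the model's quality: the False Positive Rate is normalized by the large number of class "0" objects, so even a substantial absolute number of false positives has little impact on the final score.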
Application: ROC-AUC is used to evaluate how well a model ranks objects by their class membership probability. It is not suitable for assessing the calibration quality of predicted probabilities.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
# Example of true class labels
y_true = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
# Example of predicted probabilities of belonging to class 1
y_pred_proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3, 0.9, 0.5]
# Computing the points of the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_pred_proba)
# Calculation of AUC
roc_auc = roc_auc_score(y_true, y_pred_proba)
# Plotting the graph
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random guess')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid()
plt.show()
PR-Curve (Precision-Recall Curve) is a graph that shows the relationship between Precision and Recall for different classification thresholds.
To derive a numerical metric from the curve, the area under the PR curve (Precision-Recall AUC) is calculated.
PR-AUC (Area Under the PR-Curve) is the area under the PR curve.
PR-AUC ranges from 0 to 1. The higher the area, the better the model.
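PR-AUC shares similarities with ROC-AUC as both measure the quality of ranking objects by class membership. However, the PR curve focuses on a single class: neither Precision nor Recall depends on the number of true negatives. It is more suitable when one class is much rarer than the other or when the focus needs to be on a single class.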
Application: PR-AUC is used to evaluate the quality of ranking objects by their likelihood of belonging to a class, particularly when one class is much rarer than the other or when focus needs to be on a single class. It is not used to assess the calibration of predicted probabilities.
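The effect of class imbalance on the two views of a model can be illustrated with a synthetic sample. Everything in this sketch is an assumption made for illustration: the data is artificial (1000 negatives, 10 positives), and `pairwise_auc` is a hypothetical helper that computes ROC-AUC as the share of correctly ordered (positive, negative) pairs.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for pos_score in pos_scores:
        for neg_score in neg_scores:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Artificial imbalanced sample: negatives spread over [0, 1), positives near the top
neg_scores = [i / 1000 for i in range(1000)]
pos_scores = [0.9505] * 10

roc_auc = pairwise_auc(pos_scores, neg_scores)

# Hard predictions at a fixed threshold
threshold = 0.9495
true_positives = sum(s >= threshold for s in pos_scores)
false_positives = sum(s >= threshold for s in neg_scores)

precision = true_positives / (true_positives + false_positives)
recall = true_positives / len(pos_scores)
false_positive_rate = false_positives / len(neg_scores)

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"FPR: {false_positive_rate:.3f}, Recall: {recall:.2f}, Precision: {precision:.3f}")
```

In this sample the ROC-AUC is about 0.95 and the False Positive Rate is only 5%, yet just 10 of the 60 objects flagged as class "1" are truly positive. Precision exposes this, which is why the PR curve is the more informative view here.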
import numpy as np
from sklearn.metrics import precision_score, recall_score, precision_recall_curve, auc
import matplotlib.pyplot as plt
# Example of data
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1]) # True labels (0 or 1)
y_scores = np.array([0.1, 0.9, 0.8, 0.3, 0.6, 0.4, 0.7, 0.2, 0.1, 0.85]) # Predicted probabilities for the positive class
# 1. Calculation of Precision and Recall for a fixed threshold
threshold = 0.5 # Example of a threshold
y_pred = (y_scores >= threshold).astype(int) # Predictions based on the threshold
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"Precision (at threshold {threshold}): {precision:.2f}")
print(f"Recall (at threshold {threshold}): {recall:.2f}")
# 2. Computing the points of the PR curve
precision_vals, recall_vals, thresholds = precision_recall_curve(y_true, y_scores)
# Calculation of PR-AUC
pr_auc = auc(recall_vals, precision_vals)
print(f"PR-AUC: {pr_auc:.2f}")
# 3. Plotting the PR curve
plt.figure(figsize=(8, 6))
plt.plot(recall_vals, precision_vals, label=f'PR Curve (AUC = {pr_auc:.2f})', linewidth=2)
plt.xlabel('Recall', fontsize=12)
plt.ylabel('Precision', fontsize=12)
plt.title('Precision-Recall Curve', fontsize=14)
plt.legend(loc='best')
plt.grid(True)
plt.show()
Reliability Diagram (Calibration Curve) and Expected Calibration Error (ECE) are tools for assessing the calibration of probabilistic models. They are used to analyze how well a model predicts outcome probabilities.
A Calibration Curve is a plot that shows the relationship between predicted probabilities and the actual frequency of successes.
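To build it, the objects are split into bins based on their predicted probabilities, and two values are computed for each bin g.
Average predicted probability:
$$P_g = \frac{1}{n_g}\sum_{i \in g} p_i$$
The proportion of actual positive outcomes (empirical probability) is calculated as:
$$E_g = \frac{1}{n_g}\sum_{i \in g} y_i$$
where n_g is the number of objects in bin g.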
In an ideal model, all points on the Calibration Curve lie on the diagonal y = x. This means that if the model predicts a probability of 0.7, the event actually occurs in 70% of cases. If sections of the curve are above the diagonal (as shown in the example graph), it indicates that the model underestimates the actual probability of events. Conversely, if sections of the curve are below the diagonal, the model overestimates probabilities, meaning it is overly confident in its predictions.
The Calibration Curve visualizes how well the model predicts probabilities across different ranges. It helps identify which groups of objects the model struggles with the most and in which direction the errors occur. For example, in the example graph, the model significantly underestimates probabilities for objects it most confidently classifies as class "1." This suggests that while it assigns these objects to class "1," it is not sufficiently confident in its predictions.
The Calibration Curve does not provide insight into how well the model ranks objects by class membership. Therefore, it should be used in conjunction with an ROC or PR curve.
Another important limitation of the Calibration Curve lies in its construction. Since the algorithm involves splitting objects into bins based on predicted probabilities, there may be cases where individual points on the curve are based on a small number of objects. This reduces the statistical significance of the predicted and empirical probability estimates, making these sections of the curve less reliable.
From the points on the Calibration Curve, the Expected Calibration Error (ECE) metric can be calculated.
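Expected Calibration Error (ECE) measures how much the model's predictions deviate from actual probabilities. It is a scalar value that aggregates the difference between the predicted probability and the observed frequency of successes over the bins, weighting each bin by its share of objects:
$$ECE = \sum_{g=1}^{B} \frac{n_g}{l}\,\left|E_g - P_g\right|$$
where B is the number of bins, n_g is the number of objects in bin g, P_g is the average predicted probability in the bin, and E_g is the observed frequency of positive outcomes in it.
The lower the ECE, the better the model's calibration. The metric can range from 0 (for a perfectly calibrated model) to 1 (for a poorly calibrated model).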
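The binning behind the Calibration Curve and the aggregation into ECE can be sketched in plain Python. This is an illustration under assumptions: the helper names are hypothetical, the bins are equal-width and closed on the left, and the ten objects are made up. Returning the bin size next to each point makes it visible when a point rests on very few objects.

```python
def reliability_bins(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, fraction of positives, size) per non-empty bin."""
    grouped = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # Equal-width bins on [0, 1]; a prediction of exactly 1.0 goes to the last bin
        index = min(int(p * n_bins), n_bins - 1)
        grouped[index].append((y, p))
    stats = []
    for members in grouped:
        if members:
            size = len(members)
            mean_pred = sum(p for _, p in members) / size
            frac_pos = sum(y for y, _ in members) / size
            stats.append((mean_pred, frac_pos, size))
    return stats


def expected_calibration_error(stats):
    """Size-weighted average of |fraction of positives - mean prediction| over bins."""
    total = sum(size for _, _, size in stats)
    return sum(size / total * abs(frac_pos - mean_pred)
               for mean_pred, frac_pos, size in stats)


# Hypothetical toy data
y_true = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
y_prob = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

stats = reliability_bins(y_true, y_prob, n_bins=5)
ece = expected_calibration_error(stats)
for mean_pred, frac_pos, size in stats:
    print(f"predicted = {mean_pred:.2f}, observed = {frac_pos:.2f}, objects = {size}")
print(f"ECE: {ece:.2f}")
```

With only two objects per bin, each observed frequency can take just three values (0, 0.5, 1), which is the reliability problem that small bins create.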
Application: The Calibration Curve is used to evaluate the calibration of probabilistic models. It helps analyze how accurately the model predicts outcome probabilities. Since it says nothing about the quality of object ranking, it should be used alongside the ROC curve or PR curve.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
# Example of data (replace with your data)
y_true = np.random.randint(0, 2, size=1000) # True labels (0 or 1)
y_pred = np.random.rand(1000) # Predicted probabilities
# Number of bins for grouping
n_bins = 10
# Computing the points of the Reliability Diagram
prob_true, prob_pred = calibration_curve(y_true, y_pred, n_bins=n_bins, strategy='uniform')
# Reliability Diagram plot
plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, marker='o', label='Calibration Curve')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Perfect Calibration')
plt.title('Reliability Diagram (Calibration Curve)')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.legend()
plt.grid()
plt.show()
# Calculation of Expected Calibration Error (ECE)
def compute_ece(y_true, y_pred, n_bins=10):
    """Calculates Expected Calibration Error (ECE)"""
    bins = np.linspace(0, 1, n_bins + 1)
    # right=True assigns x to bin i when bins[i-1] < x <= bins[i];
    # clipping keeps predictions equal to exactly 0 in the first bin
    bin_indices = np.clip(np.digitize(y_pred, bins, right=True), 1, n_bins)
    ece = 0.0
    for i in range(1, n_bins + 1):
        bin_mask = bin_indices == i
        bin_size = bin_mask.sum()
        if bin_size > 0:
            # Average predicted probability and observed frequency in the bin
            bin_confidence = y_pred[bin_mask].mean()
            bin_accuracy = y_true[bin_mask].mean()
            # Weight the gap by the share of objects that fall into the bin
            ece += (bin_size / len(y_true)) * abs(bin_accuracy - bin_confidence)
    return ece
# Calculation of ECE for the example data
ece = compute_ece(y_true, y_pred, n_bins=n_bins)
print(f"Expected Calibration Error (ECE): {ece:.4f}")
Hosmer-Lemeshow Curves and Statistics are tools for assessing the calibration of probabilistic models and visualizing predicted probabilities.
These tools are not commonly found in articles or literature and may have different names in various sources. In practice, the curves are often referred to as a Gain Chart, and the statistic is known as the Hosmer-Lemeshow Test. However, this tool has proven effective in practice and is arguably one of the most informative methods for visualizing the quality of predicted probabilities.
1. Sort the objects:
Arrange the objects in ascending order of the probability of class "1" predicted by the model.
2. Divide into bins:
Split the dataset into 10 bins of equal size.
3. Calculate the average probability and frequency:
For each bin g, calculate the following.
The average predicted probability:
$$P_g = \frac{1}{n_g}\sum_{i \in g} p_i$$
The proportion of actual positive outcomes (empirical probability):
$$E_g = \frac{1}{n_g}\sum_{i \in g} y_i$$
where n_g is the number of objects in bin g.
4. Plot the curves:
Two graphs are created, one showing the dependence of P_g on the bin number and the other showing the dependence of E_g on the bin number.
Since all bins are of equal size, there is no issue with the statistical significance of the calculated E_g and P_g.
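The construction steps can be sketched with the standard library alone. This is an illustration under assumptions, not the author's implementation: the function name `equal_size_bins` is hypothetical, and the data is synthetic, generated so that each label is drawn with exactly the predicted probability.

```python
import random


def equal_size_bins(y_true, y_prob, n_bins=10):
    """Sort by predicted probability, split into equal-size bins, return per-bin stats."""
    ordered = sorted(zip(y_prob, y_true))
    n = len(ordered)
    stats = []
    for g in range(n_bins):
        # Integer arithmetic keeps the bin sizes equal (or within one object)
        members = ordered[g * n // n_bins:(g + 1) * n // n_bins]
        if not members:
            continue
        p_g = sum(p for p, _ in members) / len(members)  # average predicted probability
        e_g = sum(y for _, y in members) / len(members)  # empirical probability
        stats.append((g + 1, p_g, e_g, len(members)))
    return stats


# Synthetic data: labels are drawn with the predicted probability itself
rng = random.Random(42)
y_prob = [rng.random() for _ in range(1000)]
y_true = [1 if rng.random() < p else 0 for p in y_prob]

bin_stats = equal_size_bins(y_true, y_prob, n_bins=10)
for bin_number, p_g, e_g, size in bin_stats:
    print(f"bin {bin_number:2d}: P_g = {p_g:.3f}, E_g = {e_g:.3f}, objects = {size}")
```

Because the objects are sorted before splitting, P_g never decreases from bin to bin, and every bin holds the same number of objects regardless of how the predictions are distributed.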
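The analysis is based on how well the curves of the average predicted probability and the proportion of observed events align with each other: the closer P_g is to E_g in every bin, the better the calibration. It is also worth checking that E_g grows from bin to bin, starting close to 0 in the first bin and ending close to 1 in the last one.
These graphs help evaluate how well the model predicts probabilities and identify potential issues.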
For instance, in the example above, the curve for a good model is shown: it is concave downward, and the bins are sorted in ascending order. In the first bin, the empirical probability ( E_g ) is close to 0, and in the last bin, it is close to 1, confirming the model's confidence in class separation. However, there are some noticeable issues:
From the first to the seventh bin, there is an underestimation of predicted probabilities, indicating calibration issues for objects with lower class "1" probabilities.
In the fourth bin, the proportion of class "1" objects ( E_g ) is lower than in the previous bins.
This behavior may indicate data anomalies, labeling errors, or feature distribution peculiarities. These bins should be examined separately to understand the cause.
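The Hosmer-Lemeshow statistic is built based on these graphs and is calculated as follows:
$$HL = \sum_{g=1}^{B} \frac{n_g \left(E_g - P_g\right)^2}{P_g \left(1 - P_g\right)}$$
where B is the number of bins and n_g is the number of objects in bin g.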
The Hosmer-Lemeshow statistic is compared with the critical value of the chi-squared distribution with B-2 degrees of freedom. However, the formal test is rarely used in practice: the statistic is more commonly applied as a metric for comparing two models.
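Using the statistic to compare two models can be sketched as follows. The code assumes the standard form of the Hosmer-Lemeshow statistic, the sum over bins of n_g (E_g - P_g)^2 / (P_g (1 - P_g)); the function name and the per-bin numbers for both models are hypothetical.

```python
def hosmer_lemeshow_statistic(bin_stats):
    """bin_stats: iterable of (n_g, P_g, E_g) tuples, one per bin."""
    statistic = 0.0
    for n_g, p_g, e_g in bin_stats:
        variance = p_g * (1 - p_g)
        if variance == 0:
            # A bin with P_g equal to 0 or 1 would divide by zero; this sketch skips it
            continue
        statistic += n_g * (e_g - p_g) ** 2 / variance
    return statistic


# Hypothetical per-bin values (bin size, average prediction, empirical probability)
model_a_bins = [(100, 0.1, 0.12), (100, 0.5, 0.47), (100, 0.9, 0.91)]
model_b_bins = [(100, 0.1, 0.20), (100, 0.5, 0.40), (100, 0.9, 0.80)]

hl_a = hosmer_lemeshow_statistic(model_a_bins)
hl_b = hosmer_lemeshow_statistic(model_b_bins)
print(f"Model A: {hl_a:.2f}")
print(f"Model B: {hl_b:.2f}")
```

The model with the lower value has predicted probabilities that sit closer to the empirical ones. Note that the same gap is penalized more heavily in bins where P_g is near 0 or 1, because the denominator is smaller there.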
Application: Hosmer-Lemeshow curves are an excellent tool for visualizing the quality of probability predictions. They help evaluate how the classifier predicts probabilities and identify problematic areas.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.isotonic import IsotonicRegression
# Generation of data with correlation between predictions and true values
np.random.seed(42)
base_probs = np.random.rand(1000)
y_true = np.random.choice([0, 1], size=1000, p=[0.7, 0.3])
correlated_probs = np.where(y_true == 1, base_probs + 0.3, base_probs - 0.3)
pred_probs = np.clip(correlated_probs, 0, 1)
# Creating a DataFrame
df = pd.DataFrame({
    'y_true': y_true,
    'pred_probs': pred_probs
})
# Applying isotonic regression for calibration
iso_reg = IsotonicRegression(out_of_bounds='clip')
calibrated_probs = iso_reg.fit_transform(df['pred_probs'], df['y_true'])
# Updating data with calibrated predictions
df['pred_probs_calibrated'] = calibrated_probs
# Reordering data by calibrated probability
df_sorted_calibrated = df.sort_values(by='pred_probs_calibrated').reset_index(drop=True)
# Splitting into equal-sized bins (deciles)
df_sorted_calibrated['bin'] = pd.qcut(df_sorted_calibrated.index, q=10, labels=False)
# Counting statistics for each bin
bin_stats_calibrated = df_sorted_calibrated.groupby('bin').agg(
    mean_predicted_prob=('pred_probs_calibrated', 'mean'),  # Average calibrated probability
    count_class_1=('y_true', 'sum'),  # Number of objects in class "1"
    total_count=('y_true', 'count')  # Total number of objects in the bin
).reset_index()
# Adding the proportion of true positive results
bin_stats_calibrated['empirical_prob'] = bin_stats_calibrated['count_class_1'] / bin_stats_calibrated['total_count']
# Plotting the graph for calibrated probabilities
plt.figure(figsize=(12, 7))
# Histogram of empirical probability
plt.bar(bin_stats_calibrated['bin'], bin_stats_calibrated['empirical_prob'], alpha=0.6, label="Empirical probability", color='blue')
# Curve of average calibrated probability.
plt.plot(bin_stats_calibrated['bin'], bin_stats_calibrated['mean_predicted_prob'], marker='o', label="Average predicted probability (calibration)", color='orange')
# Setting up the axes and legend.
plt.xlabel('Bin number')
plt.ylabel('Probability')
plt.title('Average calibrated probability and empirical probability by bins')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
# Printing the per-bin table (a bare expression only displays in a notebook)
print(bin_stats_calibrated)
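Depending on the goals of your analysis, different metrics can be used. However, for a comprehensive understanding of how the model predicts probabilities, it's best to consider all of them. Log Loss is suited to comparing models with each other. ROC-AUC and PR-AUC evaluate how well the model ranks objects, with PR-AUC preferable when one class is much rarer than the other or when a single class is of interest. The Calibration Curve and ECE assess how well the predicted probabilities match the observed frequencies. Hosmer-Lemeshow curves visualize the quality of probability predictions and highlight problematic areas.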
This approach allows for a holistic evaluation of the model, identifying its strengths and weaknesses, and making informed decisions about the quality of your model.