
Population Stability Index in Machine Learning Models

by Varun Nakra, May 2nd, 2024

Too Long; Didn't Read

Population Stability Index (PSI) is a statistical measure commonly used in credit risk modeling and other fields. It's particularly valuable in scenarios where changes in the population distribution could impact the effectiveness of a model. The model performance relies on the closeness between the model development sample and the production data.


Population Stability Index (PSI) is a statistical measure commonly used in credit risk modeling and other fields to assess the stability of a population across different time periods or segments. It is particularly valuable in scenarios where changes in the population distribution could impact the effectiveness of a model.

A fundamental assumption of statistical modeling is that the population to which a model is applied is similar to the population on which it was developed. Model performance relies on the closeness between the model development sample and the production data. Therefore, there is an intrinsic need to compare the two samples (development vs production) and assess whether a model needs to be recalibrated. The population may change for several reasons: changes in business strategies and policies, or changes in external factors such as the political, economic and social environment.

Model performance is affected because of a change in the joint distribution p(y, X), where y is the model outcome and X is the design matrix that comprises the observations of the model features. As a result, the entire model fit, i.e., the estimates of the model coefficients and the inherent functions, is affected. If this happens, a decision must be made on whether to recalibrate or redevelop the model. Before we dive into the mathematical expression for PSI, we start with some key concepts on Entropy and KL divergence.

Entropy and KL Divergence

Entropy is a measure of uncertainty or randomness in a set of data. In machine learning, particularly in decision trees and information theory, entropy is used to quantify the impurity or disorder in a dataset. For instance, the splits of a decision tree for a classification problem are created to minimize the entropy. Entropy is used to determine the best attribute to split the data at each node of the decision tree.

Entropy
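For a discrete variable X taking values x_i, the standard (Shannon) definition of entropy can be written as follows. The notation x_i and the unspecified log base are assumptions made here for illustration; base 2 gives entropy in bits and the natural log gives nats.

```latex
H(X) = -\sum_{i} P(X = x_i) \, \log P(X = x_i)
```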

In the above definition of entropy, the negative sign ensures that entropy is non-negative (the log of a probability is never positive, because a probability is at most 1).

The more the probability mass concentrates on a single value, i.e., the higher P(X = x_i) is for one x_i, the lower the entropy, or randomness or uncertainty, about X. Thus, ‘skewed’ distributions have lower entropy and ‘balanced’ distributions have higher entropy. The objective of splitting a variable at a node in a binary classification problem is to create more skewed distributions so that we can discriminate between the outcome y = 1 and the outcome y = 0. Therefore, we move from a high entropy state before splitting the variable to a lower entropy state after splitting it.
If Entropy H(X) is high, then the uncertainty before knowing X is high, so after knowing X we have gained more information.
If Entropy H(X) is low, then the uncertainty before knowing X is low, so after knowing X we have gained less information.
We try to minimize entropy whilst building binary regression models. For example, we ‘bin’ or ‘discretize’ a model factor such that the distribution of the binary outcome becomes as skewed as possible across the different bins.
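To make the ‘skewed versus balanced’ point concrete, here is a minimal sketch in plain Python. The function names and the example probabilities are illustrative assumptions, not part of any particular library. It compares the entropy of a balanced binary outcome with the weighted entropy after a hypothetical split into two equally sized, more skewed groups.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 are skipped: p * log(p) tends to 0 as p tends to 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def weighted_entropy(groups):
    """Average entropy over groups given as (share_of_observations, probs) pairs."""
    return sum(share * entropy(probs) for share, probs in groups)

# Before the split: outcome y is 50/50, the most uncertain binary case.
before = entropy([0.5, 0.5])

# After a hypothetical split: each half of the data is 90/10 one way or the other.
after = weighted_entropy([(0.5, [0.9, 0.1]), (0.5, [0.1, 0.9])])

print(f"entropy before split: {before:.3f} bits")
print(f"entropy after split:  {after:.3f} bits")
```

The drop from the first value to the second is the information gained by the split; a split that left both groups at 50/50 would gain nothing.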

Kullback-Leibler Divergence (KL Divergence) is referenced in several places in the literature (Lin, 2017), (Wu & Olson, 2010), (Li et al., 2008), etc. It measures the difference between two probability distributions p(x) and q(x), i.e., it quantifies how one probability distribution diverges from another. The distribution p(x) is considered the ‘true’ or ‘base’ distribution, and the distribution q(x) is considered the ‘untrue’ or ‘target’ distribution, so that KL Divergence represents a loss incurred by using the ‘wrong’ distribution.

KL divergence measures the dissimilarity of the two probability distributions p(x) and q(x). It is non-negative, and it is zero if and only if p(x) and q(x) are identical, i.e., the same distribution.

We have a single random variable “X” and we have two different probability distributions for that variable. One is assumed to be the ‘true’ distribution and the other is the approximation of the ‘true’ distribution. So, we try “approximating” one distribution with the other and calculate the difference in the information about the event as a measure of the “divergence”. Note that the notation ‘P||Q’ implies P is approximated as Q.
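In standard notation, and assuming a discrete x (for example, the bins of a score), the divergence can be written with the two information terms grouped in brackets. The use of the natural log is an assumption; any fixed base only rescales the value.

```latex
\begin{aligned}
KL(P \,\|\, Q) &= \sum_{x} p(x) \left[ \ln\frac{1}{q(x)} - \ln\frac{1}{p(x)} \right] \\
               &= \sum_{x} p(x) \, \ln\frac{p(x)}{q(x)}
\end{aligned}
```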
Notice how the mathematical expressions for the individual terms of KL divergence are similar to that of Entropy, or information, which we introduced earlier. The second term inside the brackets is the information in p(x) about x. The first term inside the brackets is the information about x in q(x), the approximation of p(x). That is, if we approximate the true distribution p(x) by q(x), then the difference in the information between them is the relative entropy, or KL Divergence. The better our approximation, the closer p(x) and q(x) are, and hence the closer the KL divergence is to 0.
Even though KL Divergence measures the divergence of q(x) from p(x), it is technically not a distance measure because the definition is not symmetric. Thus, in general, KL(P||Q) != KL(Q||P).
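The asymmetry is easy to check numerically. The sketch below is a plain-Python illustration with made-up distributions; `kl_divergence` is a hypothetical helper, not a library function, and it assumes both inputs are probability lists over the same outcomes.

```python
import math

def kl_divergence(p, q):
    """KL(P||Q) = sum_x p(x) * ln(p(x) / q(x)) for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("p and q must cover the same outcomes")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0:
            continue          # 0 * ln(0 / q) is taken as 0
        if q_x == 0:
            return math.inf   # p puts mass where q has none
        total += p_x * math.log(p_x / q_x)
    return total

p = [0.9, 0.1]   # 'true' distribution (illustrative)
q = [0.5, 0.5]   # approximation (illustrative)

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(f"KL(P||Q) = {kl_pq:.4f}")   # approximately 0.3681
print(f"KL(Q||P) = {kl_qp:.4f}")   # approximately 0.5108
```

Swapping the roles of the two distributions changes the answer, which is why KL divergence on its own does not behave like a distance.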

Population Stability Index

Now that we have covered the mathematical expressions of Entropy and KL Divergence Index, we will see how they would help us in understanding the expression of Population Stability Index.
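In the same notation as the KL divergence, and again assuming discrete bins x and the natural log, PSI is commonly written as below. Which of the two distributions is labelled p (development) and which q (production) is an assumption here; the value is the same either way.

```latex
PSI(P, Q) = \sum_{x} \bigl( p(x) - q(x) \bigr) \left[ \ln\frac{1}{q(x)} - \ln\frac{1}{p(x)} \right]
          = \sum_{x} \bigl( p(x) - q(x) \bigr) \, \ln\frac{p(x)}{q(x)}
```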

Notice that the PSI expression is the KL divergence expression with one additional q(x) term: the weight p(x) on each log term becomes p(x) - q(x). The formulae look so similar that we can think of expressing PSI as a function of KL Divergence. Here is how we can do that.
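A short derivation, under the same assumptions (discrete x, natural log), splits the sum into its p-weighted and q-weighted parts and flips the sign of the log in the second part:

```latex
\begin{aligned}
PSI(P, Q) &= \sum_{x} \bigl( p(x) - q(x) \bigr) \, \ln\frac{p(x)}{q(x)} \\
          &= \sum_{x} p(x) \, \ln\frac{p(x)}{q(x)} \;-\; \sum_{x} q(x) \, \ln\frac{p(x)}{q(x)} \\
          &= \sum_{x} p(x) \, \ln\frac{p(x)}{q(x)} \;+\; \sum_{x} q(x) \, \ln\frac{q(x)}{p(x)} \\
          &= KL(P \,\|\, Q) + KL(Q \,\|\, P)
\end{aligned}
```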

That is, the PSI metric is the sum of the relative entropies of q(x) with respect to p(x) and of p(x) with respect to q(x). In other words, we are adding the information ‘gain’ from expressing p(x) using an approximation q(x) and vice versa. We do this because we don’t know the ‘ground truth’. We assume that one of them is a ‘true’ distribution and the other is an ‘approximation’; and because KL divergence is not symmetric, we compute it from p(x) to q(x) and vice versa and add the two to get a final, symmetric metric that can be used to judge the difference between p(x) and q(x).

The PSI calculation typically begins by dividing the data into bins or segments based on key variables used in the model.
PSI values close to zero indicate stability between the initial and new populations.
Higher PSI values suggest significant shifts in the population distribution, which may require further investigation or model recalibration.
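The binning and summation steps can be sketched in plain Python as follows. The helper names, the cut points, the toy score lists and the small `eps` floor for empty bins are all assumptions made for illustration; in practice the bins are usually fixed on the development sample (for example, by deciles) and reused for production.

```python
import math
from bisect import bisect_right

def bin_proportions(values, edges):
    """Share of values falling in each bin; `edges` are sorted interior cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]

def psi(expected, actual, eps=1e-6):
    """PSI = sum over bins of (a - e) * ln(a / e), on bin proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        # Floor empty bins at eps so the log stays finite (a common workaround).
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

# Toy data: development scores 0..99, production scores shifted up by 10.
dev_scores = list(range(100))
prod_scores = [s + 10 for s in dev_scores]
edges = [25, 50, 75]   # four bins, chosen on the development sample

dev_share = bin_proportions(dev_scores, edges)    # 25% of scores in each bin
prod_share = bin_proportions(prod_scores, edges)  # mass moves from the lowest to the highest bin
value = psi(dev_share, prod_share)
print(f"PSI = {value:.4f}")   # approximately 0.0847
```

Because each bin contributes (a - e) * ln(a / e), which is never negative, a shift in either direction pushes PSI up, and swapping the two samples leaves the value unchanged.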

So how high should PSI be to make that judgement? The following table summarizes the PSI values which are ‘industry wide rules of thumb’.
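A commonly quoted version of these rules of thumb treats a PSI below 0.1 as no significant shift, a PSI between 0.1 and 0.25 as a moderate shift worth investigating, and a PSI above 0.25 as a significant shift. The exact cut-offs vary between institutions, so the thresholds in the sketch below are assumptions exposed as parameters, and `interpret_psi` is a hypothetical helper rather than a standard function.

```python
def interpret_psi(value, moderate=0.10, significant=0.25):
    """Map a PSI value to a rule-of-thumb band; thresholds are conventions, not laws."""
    if value < 0:
        raise ValueError("PSI cannot be negative")
    if value < moderate:
        return "no significant shift"
    if value < significant:
        return "moderate shift: investigate"
    return "significant shift: consider recalibration or redevelopment"

for example in (0.03, 0.15, 0.40):
    print(f"PSI = {example:.2f} -> {interpret_psi(example)}")
```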

Interestingly, PSI is very similar to ‘Information Value’ used to measure the strength of the modelling variables in developing credit risk models. We will explore that soon in a different article.