Distribution of Predicted Probability and Decile Chart

Learn about visualizing model performance with predicted probability distribution and decile chart.

Model Performance Analysis with ROC AUC

The ROC AUC metric is helpful because it provides a single number that summarizes model performance on a dataset. However, it's also insightful to look at model performance for different subsets of the population. One way to break up the population into subsets is to use the model predictions themselves. Using the test set, we can visualize the predicted probabilities with a histogram:

import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rcParams['figure.dpi'] = 400  # render a high-resolution figure
plt.hist(test_set_pred_proba, bins=50)
plt.xlabel('Predicted probability')
plt.ylabel('Number of samples')

This code should produce the following plot:

Figure: Distribution of predicted probabilities for the test set

The histogram of predicted probabilities for the test set shows that most predictions are clustered in the range [0, 0.2]. In other words, according to the model, most borrowers have a probability of default between 0% and 20%. However, there appears to be a small cluster of higher-risk borrowers, centered near 0.7.
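To put a number on what the histogram shows, one option is to compute the share of predictions at or below 0.2 directly. The sketch below is only illustrative: `synthetic_pred_proba` is a made-up stand-in for `test_set_pred_proba`, drawn from two beta distributions chosen to loosely mimic the shape described above, so its numbers say nothing about the course's model.

```python
import numpy as np

# Synthetic stand-in for test_set_pred_proba (an assumption, not the course data)
rng = np.random.default_rng(seed=0)
synthetic_pred_proba = np.concatenate([
    rng.beta(a=2, b=15, size=900),   # bulk of low-risk predictions
    rng.beta(a=14, b=6, size=100),   # smaller high-risk cluster near 0.7
])

# Share of predictions at or below 0.2
frac_low = np.mean(synthetic_pred_proba <= 0.2)

# Bin counts for 50 equal-width bins spanning the data range,
# which is what plt.hist(..., bins=50) draws
counts, bin_edges = np.histogram(synthetic_pred_proba, bins=50)

print(f"Share of predictions <= 0.2: {frac_low:.2f}")
```

Applying the same two calls to `test_set_pred_proba` would give the corresponding figures for the actual test set.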

A visually intuitive way to examine model performance for different regions of predicted default risk is to create a decile chart, which groups borrowers together based on the decile of predicted probability. Within each decile, we can compute the true default rate. We would expect to see a steady increase in the default rate from the lowest prediction deciles to the highest.

We can compute deciles like we did in Exercise: Randomized Grid Search to Tune XGBoost Hyperparameters, using pandas' qcut:

deciles, decile_bin_edges = pd.qcut(x=test_set_pred_proba,
                                    q=10,
                                    retbins=True)

Here we are splitting the predicted probabilities for the test set, supplied with the x keyword argument. We want to split them into ten equal-sized bins, with the bottom 10% of predicted probabilities in the first bin and so on, so we indicate we want q=10 quantiles. However, you can split into any number of bins you want, such as 20 (ventiles) or 5 (quintiles). Because we indicate retbins=True, the bin edges are returned in the decile_bin_edges variable, while the series of decile labels is in deciles. We can examine the 11 bin edges needed to create ten bins:

decile_bin_edges

That should produce this:

array([0.02213463, 0.06000734, 0.08155108, 0.10424594, 0.12708404,
       0.15019046, 0.18111563, 0.23032923, 0.32210371, 0.52585585,
       0.89491451])
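As a quick check on how `pd.qcut` behaves, the following sketch runs it on synthetic data and confirms that `q=10` yields 11 bin edges and bins of nearly equal size. The array `synthetic_pred_proba` and the `demo_` names are hypothetical and used only for illustration; they are not part of the course's dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, continuous values so that there are no ties (an assumption)
rng = np.random.default_rng(seed=1)
synthetic_pred_proba = rng.uniform(0, 1, size=1000)

demo_deciles, demo_bin_edges = pd.qcut(x=synthetic_pred_proba,
                                       q=10,
                                       retbins=True)

# Ten bins need eleven edges
n_edges = len(demo_bin_edges)

# Count the samples in each decile bin, in bin order
bin_counts = pd.Series(demo_deciles).value_counts(sort=False)

print(n_edges)
print(bin_counts)
```

With real model output, many identical predicted probabilities can make bin sizes uneven or produce duplicate edges, so it is worth inspecting the counts rather than assuming they are equal.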

In order to make use of the decile series, we can combine it with the true labels for the test set and the predicted probabilities into a DataFrame:

test_set_df = pd.DataFrame({'Predicted probability': test_set_pred_proba,
                            'Prediction decile': deciles,
                            'Outcome': y_test_all})
test_set_df.head()

The first few rows of the DataFrame should look like this:

Figure: DataFrame with predicted probabilities and deciles

In the DataFrame, we can see that each sample is labeled with a decile bin, indicated by the edges of the bin that contains the predicted probability. The outcome shows the true label. What we want to show in our decile chart is the true default rate within each decile bin. For this, we can use pandas' groupby capabilities. First, we create a groupby object by grouping our DataFrame on the decile column:

test_set_gr = test_set_df.groupby('Prediction decile')

The groupby object can be aggregated by other columns. In particular, here, we're interested in the default rate within decile bins, which is the mean of the outcome variable. We also calculate a count of the data in each bin. Because quantiles, such as deciles, group the population into equal-sized bins, we expect the counts ...
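The aggregation described above can be sketched on synthetic data as follows. This is a minimal illustration under stated assumptions, not necessarily the code the lesson goes on to use: the `demo_` and `synthetic_` names are hypothetical, the outcomes are simulated from a perfectly calibrated model, and `observed=True` is a choice made here so that only bins present in the data are returned.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)
n_samples = 2000

# Synthetic predicted probabilities and outcomes (assumptions, not course data):
# each outcome is drawn as 1 with probability equal to its prediction
synthetic_pred_proba = rng.uniform(0, 1, size=n_samples)
synthetic_outcome = rng.binomial(n=1, p=synthetic_pred_proba)

demo_df = pd.DataFrame({
    'Predicted probability': synthetic_pred_proba,
    'Prediction decile': pd.qcut(synthetic_pred_proba, q=10),
    'Outcome': synthetic_outcome,
})

# Group on the decile column, then compute the count and the mean outcome
# (the observed default rate) within each bin
demo_gr = demo_df.groupby('Prediction decile', observed=True)
decile_summary = demo_gr['Outcome'].agg(['count', 'mean'])

print(decile_summary)
```

In this sketch, the `count` column should be the same for every decile and the `mean` column should be much higher in the top decile than in the bottom one, which is the pattern a decile chart is meant to reveal.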
