The Receiver Operating Characteristic (ROC) Curve
Learn about the significance of the ROC curve and the area under it.
Deciding on a threshold for a classifier is a question of finding the “sweet spot” where we recover enough true positives without incurring too many false positives. As the threshold is lowered, there will be more of both. A good classifier captures more true positives without paying for them with a large number of false positives. What would be the effect of lowering the threshold even more, with the predicted probabilities from the previous exercise? There is a classic visualization method in machine learning, with a corresponding metric, that can help answer this kind of question.
Understanding the ROC curve
The receiver operating characteristic (ROC) curve is a plot of the pairs of TPRs (y-axis) and FPRs (x-axis) that result from lowering the threshold down from 1 all the way to 0. You can imagine that if the threshold is 1, there are no positive predictions because a logistic regression only predicts probabilities strictly between 0 and 1 (endpoints not included). Because there are no positive predictions, the TPR and the FPR are both 0, so the ROC curve starts out at (0, 0).
As the threshold is lowered, the TPR will start to increase, hopefully faster than the FPR if it’s a good classifier. Eventually, when the threshold is lowered all the way to 0, every sample is predicted to be positive, including all the samples that are, in fact, positive, but also all the samples that are actually negative. This means the TPR is 1 but the FPR is also 1. In between these two extremes are the reasonable options for where you may want to set the threshold, depending on the relative costs and benefits of true and false positives and negatives for the specific problem being considered. In this way, it is possible to get a complete picture of the performance of the classifier at all different thresholds to decide which one to use.
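To make the threshold sweep described above concrete, here is a minimal sketch that computes the TPR and FPR by hand at a few thresholds. The labels and probabilities are made-up toy values, not the data from the exercise, and the helper name tpr_fpr_at is hypothetical; a sample is treated as a positive prediction when its probability is at or above the threshold.

```python
import numpy as np

# Hypothetical toy data (not from the lesson): true labels and predicted
# probabilities of the positive class for eight samples.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
proba = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

def tpr_fpr_at(threshold, y_true, proba):
    """Return (TPR, FPR) when samples with proba >= threshold are predicted positive."""
    y_pred = proba >= threshold
    positives = y_true == 1
    negatives = y_true == 0
    tpr = np.sum(y_pred & positives) / np.sum(positives)
    fpr = np.sum(y_pred & negatives) / np.sum(negatives)
    return tpr, fpr

# Lower the threshold from 1 to 0 in steps of 0.25.
thresholds = np.linspace(1, 0, 5)
roc_points = [tpr_fpr_at(t, y_true, proba) for t in thresholds]

for t, (tpr, fpr) in zip(thresholds, roc_points):
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

With this toy data, the sweep starts at a TPR and FPR of 0 for a threshold of 1 and ends with both equal to 1 for a threshold of 0, matching the two extremes described above.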
We could write the code to determine the TPRs and FPRs of the ROC curve by using the predicted probabilities and varying the threshold from 1 to 0. Instead, we will use scikit-learn’s convenient functionality, which will take the true labels and predicted probabilities as inputs and return arrays of TPRs, FPRs, and the thresholds that lead to them. We will then plot the TPRs against the FPRs to show the ROC curve. Run this code to use scikit-learn to generate the arrays of TPRs and FPRs for the ROC curve, importing the metrics module if needed:
from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(y_test, pos_proba)
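If y_test and pos_proba from the previous exercise are not at hand, the same call can be tried on a small stand-in dataset. The sketch below uses four made-up samples (the names y_toy and proba_toy are hypothetical) to show the shape of what roc_curve returns. Note that the first returned threshold is a placeholder larger than every score, and its exact value may depend on the scikit-learn version.

```python
import numpy as np
from sklearn import metrics

# Hypothetical toy data standing in for y_test and pos_proba.
y_toy = np.array([0, 0, 1, 1])
proba_toy = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns FPRs, TPRs, and the thresholds that produce them,
# ordered from the highest threshold to the lowest.
fpr_toy, tpr_toy, thresholds_toy = metrics.roc_curve(y_toy, proba_toy)

for f, t, th in zip(fpr_toy, tpr_toy, thresholds_toy):
    print(f"threshold={th}  FPR={f:.2f}  TPR={t:.2f}")
```

The first pair is (0, 0) and the last is (1, 1), which are the two endpoints of the ROC curve.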
Now we need to produce a plot. We’ll use plt.plot, which will make a line plot using the first argument as the x values (FPRs), the second argument as the y values (TPRs), and the shorthand '*-' to indicate a line plot with star symbols where the data points are located. We add a straight-line plot from (0, 0) to (1, 1), which will appear in red ('r') and as a dashed line ('--'). We’ve also given the plot a legend (which we’ll explain shortly), as well as axis labels and a title. This code produces the ROC plot:
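import matplotlib.pyplot as plt  # only needed if not already imported

# ROC curve: FPRs on the x-axis, TPRs on the y-axis, stars at each threshold
plt.plot(fpr, tpr, '*-')
# Dashed red diagonal from (0, 0) to (1, 1) as the random-chance reference
plt.plot([0, 1], [0, 1], 'r--')
plt.legend(['Logistic regression', 'Random chance'])
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')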
And the plot should look like this: