Binary classification metrics with logistic regression and near-default options

We now fit an example model to illustrate binary classification metrics. We will continue to use logistic regression with near-default options. The following code imports the model class and creates a model object.

from sklearn.linear_model import LogisticRegression 

example_lr = LogisticRegression(C=0.1, class_weight=None, 
                                dual=False, fit_intercept=True,
                                intercept_scaling=1, max_iter=100, 
                                multi_class='auto', n_jobs=None, 
                                penalty='l2', random_state=None, 
                                solver='liblinear', tol=0.0001, 
                                verbose=0, warm_start=False)
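
To see what "near-default" means in practice, the following is a minimal sketch that compares a compact constructor call against a model created with no arguments. The names default_lr, compact_lr, and changed are ours, introduced only for illustration. It assumes that every other argument spelled out above matches the library default on the installed scikit-learn version, which may not hold for all releases.

```python
from sklearn.linear_model import LogisticRegression

# A model with all-default options, and a compact version of the example
# model that sets only C and solver (illustrative names, not from the text).
default_lr = LogisticRegression()
compact_lr = LogisticRegression(C=0.1, solver='liblinear')

default_params = default_lr.get_params()
compact_params = compact_lr.get_params()

# Collect every parameter whose value differs from the default,
# as a (default value, example value) pair.
changed = {
    name: (default_params[name], compact_params[name])
    for name in default_params
    if default_params[name] != compact_params[name]
}
print(changed)
```

If the assumption holds, only C and solver should appear in the result. Newer scikit-learn releases deprecate some of the keyword arguments written out above, such as multi_class, so the compact form may also be the more portable one.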

Next, we train the model on the labeled data from our training set. We then immediately use the trained model to predict labels from the features of the samples in the held-out test set:

example_lr.fit(X_train, y_train)
# LogisticRegression(C=0.1, solver='liblinear')
y_pred = example_lr.predict(X_test)
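
It can help to see what .predict returns for a binary problem. The sketch below uses synthetic data from make_classification rather than the case study data, and the names X_toy, y_toy, toy_lr, scores, and manual_pred are hypothetical. As far as we understand the library's behavior, the predicted label is the second entry of classes_ whenever the linear decision score is positive, and the first entry otherwise; the code checks this assumption rather than taking it for granted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the case study data).
X_toy, y_toy = make_classification(n_samples=200, n_features=5,
                                   random_state=0)

toy_lr = LogisticRegression(C=0.1, solver='liblinear')
toy_lr.fit(X_toy, y_toy)

# Signed linear scores; positive values favor the class toy_lr.classes_[1].
scores = toy_lr.decision_function(X_toy)

# Rebuild the predicted labels by hand from the sign of the score.
manual_pred = toy_lr.classes_[(scores > 0).astype(int)]

# Check that the hand-built labels match what .predict returns.
same = np.array_equal(manual_pred, toy_lr.predict(X_toy))
print(same)
```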

Understanding the limitations of accuracy

We’ve stored the model-predicted labels of the test set in a variable called y_pred. How should we now assess the quality of these predictions? The true labels are available in the y_test variable, so we can compare the two. First, we will compute what is probably the simplest of all binary classification metrics: accuracy. Accuracy is defined as the proportion of samples that were correctly classified.

One way to calculate accuracy is to create a logical mask that is True whenever the predicted label is equal to the actual label, and False otherwise. We can then take the average of this mask, which will interpret True as 1 and False as 0, giving us the proportion of correct classifications:

import numpy as np
is_correct = y_pred == y_test
np.mean(is_correct)
# 0.7834239639977498
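
To make the mask calculation concrete, here is a small sketch on hand-made labels. The arrays y_true_toy and y_pred_toy are invented for illustration and are not part of the case study data.

```python
import numpy as np

# Five made-up samples: the prediction is wrong only for the third one.
y_true_toy = np.array([0, 1, 1, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1])

# Element-wise comparison gives a Boolean mask of correct predictions.
is_correct_toy = y_pred_toy == y_true_toy

# Averaging treats True as 1 and False as 0: 4 correct out of 5 samples.
accuracy_toy = np.mean(is_correct_toy)
print(is_correct_toy)
print(accuracy_toy)
```

Four of the five toy predictions match the true labels, so the mean of the mask is 4/5.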

This indicates that the model classifies about 78% of the test samples correctly. While this is a pretty straightforward calculation, there are even easier ways to calculate accuracy using the convenience of scikit-learn. One way is to use the trained model’s .score method, passing the features of the test data to make predictions on, as well as the test labels. This method makes the predictions and then performs the same calculation as the mask average, all in one step. Alternatively, we could import scikit-learn’s metrics module, which includes many model performance metrics, such as accuracy_score. For this, we pass the true labels and the predicted labels:

example_lr.score(X_test, y_test)
# 0.7834239639977498
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred)
# 0.7834239639977498
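
The three routes to accuracy can be checked against each other end to end. The following is a sketch under stated assumptions: it uses synthetic data from make_classification in place of the case study data, and all variable names (X_demo, demo_lr, acc_mask, and so on) are hypothetical.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption, for illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=5,
                                     random_state=24)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=24)

# Same near-default options as the example model in the text.
demo_lr = LogisticRegression(C=0.1, solver='liblinear')
demo_lr.fit(X_tr, y_tr)
y_te_pred = demo_lr.predict(X_te)

# Route 1: average of the Boolean mask of correct predictions.
acc_mask = np.mean(y_te_pred == y_te)
# Route 2: the model's .score method, which predicts and scores in one step.
acc_score_method = demo_lr.score(X_te, y_te)
# Route 3: accuracy_score, given true labels first and predictions second.
acc_metrics = metrics.accuracy_score(y_te, y_te_pred)

print(acc_mask, acc_score_method, acc_metrics)
```

All three values should agree, since each one is the fraction of test samples whose predicted label equals the true label.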
