The Response Variable and Concluding the Initial Exploration

We have now looked through all the features to see whether any data is missing, as well as to examine them generally. The features are important because they constitute the inputs to our machine learning algorithm. On the other side of the model lies the output, which is a prediction of the response variable. For our problem, this is a binary flag indicating whether or not a credit account will default next month.

Binary classification and proportions of classes

The key task for the case study project is to come up with a predictive model for this target. Because the response variable is a yes/no flag, this problem is called a binary classification task. In our labeled data, the samples (accounts) that defaulted (that is, 'default payment next month' = 1) are said to belong to the positive class, while those that didn’t belong to the negative class.

The main piece of information to examine regarding the response of a binary classification problem is this: what is the proportion of the positive class? This is an easy check.

Before we perform this check, we load the packages we need with the following code:

import numpy as np #numerical computation
import pandas as pd #data wrangling
import matplotlib.pyplot as plt #plotting package
#Next line helps with rendering plots
%matplotlib inline
import matplotlib as mpl #additional plotting functionality
mpl.rcParams['figure.dpi'] = 400 #high res figures

Now we load the cleaned version of the case study data like this:

df = pd.read_csv('Chapter_1_cleaned_data.csv')

Now, to find the proportion of the positive class, all we need to do is get the average of the response variable over the whole dataset. This has the interpretation of the default rate. It’s also worthwhile to check the number of samples in each class, using groupby and count in pandas. This is presented as follows:

Get hands-on with 1300+ tech skills courses.