Examining Relationships Between Features and Response Variable
Learn to examine the relationship between features and response variables.
To make accurate predictions of the response variable, good features are necessary. We need features that are clearly linked to the response variable in some way. Thus far, we’ve examined the relationship between a couple of features and the response variable, either by calculating the `groupby`/`mean` of a feature and the response variable, or by using individual features in a model and examining performance. However, we have not yet systematically explored how all the features relate to the response variable. We will do that now and begin to capitalize on the hard work we put in when exploring the features and verifying the data quality.
Using correlation for exploring feature relations
A popular way of getting a quick look at how all the features relate to the response variable, as well as how the features are related to each other, is by using a correlation plot. We will first create a correlation plot for the case study data, then discuss how to interpret it, along with some mathematical details.
In order to create a correlation plot, the necessary inputs include all features that we plan to explore, as well as the response variable. Because we are going to use most of the column names from the DataFrame for this, a quick way to get the appropriate list in Python is to start with all the column names and remove those that we don’t want. As a preliminary step, we start a new notebook for this section and load packages and the cleaned data from the “Data Exploration and Cleaning” chapter, with this code:
import numpy as np #numerical computation
import pandas as pd #data wrangling
import matplotlib.pyplot as plt #plotting package
#Next line helps with rendering plots
%matplotlib inline
import matplotlib as mpl #add'l plotting functionality
import seaborn as sns #a fancy plotting package
mpl.rcParams['figure.dpi'] = 400 #high res figures
df = pd.read_csv('Chapter_1_cleaned_data.csv')
Notice that this notebook starts out in a very similar way to the previous section’s notebook, except we also import the Seaborn package, which has many convenient plotting features that build on Matplotlib. Now let’s make a list of all the columns of the DataFrame and look at the first and last five:
features_response = df.columns.tolist()
features_response[:5]
# ['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE']
features_response[-5:]
# ['EDUCATION_CAT', 'graduate school', 'high school', 'others', 'university']
Filtering features using list comprehension
Recall that we are not using the gender variable due to ethical concerns, and we learned that `PAY_2`, `PAY_3`, …, `PAY_6` are incorrect and should be ignored. Also, we are not going to examine the one-hot encoding we created from the `EDUCATION` variable, because the information from those columns is already included in the original feature, at least in some form. We will just use the `EDUCATION` feature directly. Finally, it makes no sense to use `ID` as a feature, because this is simply a unique account identifier and has nothing to do with the response variable. Let’s make another list of column names that are neither features nor the response. We want to exclude these from our analysis:
items_to_remove = ['ID', 'SEX', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6',
                   'EDUCATION_CAT', 'graduate school', 'high school', 'none',
                   'others', 'university']
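Before filtering, it can be worth confirming that every name in the removal list actually matches a column, since a misspelled name would silently leave an unwanted column in place. The following is a minimal sketch of such a check; `all_columns` and `to_remove` are small hypothetical stand-ins (including a deliberately misspelled `'PAY_02'`), not the real case study lists.

```python
# Hypothetical stand-in for df.columns.tolist(); the real list is longer
all_columns = ['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'PAY_1', 'PAY_2']

# Hypothetical removal list containing one misspelled name ('PAY_02')
to_remove = ['ID', 'SEX', 'PAY_02']

# Names we asked to remove that do not exist as columns
unmatched = sorted(set(to_remove) - set(all_columns))
print(unmatched)  # ['PAY_02']
```

An empty result would suggest that every name in the removal list corresponds to a real column.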
To have a list of column names that consists only of the features and response we will use, we want to remove the names in `items_to_remove` from the current list contained in `features_response`. There are several ways to do this in Python. We will use this opportunity to learn about a particular way of building a list in Python, called a list comprehension. When people talk about certain constructions as being Pythonic, or idiomatic to the Python language, list comprehensions are often one of the things that are mentioned.
What is a list comprehension? Conceptually, it is basically the same as a `for` loop. However, a list comprehension lets you write in one line a list-building operation that might be spread across several lines in an explicit `for` loop. List comprehensions are also often slightly faster than an equivalent `for` loop that appends to a list, due to optimizations within Python. While this likely won’t save us much time here, this is a good chance to become familiar with them. Here is an example list comprehension:
example_list_comp = [item for item in range(5)]
example_list_comp
# [0, 1, 2, 3, 4]
That’s all there is to it.
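To make the comparison with a `for` loop concrete, here is a short sketch that builds the same list both ways. The name `example_list_loop` is introduced only for illustration.

```python
# Build the list with an explicit for loop
example_list_loop = []
for item in range(5):
    example_list_loop.append(item)

# The one-line list comprehension from the text
example_list_comp = [item for item in range(5)]

# Both approaches produce the same list
print(example_list_loop == example_list_comp)  # True
```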
We can also use additional clauses to make the list comprehensions flexible. For example, we can use them to reassign the `features_response` variable with a list containing everything that’s not in the list of strings we wish to remove:
features_response = [item for item in features_response if item not in items_to_remove]
features_response
['LIMIT_BAL',
'EDUCATION',
'MARRIAGE',
'AGE',
'PAY_1',
'BILL_AMT1',
'BILL_AMT2',
'BILL_AMT3',
'BILL_AMT4',
'BILL_AMT5',
'BILL_AMT6',
'PAY_AMT1',
'PAY_AMT2',
'PAY_AMT3',
'PAY_AMT4',
'PAY_AMT5',
'PAY_AMT6',
'default payment next month']
The use of `if` and `not in` within the list comprehension is fairly self-explanatory. Easy readability in structures such as list comprehensions is one of the reasons for the popularity of Python.
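For comparison, the same filtering can be written as an explicit loop. The sketch below uses shortened, illustrative versions of the two lists rather than the full case study column names, and the names `kept_loop` and `kept_comp` are hypothetical.

```python
# Shortened, illustrative versions of the lists in the text
features_response = ['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'PAY_1', 'PAY_2']
items_to_remove = ['ID', 'SEX', 'PAY_2']

# Explicit loop: keep each name unless it is in the removal list
kept_loop = []
for item in features_response:
    if item not in items_to_remove:
        kept_loop.append(item)

# Equivalent list comprehension with an if clause
kept_comp = [item for item in features_response if item not in items_to_remove]

print(kept_comp)               # ['LIMIT_BAL', 'EDUCATION', 'PAY_1']
print(kept_loop == kept_comp)  # True
```

Both versions preserve the original order of the column names, which matters if later steps rely on the response variable staying at the end of the list.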
> Note: The Python documentation describes a list comprehension as follows:
>
> “A list comprehension consists of brackets containing an expression followed by a for clause, then zero or more for or if clauses.”
>
> Therefore, list comprehensions can enable you to do things with less code, in a way that is usually pretty readable and understandable.
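
As a small illustration of the "zero or more for or if clauses" part of that definition, the sketch below combines two `for` clauses with an `if` clause. The variable name `pairs` is chosen only for this example.

```python
# Two for clauses and one if clause in a single comprehension:
# collect every (x, y) pair from 0..2 where x is less than y
pairs = [(x, y) for x in range(3) for y in range(3) if x < y]
print(pairs)  # [(0, 1), (0, 2), (1, 2)]
```

The clauses read left to right in the same order as the equivalent nested `for` loops would be written.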