Exercise: Visualizing the Feature and Response Variable Relationship
Learn how to visualize the relationship between the features and response variable.
Relationship between features and response variable
In this exercise, you will build on the Matplotlib plotting functions you used earlier in this course. You’ll learn how to customize graphics to better answer specific questions with the data. As you pursue these analyses, you will create insightful visualizations of how the PAY_1 and LIMIT_BAL features relate to the response variable, which may support the hypotheses you formed about these features. You will do this by becoming more familiar with the Matplotlib application programming interface (API), in other words, the syntax you use to interact with Matplotlib. Perform the following steps to complete the exercise:
- Calculate a baseline for the response variable, the default rate across the whole dataset, using pandas’ mean():
  overall_default_rate = df['default payment next month'].mean()
  overall_default_rate
The output of this should be the following:
# 0.2217971797179718
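Why does mean() give a default rate? The response variable is coded as 0 or 1, so its mean is the fraction of rows equal to 1. The following is a minimal sketch of that idea on a small made-up DataFrame; the name toy and its values are assumptions for illustration only and are not part of the case study data.

```python
import pandas as pd

# Hypothetical toy data: 1 means the account defaulted, 0 means it did not.
toy = pd.DataFrame(
    {'default payment next month': [0, 1, 0, 0, 1, 0, 0, 0]})

# The mean of a 0/1 column is the share of rows equal to 1.
toy_default_rate = toy['default payment next month'].mean()

# Cross-check: count the defaults and divide by the number of rows.
manual_rate = toy['default payment next month'].sum() / len(toy)

print(toy_default_rate, manual_rate)  # prints 0.25 0.25
```

Two of the eight made-up accounts default, so both calculations give 0.25. The same reasoning is why the value computed on the real data can be read as the overall default rate.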
What would be a good way to visualize default rates for different values of the PAY_1 feature?

Recall our observation that this feature is something of a hybrid between a categorical and a numerical feature. We’ll choose to plot it in a way that is typical for categorical features, because it has a relatively small number of unique values. In the chapter “Data Exploration and Cleaning,” we used value_counts on this feature as part of data exploration, and later we learned about groupby/mean when looking at the EDUCATION feature. A groupby/mean aggregation is a good way to get the default rate for each payment status here as well, which we can then visualize.

- Use this code to create a groupby/mean aggregation:
  group_by_pay_mean_y = df.groupby('PAY_1').agg(
      {'default payment next month':np.mean})
  group_by_pay_mean_y
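The output should be a DataFrame indexed by the values of PAY_1, with a single column holding the mean of default payment next month (the default rate) for each payment status.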
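To make the aggregation and its comparison with the baseline concrete, here is a minimal sketch on a small made-up dataset. The toy values, the toy_ variable names, and the above_baseline column are assumptions for illustration, not results from the case study data. The sketch passes the string 'mean' to agg() instead of np.mean; both compute the group mean, and the string form does not need a NumPy import.

```python
import pandas as pd

# Hypothetical toy data; the PAY_1 values and default labels are made up.
toy = pd.DataFrame({
    'PAY_1': [-1, -1, 0, 0, 0, 0, 2, 2],
    'default payment next month': [0, 0, 0, 0, 1, 0, 1, 1],
})

# Baseline: default rate over all rows of the toy data.
toy_overall_rate = toy['default payment next month'].mean()

# Default rate within each PAY_1 group, with the aggregation passed by name.
toy_rate_by_pay = toy.groupby('PAY_1').agg(
    {'default payment next month': 'mean'})

# Flag the payment statuses whose default rate exceeds the baseline.
toy_rate_by_pay['above_baseline'] = (
    toy_rate_by_pay['default payment next month'] > toy_overall_rate)

print(toy_overall_rate)
print(toy_rate_by_pay)
```

In this toy example, only the group with PAY_1 equal to 2 has a default rate above the overall rate. Comparing each group against the baseline in this way is the kind of question a plot of the grouped default rates can help answer.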