Exploring Categorical Quantities
This lesson focuses on exploring relationships between the categorical variables in the dataset, with examples.
Exploratory Data Analysis is about uncovering relationships in a dataset that are hidden or hard to spot just by looking at the raw data. We will explore these kinds of relationships in the Default of Credit Card Clients Dataset, using the cleaned version of the dataset from the lesson Inconsistent Data. The details of the individual columns are listed below.
# Default of Credit Card Clients Dataset
# There are 25 variables:
# ID: ID of each client
# LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
# GENDER: Gender (male, female)
# EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others)
# MARRIAGE: Marital status (married, single, others)
# AGE: Age in years
# PAY_1: Repayment status in September, 2005 (0=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
# PAY_2: Repayment status in August, 2005 (scale same as above)
# PAY_3: Repayment status in July, 2005 (scale same as above)
# PAY_4: Repayment status in June, 2005 (scale same as above)
# PAY_5: Repayment status in May, 2005 (scale same as above)
# PAY_6: Repayment status in April, 2005 (scale same as above)
# BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
# BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
# BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
# BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
# BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
# BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
# PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
# PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
# PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
# PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
# PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
# PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
# default.payment.next.month: Default payment (yes, no)
More specifically, we are interested in finding out how the variable default.payment.next.month is affected by other variables.
Grouping
As we saw in Chapter 3 of this course, grouping data can give us very useful insights. Let’s see how the categorical variables GENDER, EDUCATION, and MARRIAGE are related to default.payment.next.month.
GENDER
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('credit_card_cleaned.csv')

# Group data
grouped_df = df.groupby(['GENDER', 'default.payment.next.month']).size()
grouped_df = grouped_df.unstack()
print(grouped_df)

# Plot
grouped_df.plot(kind='bar')

# Calculate probabilities
grouped_df['prob_default'] = grouped_df['yes'] / (grouped_df['no'] + grouped_df['yes'])
print('\n\n', grouped_df[['prob_default']])
We group the data by GENDER and default.payment.next.month on line 6 and use the function size to count the clients in each combination of gender and default status. We then use the function unstack in the next line. The function unstack performs two steps here:
- It changes the result from a Series into a dataframe.
- It names the columns no and yes, the two categories of the variable default.payment.next.month.
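To make the effect of size and unstack easier to see, the grouping steps above can be sketched on a small made-up table. The eight rows below are invented purely for illustration and are not taken from the credit card dataset; the variable names toy, counts, and table are also just placeholders.

```python
import pandas as pd

# Hypothetical toy data: 4 males and 4 females with made-up default labels.
toy = pd.DataFrame({
    'GENDER': ['male', 'male', 'male', 'female',
               'female', 'female', 'female', 'male'],
    'default.payment.next.month': ['yes', 'no', 'no', 'no',
                                   'yes', 'no', 'no', 'yes'],
})

# size() counts the rows in each (GENDER, default) combination.
# The result is a Series indexed by both grouping columns.
counts = toy.groupby(['GENDER', 'default.payment.next.month']).size()
print(counts)

# unstack() moves the inner index level (no/yes) into columns,
# which turns the Series into a dataframe with one row per gender.
table = counts.unstack()

# Share of clients in each gender group that defaulted.
table['prob_default'] = table['yes'] / (table['no'] + table['yes'])
print('\n', table)
```

If some combination never occurs in the data, unstack leaves a missing value in that cell; passing fill_value=0 to unstack is one way to get a zero count there instead.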