Exploring the Financial History Features in the Dataset

Learn to explore the financial history features of the dataset.

We are ready to explore the rest of the features in the case study dataset. First, set up the environment and load the data from the previous exercise using the following snippet:

import numpy as np
import pandas as pd
import matplotlib as mpl  # additional plotting functionality
import matplotlib.pyplot as plt  # plotting package

# Render plots automatically in the notebook (IPython/Jupyter magic)
%matplotlib inline

mpl.rcParams['figure.dpi'] = 400  # high-resolution figures

# Load the cleaned data from the previous exercise
df = pd.read_csv('Chapter_1_cleaned_data.csv')

Investigating the financial history features of the dataset

The remaining features to be examined are the financial history features. They fall naturally into three groups: the status of the monthly payments for the last 6 months, and the billed and paid amounts for the same period. First, let’s look at the payment statuses. It is convenient to break these out as a list so we can study them together. You can do this using the following code:

pay_feats = ['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

We can use the describe method on these six Series to examine summary statistics:

df[pay_feats].describe()

This should produce the following output:

Summary statistics of payment status features

Here, we observe that the range of values is the same for all of these features: -2, -1, 0, …, 8. The value 9, described in the data dictionary as a payment delay of nine months or more, is never observed.
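One way to confirm an observation like this directly, instead of reading it off the summary table, is to compare the values that actually occur against the documented levels. The following is a minimal sketch that uses a small made-up DataFrame (toy_df, with hypothetical values) in place of the case study data; the names toy_feats, documented_levels, and unobserved are illustrative and not part of the original exercise.

```python
import pandas as pd

# Hypothetical stand-in for the case study data; the values are made up.
toy_df = pd.DataFrame({
    'PAY_1': [-2, -1, 0, 1, 2, 8],
    'PAY_2': [-2, 0, 0, 2, 3, 8],
})
toy_feats = ['PAY_1', 'PAY_2']

# Levels discussed in the text: -2 through 9
documented_levels = set(range(-2, 10))

# Minimum and maximum of each payment status column
value_range = toy_df[toy_feats].agg(['min', 'max'])
print(value_range)

# Documented levels that never appear in any of the columns
observed = set(toy_df[toy_feats].to_numpy().ravel().tolist())
unobserved = sorted(documented_levels - observed)
print('Levels never observed:', unobserved)
```

With the real data, the same check could be run by swapping in df and pay_feats for the toy names.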

We have already clarified the meaning of all of these levels, some of which were not in the original data dictionary. Let’s look again at the value_counts() of PAY_1, this time sorted by the values being counted, which form the index of the resulting Series:

df[pay_feats[0]].value_counts().sort_index()

This should produce the following output:

# -2     2476
# -1     5047
#  0    13087
#  1     3261
#  2     2378
#  3      292
#  4       63
#  5       23
#  6       11
#  7        9
#  8       17
# Name: PAY_1, dtype: int64
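To see why sort_index() is needed here, it may help to compare it with the default ordering of value_counts(), which lists the most frequent value first. The sketch below uses a short made-up Series (statuses, hypothetical values) so that the difference is easy to see.

```python
import pandas as pd

# Made-up payment statuses, for illustration only
statuses = pd.Series([0, 0, 0, -1, -1, 2], name='PAY_1')

# Default ordering: most frequent value first
by_count = statuses.value_counts()
print(by_count)

# Ordered by the status value itself, which is the index of the result
by_value = statuses.value_counts().sort_index()
print(by_value)
```

Sorting by the index keeps the levels in their natural order, which makes the distribution easier to read when the levels have an ordinal meaning.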

Most of the values are -2, -1, or 0, far outnumbering the positive integer values. These three levels correspond to an account that was in good standing last month: not used, paid in full, or made at least the minimum payment.
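To put a number on this, the counts from the output above can be summed for the levels at or below 0. This is a small sketch that re-enters those counts by hand into a Series (pay_1_counts, a name chosen here for illustration) so that it runs without the data file.

```python
import pandas as pd

# Counts copied from the value_counts() output shown above
pay_1_counts = pd.Series(
    [2476, 5047, 13087, 3261, 2378, 292, 63, 23, 11, 9, 17],
    index=[-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8],
    name='PAY_1',
)

# Accounts in good standing have a status of 0 or below
good_standing = pay_1_counts[pay_1_counts.index <= 0].sum()
total = pay_1_counts.sum()
good_share = good_standing / total

print(f'{good_standing} of {total} accounts ({good_share:.1%}) were in good standing')
```

With the DataFrame loaded, the same share could be obtained from (df['PAY_1'] <= 0).mean().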

Notice that, because of the definition of the other values of this variable (1 = payment delay for 1 month; 2 = payment delay for 2 months, and so on), this feature is sort of a hybrid of categorical and numerical features. Why should no credit usage correspond to a value of -2, while a value of 2 means a 2-month late payment, and so on? We should acknowledge that the numerical coding of payment ...
