Exercise: Exploring the Credit Limit and Demographic Features
Learn to identify and correct data quality issues and visualize continuous data using histograms.
Data quality assurance and exploration
So far, we remedied two data quality issues just by asking basic questions or by looking at the info() summary. Let’s now take a look at the first few columns of data. Before we get to the historical bill payments, we have the credit limits of the accounts, LIMIT_BAL, and the SEX, EDUCATION, MARRIAGE, and AGE demographic features. Our business partner has reached out to let us know that gender should not be used to predict credit-worthiness, as this is unethical by their standards. We will keep this in mind for future reference. Now we’ll explore the rest of these columns, making any corrections that are necessary.
In order to further explore the data, we will use histograms. Histograms are a good way to visualize data that is on a continuous scale, such as currency amounts and ages. A histogram groups similar values into bins and shows the number of data points in these bins as a bar graph.
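To make the idea of binning concrete, here is a minimal sketch using NumPy's histogram function. The ages and the bin edges are made up for illustration and are not taken from the case study data.

```python
import numpy as np

# Hypothetical ages, invented for illustration only
ages = np.array([22, 25, 27, 31, 34, 35, 41, 48, 52])

# Explicit bin edges: [20, 30), [30, 40), [40, 50), [50, 60]
counts, bin_edges = np.histogram(ages, bins=[20, 30, 40, 50, 60])

# Each count is the height of one bar in the histogram
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left} to {right}: {count} data points")
```

Here three ages fall in each of the first two bins, two in the third, and one in the last. A plotted histogram draws these four counts as bars.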
To plot histograms, we will start to get familiar with the graphical capabilities of pandas. pandas relies on another library called Matplotlib to create graphics, so we’ll also set some options using matplotlib. Using these tools, we’ll also learn how to get quick statistical summaries of data in pandas.
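One common way to get a quick statistical summary in pandas is the describe() method. The sketch below assumes that this is the kind of summary meant here. It uses a tiny made-up table whose column names mirror the case study, but the values are invented.

```python
import pandas as pd

# Tiny made-up table; the values are invented for illustration
toy = pd.DataFrame({
    'LIMIT_BAL': [20000, 50000, 120000, 200000],
    'AGE': [24, 31, 37, 45],
})

# describe() reports count, mean, std, min, quartiles, and max per column
summary = toy.describe()
print(summary)
```

Checking the minimum and maximum in such a summary is a quick way to spot values that are not sensible, such as a negative credit limit or an implausible age.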
Visualizing the features using histograms
In this exercise, we’ll start our exploration of data with the credit limit and age features. We will visualize them and get summary statistics to check that the data contained in these features is sensible. Then we will look at the education and marriage categorical features to see if the values there make sense, correcting them as necessary. LIMIT_BAL and AGE are numerical features, meaning they are measured on a continuous scale. Consequently, we’ll use histograms to visualize them. Perform the following steps in the Jupyter notebook at the end of the lesson to complete the exercise:
- In addition to pandas, import matplotlib and set up some plotting options with this code snippet. Note the use of comments in Python with #. Anything appearing after a # on a line will be ignored by the Python interpreter:
    import pandas as pd
    import matplotlib.pyplot as plt #import plotting package
    #render plotting automatically
    %matplotlib inline
    import matplotlib as mpl #additional plotting functionality
    mpl.rcParams['figure.dpi'] = 400 #high resolution figures
  This imports matplotlib and uses rcParams to set the resolution (dpi = dots per inch) for a nice crisp image; you may not want to worry about this last part unless you are preparing things for presentation, as it could make the images quite large in the notebook.
- Load our progress from the previous exercise using the following code:
    df_clean_2 = pd.read_csv('df_clean_2.csv')
- Run df_clean_2[['LIMIT_BAL', 'AGE']].hist() and you should see the following histograms:
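The same hist() call can be tried outside a notebook on synthetic data. The following is a minimal sketch under several assumptions: the values are randomly generated rather than read from df_clean_2.csv, the output file name toy_histograms.png is hypothetical, and the Agg backend stands in for the %matplotlib inline notebook setting.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the case study data (values are random, not real)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'LIMIT_BAL': rng.integers(10_000, 500_000, size=200),
    'AGE': rng.integers(21, 70, size=200),
})

# One histogram per selected column; pandas returns an array of Axes
axes = toy[['LIMIT_BAL', 'AGE']].hist()

plt.tight_layout()
plt.savefig('toy_histograms.png')  # hypothetical output file name
```

pandas titles each subplot with its column name, so the saved figure contains one panel for LIMIT_BAL and one for AGE. The shapes of these histograms will not match the ones from the real data.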