Analyzing Individual Quantities
This lesson focuses on how to analyze different quantities to look for skewness and bias in the data.
Once the data types are known, exploratory data analysis (EDA) usually starts with analyzing individual variables. Summarizing a variable or looking at its distribution shows its typical values and spread, and can reveal skewness.
We will be using the Default of Credit Card Clients Dataset. This dataset contains information on default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. We will use the cleaned version of the dataset from the lesson Inconsistent Data. The individual columns are described below.
# Default of Credit Card Clients Dataset
# There are 25 variables:
# ID: ID of each client
# LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
# GENDER: Gender (male, female)
# EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others)
# MARRIAGE: Marital status (married, single, others)
# AGE: Age in years
# PAY_1: Repayment status in September, 2005 (0=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
# PAY_2: Repayment status in August, 2005 (scale same as above)
# PAY_3: Repayment status in July, 2005 (scale same as above)
# PAY_4: Repayment status in June, 2005 (scale same as above)
# PAY_5: Repayment status in May, 2005 (scale same as above)
# PAY_6: Repayment status in April, 2005 (scale same as above)
# BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
# BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
# BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
# BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
# BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
# BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
# PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
# PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
# PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
# PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
# PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
# PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
# default.payment.next.month: Default payment (yes, no)
Summary stats
Summarizing a variable gives us information that we can use to draw conclusions or make decisions. Some common summary statistics are:
- mean
- median
- quartiles
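To make these statistics concrete, here is a minimal sketch that computes them with pandas. The credit limits below are made up for illustration and are not taken from the lesson's dataset. One large value is included on purpose to show how skew pulls the mean away from the median.

```python
import pandas as pd

# Hypothetical credit limits (illustrative only, not from the lesson's dataset).
# The single large value makes the sample right-skewed.
limits = pd.Series([20000, 30000, 50000, 50000, 80000, 500000])

mean_limit = limits.mean()      # arithmetic average, sensitive to extreme values
median_limit = limits.median()  # middle value, robust to extreme values

# First, second, and third quartiles (pandas interpolates linearly by default)
quartiles = limits.quantile([0.25, 0.5, 0.75])

print("mean:", mean_limit)
print("median:", median_limit)
print(quartiles)
```

In this sample the mean comes out well above the median, which is a typical sign of a right-skewed variable.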
We can use the describe function on our dataframe, which summarizes individual columns for us, or we can select the columns that we want and use functions like mean, std, and max on them.
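The second option, calling individual functions on selected columns, might look like the sketch below. It uses a small made-up dataframe as a stand-in so that it runs on its own. The column names match the dataset, but the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's data; the values are made up.
df = pd.DataFrame({
    "AGE": [24, 35, 41, 29, 53],
    "LIMIT_BAL": [20000, 120000, 90000, 50000, 200000],
})

# Select the columns of interest, then call one function at a time
subset = df[["AGE", "LIMIT_BAL"]]
col_means = subset.mean()  # mean of each column
col_stds = subset.std()    # sample standard deviation (ddof=1) of each column
col_maxes = subset.max()   # maximum of each column

# agg computes several statistics in a single call
summary = subset.agg(["mean", "std", "max"])
print(summary)
```

Calling the functions one by one is handy when only a single number is needed, while agg keeps several statistics together in one table.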
import pandas as pd
df = pd.read_csv('credit_card_cleaned.csv')
# Get summary stats
print(df[['EDUCATION','AGE']].describe())
We have selected two variables and then called the function describe on them in line 4. The output of line 4 gives us the count, mean, standard deviation, quartiles, minimum, and maximum.
By looking at the output, we find out that
- The average age is