Inconsistent Data

This lesson will focus on some of the common inconsistencies present in datasets and how to deal with them using pandas.

Inconsistencies in data usually arise from errors made while collecting it. For instance, if the data was gathered from multiple sources, or recorded by several people who did not follow the same format, there is a high chance that the same information is encoded in different ways.
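To make the multiple-source problem concrete, here is a minimal sketch. The two small tables and their column names are hypothetical and are not part of the lesson's dataset; they only illustrate how the same field can arrive in different encodings and how pandas can bring it to one form.

```python
import pandas as pd

# Hypothetical records from two sources that encode the same field differently.
source_a = pd.DataFrame({"client": [1, 2], "gender": ["male", "female"]})
source_b = pd.DataFrame({"client": [3, 4], "gender": ["M", "F "]})

combined = pd.concat([source_a, source_b], ignore_index=True)

# Trim whitespace, lowercase, then map every spelling to one canonical label.
canonical = {"male": "male", "m": "male", "female": "female", "f": "female"}
combined["gender"] = combined["gender"].str.strip().str.lower().map(canonical)

print(combined["gender"].tolist())
```

Any spelling missing from the mapping becomes NaN after `map`, which makes unexpected encodings easy to spot.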

In this lesson, we will clean the Credit Cards Default Dataset. It is a good example of the kinds of inconsistencies found in many real-world datasets.

Credit cards default dataset

The documented details of the individual columns are listed below. As we will see, however, the actual dataset is not fully consistent with this documentation.

# There are 25 variables:
# ID: ID of each client
# LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
# GENDER: Gender (1=male, 2=female)
# EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
# MARRIAGE: Marital status (1=married, 2=single, 3=others)
# AGE: Age in years
# PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
# PAY_2: Repayment status in August, 2005 (scale same as above)
# PAY_3: Repayment status in July, 2005 (scale same as above)
# PAY_4: Repayment status in June, 2005 (scale same as above)
# PAY_5: Repayment status in May, 2005 (scale same as above)
# PAY_6: Repayment status in April, 2005 (scale same as above)
# BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
# BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
# BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
# BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
# BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
# BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
# PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
# PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
# PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
# PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
# PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
# PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
# default.payment.next.month: Default payment (1=yes, 0=no)
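One way to check a dataset against documentation like this is to compare the values each coded column actually contains with the documented codes. The sketch below is only an illustration: the allowed codes come from the descriptions above, while the sample rows and the helper name `undocumented_values` are hypothetical.

```python
import pandas as pd

# Documented codes taken from the column descriptions above.
documented = {
    "GENDER": {1, 2},
    "EDUCATION": {1, 2, 3, 4, 5, 6},
    "MARRIAGE": {1, 2, 3},
}

# Hypothetical sample rows, not taken from the real dataset.
sample = pd.DataFrame({
    "GENDER": [1, 2, 2],
    "EDUCATION": [2, 0, 6],
    "MARRIAGE": [1, 3, 0],
})

def undocumented_values(frame, allowed):
    """Return, per column, the sorted values that fall outside the documented codes."""
    report = {}
    for column, codes in allowed.items():
        present = set(frame[column].dropna().unique().tolist())
        extra = present - codes
        if extra:
            report[column] = sorted(extra)
    return report

report = undocumented_values(sample, documented)
print(report)
```

Running the same helper on the real DataFrame, once it is loaded, would list any codes that the documentation does not cover.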

Let’s load the dataset.

import pandas as pd
# Read Data
df = pd.read_csv('credit_card.csv')
# Print head and column names
print(df.head())
print(df.columns)

Just by looking at the output, we can see that pandas keeps serial numbers (an index) for us automatically. Since the ID column already identifies each client, the first column is redundant and we can remove it. We can use the drop function to drop columns by ...
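As a rough sketch of dropping that redundant first column, the example below uses a small in-memory CSV as a hypothetical stand-in for `credit_card.csv`. It assumes the first column is a leftover serial number with an empty header, which pandas names `Unnamed: 0`; the real file may name it differently.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV whose first column is a leftover serial number.
csv_text = ",ID,LIMIT_BAL\n0,1,20000\n1,2,120000\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())

# Look up the first column's name instead of hard-coding it (assumed redundant here).
first_column = df.columns[0]
df = df.drop(columns=[first_column])
print(df.columns.tolist())
```

`drop` returns a new DataFrame by default, so the result is assigned back to `df`.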
