Exercise: Continuing Verification of Data Integrity
Learn to verify the integrity of the data using boolean masks.
In this exercise, we will use our knowledge of Boolean arrays to examine some of the duplicate IDs we discovered. In Exercise: Verifying Basic Data Integrity, we learned that no ID appears more than twice. We can use this fact to locate the duplicate IDs and examine them, and then remove rows of dubious quality from the dataset. Perform the following steps to complete the exercise:
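As a warm-up, here is a minimal, self-contained sketch of the masking idea, using a small synthetic Series of IDs (not the course dataset). It shows how comparing the output of `value_counts` to 2 produces a Boolean mask that isolates the duplicated IDs:

```python
import pandas as pd

# Synthetic IDs for illustration only: 'a1' and 'b2' each appear twice
ids = pd.Series(['a1', 'b2', 'c3', 'a1', 'd4', 'b2'])

# Count how many times each ID occurs
id_counts = ids.value_counts()

# Boolean mask: True where an ID occurs exactly twice
dupe_mask = id_counts == 2

# Index the counts' index with the mask to recover the duplicated IDs
dupe_ids = id_counts.index[dupe_mask]
print(sorted(dupe_ids))  # ['a1', 'b2']
```

The steps below apply this same pattern to the real `ID` column of the dataset.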
1. Continuing where we left off in the previous exercise, we need to find the locations in the `id_counts` Series where the count is 2, in order to locate the duplicates. First, we load the data and get the value counts of the IDs to return to where the last exercise ended; then we create a Boolean mask locating the duplicated IDs in a variable called `dupe_mask` and display its first five elements. Use the following commands:

```python
import pandas as pd
df = pd.read_excel('default_of_credit_card_clients'
                   '__courseware_version_1_21_19.xls')
id_counts = df['ID'].value_counts()
id_counts.head()
dupe_mask = id_counts == 2
dupe_mask[0:5]
```
You will obtain the following output (note that the ordering of IDs may differ in your output, as `value_counts` sorts by frequency, not by ID):