Data integrity verification using Boolean masks

In this exercise, we will use our knowledge of Boolean arrays to examine some of the duplicate IDs we discovered. In Exercise: Verifying Basic Data Integrity, we learned that no ID appears more than twice. We can use this fact to locate the duplicate IDs and examine them. We will then take action to remove rows of dubious quality from the dataset. Perform the following steps to complete the exercise:

Continuing where we left off in the previous exercise, we need the locations in the id_counts Series where the count is 2, since these identify the duplicates. First, we load the data and get the value counts of IDs, which brings us back to where we left off in the last exercise. Then we create a Boolean mask called dupe_mask that locates the duplicated IDs, and display its first five elements. Use the following commands:
```
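import pandas as pd

# Load the case study data (the filename is split across two lines for width)
df = pd.read_excel('default_of_credit_card_clients'
                   '__courseware_version_1_21_19.xls')

# Count how many times each ID occurs
id_counts = df['ID'].value_counts()
id_counts.head()

# Boolean mask: True where an ID occurs exactly twice
dupe_mask = id_counts == 2
dupe_mask[0:5]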
```
You will obtain the following output (note the ordering of IDs may be different in your output, as value_counts sorts on frequency, not the index of IDs):
```
# 52bcd5ae-72d3    True
# f5e3478e-cf68    True
# 5deff6b6-62ff    True
# cb18af1f-3b53    True
# ac821a7b-b399    True
# Name: ID, dtype: bool
```
Note that the preceding output shows only the first five entries of dupe_mask, to illustrate the contents of this array. You can edit the integer indices in the square brackets ([]) to change the number of entries displayed.
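To see how such a mask behaves without the case study file, here is a minimal sketch on made-up data. The DataFrame toy_df and its ID values are hypothetical and are not part of the dataset; only the value_counts and comparison steps mirror the commands above.

```python
import pandas as pd

# Hypothetical toy data standing in for the case study file
toy_df = pd.DataFrame({'ID': ['a1', 'b2', 'a1', 'c3', 'b2', 'd4']})

# Count occurrences of each ID, then flag the IDs that occur exactly twice
toy_counts = toy_df['ID'].value_counts()
toy_mask = toy_counts == 2

# Summing a Boolean Series counts its True entries,
# that is, the number of IDs that appear twice
n_duplicated_ids = toy_mask.sum()
print(n_duplicated_ids)

# Positional slice of the mask, analogous to dupe_mask[0:5]
print(toy_mask.iloc[0:3])
```

Because True counts as 1 when summed, the sum of the mask is a quick check on how many IDs are affected before inspecting any of them.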

Our next step is to use this logical mask to select the IDs that are duplicated. The IDs themselves are contained as the index of the id_count ...
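A minimal sketch of this selection step, again on made-up data: the names toy_df, toy_dupe_ids, and the value column are hypothetical, and looking up the matching rows with isin is an assumption about one reasonable way to examine the duplicates, not necessarily the approach the exercise goes on to use.

```python
import pandas as pd

# Hypothetical toy data; 'value' is an illustrative column, not from the dataset
toy_df = pd.DataFrame({
    'ID': ['a1', 'b2', 'a1', 'c3', 'b2', 'd4'],
    'value': [10, 20, 0, 30, 0, 40],
})

toy_counts = toy_df['ID'].value_counts()
toy_mask = toy_counts == 2

# The IDs live in the index of the counts Series, so indexing that
# index with the Boolean mask keeps only the IDs counted twice
toy_dupe_ids = list(toy_counts.index[toy_mask])

# Assumed follow-up: pull every row whose ID is among the duplicates
toy_dupe_rows = toy_df.loc[toy_df['ID'].isin(toy_dupe_ids), :]
print(toy_dupe_rows)
```

Viewing both rows of each duplicated ID side by side makes it easier to judge which copy, if either, holds valid data.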
