Dealing with Missing and Duplicated Data

Learn how to find missing and duplicated data in a DataFrame.

We'll cover the following...

Missing data
Duplicates
Summary

Before we analyze data or build machine learning models on it, we should make sure the data is complete. Many machine learning models also fail to train on DataFrames with missing values.

We'll return to the presidential data.

Missing data

Finding missing data in a DataFrame uses the same methods we saw for a Series. We just need to remember that a DataFrame has an extra dimension. The DataFrame has an .isna method that returns a DataFrame of the same shape, with True and False values indicating whether each value is missing:

import pandas as pd

def tweak_siena_pres(df):
    def int64_to_uint8(df_):
        # Convert every int64 column to uint8
        cols = df_.select_dtypes('int64')
        return (df_
                .astype({col:'uint8' for col in cols}))
    return (df
     .rename(columns={'Seq.':'Seq'})    # 1: drop the trailing period
     # Expand the abbreviated column names, using underscores for spaces
     .rename(columns={k:v.replace(' ', '_') for k,v in
        {'Bg': 'Background',
         'PL': 'Party leadership', 'CAb': 'Communication ability',
         'RC': 'Relations with Congress', 'CAp': 'Court appointments',
         'HE': 'Handling of economy', 'L': 'Luck',
         'AC': 'Ability to compromise', 'WR': 'Willing to take risks',
         'EAp': 'Executive appointments', 'OA': 'Overall ability',
         'Im': 'Imagination', 'DA': 'Domestic accomplishments',
         'Int': 'Integrity', 'EAb': 'Executive ability',
         'FPA': 'Foreign policy accomplishments',
         'LA': 'Leadership ability',
         'IQ': 'Intelligence', 'AM': 'Avoid crucial mistakes',
         'EV': "Experts' view", 'O': 'Overall'}.items()})
     .astype({'Party':'category'})  # 2: store Party as a categorical
     .pipe(int64_to_uint8)  # 3: shrink the integer columns
     # 4: sum the uint8 columns per row, then dense-rank the totals
     .assign(Average_rank=lambda df_:(df_.select_dtypes('uint8')
                 .sum(axis=1).rank(method='dense').astype('uint8')),
             Quartile=lambda df_:pd.qcut(df_.Average_rank, 4,
                 labels='1st 2nd 3rd 4th'.split())
            )
    )

url = 'https://github.com/mattharrison/datasets/raw/master/data/siena2018-pres.csv'
df = pd.read_csv(url, index_col=0)
pres = tweak_siena_pres(df)
# Boolean DataFrame: True wherever a value in pres is missing
print(pres.isna())
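
The example above downloads a CSV, so it can be easier to see what .isna returns on a small frame. The following is a minimal sketch using a made-up DataFrame named toy. Its column names and values are assumptions for illustration and are not part of the presidential data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Siena presidential dataset)
toy = pd.DataFrame({
    'name': ['Ada', 'Bob', None],
    'score': [90.0, np.nan, 75.0],
})

# Same shape and labels as toy; True marks a missing value
mask = toy.isna()
print(mask)

# Booleans sum as 0 and 1, so this counts the missing values per column
missing_per_column = mask.sum()
print(missing_per_column)
```

Summing the boolean result is one way to condense the output when the full True/False table is too large to scan by eye.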
