Managing Missing Data

Learn the common techniques for managing missing data effectively in pandas.

Introduction

Managing missing data effectively is a fundamental aspect of data preprocessing in data science. We learned earlier that NaN values tend to be propagated in pandas objects during calculations. While this propagation feature can be desirable, we often need to manipulate these NaN values to achieve accurate and meaningful analysis.

In this lesson, we’ll look at four techniques for managing and remedying missing data:

  • Filling

  • Replacing

  • Interpolating

  • Dropping

For this lesson, we’ll be using a mock transaction dataset of an e-commerce business, as shown below:

E-Commerce Transaction Dataset

transaction_id

customer_id

product

quantity

price

date

rating

1

13

Charger

2

23.81

2023-07-22

1.0

2

16

Mouse

<NA>

47.12

2023-07-23

5.0

3

1

Charger

2

38.18

NaT

2.0

4

4

Mouse

4

NaN

2023-07-25

3.0

5

4

Mouse

2

16.59

2023-07-26

NaN

Notice that our dummy dataset has a good mix of the different null value representations, i.e., <NA>, NaT, and NaN.

Filling

One of the initial approaches for managing missing data is using the fillna() method, which allows us to substitute NaN values with a specified replacement. The replacement can be a fixed number, a measure of central tendency such as mean or median, or a value inferred from other non-null observations (such as forward and backward filling).

The following code example demonstrates the various filling strategies that we can adopt:

Press + to interact
# Fill missing values with a fixed scalar value (1 in this case)
df['quantity_fill_scalar'] = df['quantity'].fillna(1)
# Fill missing values with a string value
df['quantity_fill_str'] = df['quantity'].fillna('missing')
# Fill missing values with mean of all prices
df['price_fill_mean'] = df['price'].fillna(df['price'].mean())
# Forward fill of ratings (propagates the last observed non-null value forward)
df['rating_ffill'] = df['rating'].fillna(method='ffill')
# Backward fill of transaction date (propagate the next observed non-null value backward)
df['date_ffill'] = df['date'].fillna(method='bfill')
# View output
print(df)

The output shows that the missing values have been filled according to the specifications we provided for each column.

Note: We can also use the ffill() method (equivalent to fillna(method='ffill')) or the ...