Managing Missing Data
Learn the common techniques for managing missing data effectively in pandas.
We'll cover the following...
Introduction
Managing missing data effectively is a fundamental aspect of data preprocessing in data science. We learned earlier that NaN
values tend to be propagated in pandas
objects during calculations. While this propagation feature can be desirable, we often need to manipulate these NaN
values to achieve accurate and meaningful analysis.
In this lesson, we’ll look at four techniques for managing and remedying missing data:
Filling
Replacing
Interpolating
Dropping
For this lesson, we’ll be using a mock transaction dataset of an e-commerce business, as shown below:
E-Commerce Transaction Dataset
transaction_id | customer_id | product | quantity | price | date | rating |
1 | 13 | Charger | 2 | 23.81 | 2023-07-22 | 1.0 |
2 | 16 | Mouse | <NA> | 47.12 | 2023-07-23 | 5.0 |
3 | 1 | Charger | 2 | 38.18 | NaT | 2.0 |
4 | 4 | Mouse | 4 | NaN | 2023-07-25 | 3.0 |
5 | 4 | Mouse | 2 | 16.59 | 2023-07-26 | NaN |
Notice that our dummy dataset has a good mix of the different null value representations, i.e., <NA>
, NaT
, and NaN
.
Filling
One of the initial approaches for managing missing data is using the fillna()
method, which allows us to substitute NaN
values with a specified replacement. The replacement can be a fixed number, a measure of central tendency such as mean or median, or a value inferred from other non-null observations (such as forward and backward filling).
The following code example demonstrates the various filling strategies that we can adopt:
# Fill missing values with a fixed scalar value (1 in this case)df['quantity_fill_scalar'] = df['quantity'].fillna(1)# Fill missing values with a string valuedf['quantity_fill_str'] = df['quantity'].fillna('missing')# Fill missing values with mean of all pricesdf['price_fill_mean'] = df['price'].fillna(df['price'].mean())# Forward fill of ratings (propagates the last observed non-null value forward)df['rating_ffill'] = df['rating'].fillna(method='ffill')# Backward fill of transaction date (propagate the next observed non-null value backward)df['date_ffill'] = df['date'].fillna(method='bfill')# View outputprint(df)
The output shows that the missing values have been filled according to the specifications we provided for each column.
Note: We can also use the
ffill()
method (equivalent tofillna(method='ffill')
) or the ...