Data Scrubbing Operation: Drop Missing Values
This lesson covers ways of removing missing values from a dataset.
Quick overview: Another common but more complicated problem is deciding what to do with missing data. Missing data can be split into three categories:
- Missing completely at random (MCAR)
- Missing at random (MAR)
- Nonignorable.
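Missing completely at random means the missing value has no relationship to any other value in the dataset, whether observed or missing. Missing at random means the missing value is unrelated to its own value but is related to the values of other variables. In other words, the reason why the value is missing is linked to another variable in the dataset and not due directly to the value itself.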
Lastly, nonignorable missing data is data that is absent directly because of its own value or the sensitivity of the information. For example, tax-evading citizens or respondents with a criminal record may decline to answer certain questions because they find those questions sensitive. The irony of these three categories is that it is difficult to diagnose why data is missing precisely because the data is missing.
Problem-solving skills and awareness of these three categories can help diagnose and correct the root cause of missing values. This might include rewording surveys for second-language speakers to reduce data missing at random, or redesigning data collection methods, such as observing sensitive information rather than asking participants for it directly, to capture nonignorable missing values.
A rough understanding of why certain data is missing can also inform how we manage and treat missing values. If male participants, for example, are more willing to supply information about their salary than female participants, this would rule out using the mean of the existing data (drawn mostly from male respondents) to populate the missing values (belonging mostly to female respondents).
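The salary example above can be sketched in pandas. This is a minimal illustration, not a prescribed method: the DataFrame, its column names (gender, salary), and every value in it are hypothetical and chosen only to show how an overall mean can be dominated by the group that answered.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data (made-up values): most missing salaries
# belong to female respondents.
df = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "salary": [60000, 70000, 80000, 90000, 50000, np.nan, np.nan, np.nan],
})

# Share of missing salaries within each gender group
missing_share = df["salary"].isnull().groupby(df["gender"]).mean()

# Mean of all observed salaries (NaN is skipped by default)
overall_mean = df["salary"].mean()

# Mean of observed salaries within each gender group
group_means = df.groupby("gender")["salary"].mean()

print(missing_share)
print(overall_mean)
print(group_means)
```

In this made-up data, four of the five observed salaries come from male respondents, so the overall mean sits close to the male group mean. Comparing the share of missing values per group is one simple way to spot that the missingness may depend on another variable.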
Managing MCAR is relatively straightforward as the data values collected can be considered a random sample and are more easily aggregated or estimated. We’ll discuss common methods for filling missing values in this chapter, but first let’s review the code in Python for inspecting missing values.
df.isnull().sum()
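As a self-contained sketch of the inspection call above, the following builds a small DataFrame and counts the missing values in each column. The data and column names (age, city, income) are hypothetical. The final step uses pandas' dropna() as one possible way of removing incomplete rows; it is shown here as an assumption about what dropping might look like, not necessarily the approach used later in this chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "city": ["Oslo", "Lima", None, "Pune"],
    "income": [52000, np.nan, np.nan, 61000],
})

# Count missing values (NaN or None) in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
df_complete = df.dropna()
print(len(df_complete))
```

isnull() returns a DataFrame of True/False flags with the same shape as df, and sum() adds those flags column by column, so each result is the number of missing entries in that column.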