Data Cleaning Using Apache Spark: Missing Data
Learn how to identify the different types of missing values and how to handle them.
When combining data from multiple sources, we often find errors, inconsistencies, and inaccuracies. During the data cleaning step, we focus on finding and removing duplicate data and handling missing values.
Both of these operations might sound easy in theory, but in practice, they require a great deal of business context and understanding of the data before we can take any action.
Handling missing values
When we encounter missing values, the first things we should ask ourselves are: Why is the data missing? How much is missing? And can we fill it?
There could be countless reasons why data is missing:
Human error (accidental deletion)
Technical errors (blackouts in the middle of updating a DB)
Even a thunderstorm near a sensor device
The easiest thing to do when a value is missing is to remove the entire record or column. However, the data might be important, and we should keep as much of it as possible. If we could figure out why the data went missing, we might be able to salvage it and fill it with another value.
Types of missing values
Generally, there are three types/reasons why a data point could be missing:
Missing completely at random (MCAR)
This refers to a situation where the probability that a data point is missing is independent of both the value of the missing data point and the values of any other variables in the dataset. The missing data is not related to any pattern or structure in the data. If a data point is truly missing completely at random, we can either drop it or replace it with the mean, median, or mode value.
For example, a speed radar produced a missing value because of an earthquake. It had nothing to do with the car (other variables) or its speed (missing value). It was completely random.
Sample data collected by the speed radar: