Data Cleaning Using Apache Spark: Missing Data
Learn how to identify the different types of missing values and how to handle them.
When combining data from multiple sources, we often find errors, inconsistencies, and inaccuracies. During the data cleaning step, we focus on finding and removing duplicate data and handling missing values.
Both operations sound easy in theory, but in practice they require a solid understanding of the data and its business context before we can take any action.
Handling missing values
When we encounter missing values, the first things we should ask ourselves are: why is the data missing, how much of it is missing, and can we fill it in?
There could be countless reasons why data is missing:
Human error (accidental deletion)
Technical errors (a power outage in the middle of a database update)
Even a thunderstorm near a sensor device
The easiest thing to do when a value is missing is to remove the entire record or column. However, the data might be important, and we should keep as much of it as possible. If we can figure out why the data went missing, we might be able to salvage the record by filling the gap with a suitable value.
Types of missing values
Generally, missing values fall into three types, depending on why a data point is missing: ...