Handling Missing Data

Missing data is present in many real-world datasets and is often handled by removing the affected data points or imputing them. Imputing means replacing missing values with estimated values. In this lesson, we'll learn how data storytellers handle missing data.
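To make the two options concrete, here is a minimal sketch in plain Python. The sensor readings are hypothetical, None is assumed to mark a missing value, and mean imputation is only one of several possible estimation strategies.

```python
from statistics import mean

# Hypothetical sensor readings; None marks a missing measurement (an assumption).
readings = [21.0, None, 22.0, 23.0, None]

# Option 1: removal. Keep only the values that were actually observed.
observed = [r for r in readings if r is not None]

# Option 2: imputation. Replace each missing value with an estimate,
# here the mean of the observed values.
fill_value = mean(observed)
imputed = [fill_value if r is None else r for r in readings]

print(observed)
print(imputed)
```

Removal shrinks the dataset, while imputation keeps its size but introduces estimated values that were never measured.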

Why analyze missing data?

In some cases, analyzing missing data helps us understand potential trends and insights that are not captured in our dataset. Missing data can be caused by several factors, such as:

  • Erroneous reporting. For example, consider a digital surveillance camera that is damaged due to weather conditions and is consistently producing blurry footage, or a damaged temperature sensor on a manufacturing floor that is reporting incorrect measurements.

  • Participants who don't wish to provide certain data for survey/

Depending on the programming framework and libraries we are using, missing data can be represented in different ways, including NaN, N/A, NA, 0 values, and more.
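Because the same dataset can mix several of these markers, it often helps to normalize them into one representation before analysis. The following is a rough sketch using only Python's standard library. The column names, the sample rows, and the MISSING_TOKENS set are all hypothetical. Note that 0 is deliberately not treated as missing here, since a 0 may be a genuine value.

```python
import csv
import io

# Hypothetical set of tokens treated as "missing"; adjust it for the dataset at hand.
MISSING_TOKENS = {"", "nan", "n/a", "na"}

def normalize(cell):
    """Return None for cells that look like missing-value markers, else the stripped text."""
    text = cell.strip()
    return None if text.lower() in MISSING_TOKENS else text

# Hypothetical CSV content with three different missing-value markers.
raw = "name,years_experience\nAda,NaN\nLin,N/A\nSam,3\nKai,NA\n"

rows = [
    {key: normalize(value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(raw))
]

# Count how many records lack a value in the years_experience column.
missing_count = sum(row["years_experience"] is None for row in rows)
print(missing_count)
```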

Missing data can also be classified into several types, including:

  • Structurally missing data: The data is missing because it does not exist in the first place.

  • Missing completely at random (MCAR): The missingness is unrelated to any values in the data, observed or unobserved, so the affected records could be deleted.

  • Missing at random (MAR): The missingness is related to other observed values in the data, so there is a possibility the missing values could be predicted. The affected records could also be deleted.

  • Missing not at random (MNAR): The missingness might depend on the missing values themselves, so there is a structure to it, but we don't have the data to predict it. This type of data should not be deleted.
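The difference between deleting MCAR-style and MNAR-style missing data can be illustrated with a small simulation. This is a sketch under stated assumptions: the values, the 30% missingness rate, and the cutoff of 70 are all hypothetical and chosen only to make the effect visible.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical complete data: the integers 1 through 100.
values = list(range(1, 101))

# MCAR-style: every value has the same chance of going missing,
# regardless of how large or small it is.
mcar_observed = [v for v in values if rng.random() > 0.3]

# MNAR-style: values above 70 are never reported, so the missingness
# depends on the value itself.
mnar_observed = [v for v in values if v <= 70]

print(mean(values))         # mean of the complete data
print(mean(mcar_observed))  # mean after deleting MCAR-style missing records
print(mean(mnar_observed))  # mean after deleting MNAR-style missing records
```

In this setup, deleting the MNAR-style records pulls the mean well below that of the complete data, because the missing values are systematically the large ones. With the MCAR-style records, the remaining values are a random subset, so the mean is not systematically shifted in either direction.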

Survey data example

Consider the following hypothetical example of a survey data use case.

For this scenario, survey participants are asked to provide information on their expertise and experience with a tool so the developers can better understand the tool's user population. Let's take a look at the dataset:
