Missing Data

Learn about the sources of, and potential remedies for, missing data.

Causes of missing data

Missing data is a common occurrence when applying machine learning to business data. While there are many reasons for missing data, the following are the most common:

  • The data is collected via a manual process and is prone to errors (e.g., data being tracked in a spreadsheet).

  • Multiple datasets are joined together (e.g., joining database tables can produce missing values).

  • A particular feature is considered optional in the data source (e.g., an IT system).

  • Datasets are acquired from external sources (e.g., datasets acquired from governments often have missing values).
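To make the join case concrete, here is a minimal sketch in plain Python of how a left join can introduce missing values. The tables, column names, and the `left_join` helper are all hypothetical and exist only for illustration.

```python
# Hypothetical tables: every name and value here is made up for illustration.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
    {"customer_id": 3, "name": "Linus"},
]
orders = {  # keyed by customer_id; customer 3 has no order on record
    1: {"last_order_total": 120.0},
    2: {"last_order_total": 75.5},
}


def left_join(left_rows, right_by_key, key, right_columns):
    """Keep every left row; use None for right-hand columns with no match."""
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        merged = dict(row)
        for column in right_columns:
            merged[column] = match.get(column)  # None marks a missing value
        joined.append(merged)
    return joined


joined = left_join(customers, orders, "customer_id", ["last_order_total"])
for row in joined:
    print(row)
```

Every customer survives the join, but the customer without a matching order ends up with a missing `last_order_total`. Neither source table contained a missing value; the join itself created it.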

Because missing data is so common, strategies for dealing with it are critical for crafting the most valuable machine learning models.

Dealing with missing data

When dealing with missing data, there are six basic strategies:

  • Fix the data in the source system. This is often not possible.

  • Use an algorithm that can handle missing values automatically (e.g., CART).

  • If a small percentage of observations have missing data, remove those observations.

  • Remove the feature with missing data.

  • Find a proxy feature that can be used instead of the feature with missing data.

  • Fill in the missing data.

In practice, the last three strategies are the most commonly used. However, it’s worth noting that when a small percentage of observations have missing data, removing them is a simple and viable approach.
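Three of these strategies can be sketched in a few lines of plain Python: removing observations, removing a feature, and filling in missing data. The rows, column names, and helper functions below are hypothetical. Using the median as the fill value is an assumption made for illustration, and it is only one of several possible choices.

```python
from statistics import median

# Hypothetical observations; None marks a missing value.
rows = [
    {"income": 52.0, "tenure": 3, "referrer": None},
    {"income": None, "tenure": 5, "referrer": "web"},
    {"income": 61.0, "tenure": 2, "referrer": None},
    {"income": 48.0, "tenure": 7, "referrer": None},
]


def drop_incomplete_rows(rows, columns):
    """Remove observations that are missing a value in any given column."""
    return [r for r in rows if all(r[c] is not None for c in columns)]


def drop_feature(rows, column):
    """Remove a feature (column) from every observation."""
    return [{k: v for k, v in r.items() if k != column} for r in rows]


def fill_with_median(rows, column):
    """Replace missing values in a column with the median of observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill_value = median(observed)
    return [
        {**r, column: fill_value if r[column] is None else r[column]}
        for r in rows
    ]


# "referrer" is mostly missing, so drop the whole feature.
without_referrer = drop_feature(rows, "referrer")

# Option A: only one row lacks "income", so dropping that row is cheap.
complete_only = drop_incomplete_rows(without_referrer, ["income", "tenure"])

# Option B: keep every row and fill the missing "income" instead.
imputed = fill_with_median(without_referrer, "income")

print(len(complete_only), imputed[1]["income"])
```

Dropping rows shrinks the dataset, while filling keeps every observation at the cost of introducing an estimated value. Which trade-off is acceptable depends on how much data is missing.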

Proxy features

A proxy feature is a feature that is highly correlated with the feature that has missing values and can therefore be used in its place. Take, for example, the following sample of data from the Titanic training dataset.
