Data Cleaning Using Apache Spark: Duplicate Data

Learn several methods for detecting and removing duplicate data.

Another aspect of data cleaning is removing duplicate data. The goal is to ensure that each record in the dataset is unique. This matters because duplicates can lead to inaccurate results, cause errors when loading data into an existing schema, and negatively impact data analysis and reporting.

Duplicates can arise for many reasons: human error, software bugs, and mistakes when joining or merging data from different sources.

Handling duplicate data

Similar to handling missing data, we should try to understand whether a duplicate value needs to be removed. It requires some business context, and our actions will depend on the desired outcome.

When we find duplicate records, we can remove them, aggregate them into single records, flag them with an additional column, or simply keep them as they are. Generally, there are two main types of duplicate data (both are handled in the sketch after this list):

  1. Identical duplicates

  2. Near-identical duplicates
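
As a rough illustration, here is a minimal PySpark sketch (the DataFrame and its column values are invented for illustration) showing how identical duplicates can be removed outright, how near-identical duplicates can be collapsed on a subset of key columns, and how rows can be flagged instead of dropped:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

# Hypothetical customer data containing both kinds of duplicates.
df = spark.createDataFrame(
    [
        (1, "Alice", "Smith", "12 Oak St"),
        (1, "Alice", "Smith", "12 Oak St"),      # identical duplicate
        (2, "Bob",   "Jones", "34 Elm St"),
        (2, "Bob",   "Jones", "34 Elm Street"),  # near-identical duplicate
    ],
    ["customer_id", "first_name", "last_name", "address"],
)

# 1. Remove identical duplicates: every column must match exactly.
exact_dedup = df.dropDuplicates()

# 2. Remove near-identical duplicates: match only on the key column,
#    keeping one arbitrary row per key.
key_dedup = df.dropDuplicates(["customer_id"])

# 3. Flag duplicates with an additional column instead of dropping them.
w = Window.partitionBy("customer_id")
flagged = df.withColumn("is_duplicate", F.count("*").over(w) > 1)

flagged.show()
```

Which of these options is appropriate depends on the business context: dropping rows is irreversible, while flagging them preserves the original data for later review.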

Let’s see an example of handling identical and near-identical duplicate records. For this demonstration, we’ll use a CSV file containing customer data. The corresponding table is stored in an OLTP relational database, so all rows must be unique and identified by their primary keys.

The primary keys in this case are:

  1. customer_id

  2. first_name, last_name, and address
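
Given these keys, a duplicate check might look like the following minimal sketch (the file name customers.csv and the SparkSession setup are assumptions for illustration); it counts rows per key and keeps only the keys that appear more than once:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pk-check").getOrCreate()

# Hypothetical path; the real file name may differ.
customers = spark.read.csv("customers.csv", header=True, inferSchema=True)

# Key values that violate uniqueness on customer_id.
dup_ids = (
    customers.groupBy("customer_id")
    .count()
    .filter(F.col("count") > 1)
)

# Key values that violate uniqueness on (first_name, last_name, address).
dup_names = (
    customers.groupBy("first_name", "last_name", "address")
    .count()
    .filter(F.col("count") > 1)
)

dup_ids.show()
dup_names.show()
```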

A sample of the data looks like this:
