Data Preprocessing: Identifiers

Learn how we can remove identifying data.

Identifiers

The goal of machine learning is to create an algorithm that can make predictions about data. Or, as we said before: to put a label on a thing. We build the algorithm with data that is already labeled, but the goal is to predict labels we don’t know yet.

We don’t tell the algorithm how to decide which label to select. Instead, we give it the data and let it figure out the rule. However, a sufficiently flexible algorithm might simply memorize all the data we provide. This is referred to as overfitting. The result is an algorithm that performs well on known data but poorly on unknown data.

If our goal was only to predict labels we already know, the best thing we could do is memorize all passengers and whether they survived or not. But if we want to create an algorithm that performs well even on unknown data, we need to prevent memorization.
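To make the effect of memorization concrete, here is a minimal sketch. The passenger ids and labels are hypothetical toy values, not taken from the real dataset, and the `Memorizer` class is an illustrative name rather than part of any library. It stores every label it has seen and falls back to a fixed guess for everything else.

```python
# Hypothetical toy records: passenger id -> survived (1) or not (0).
# These values are made up for illustration only.
known = {1: 0, 2: 1, 3: 1, 4: 0}
unknown = {5: 1, 6: 0}


class Memorizer:
    """A 'model' that only stores the labels it has seen."""

    def fit(self, labeled):
        self.table = dict(labeled)
        return self

    def predict(self, passenger_id, default=0):
        # Known ids are looked up; unknown ids get a fixed guess.
        return self.table.get(passenger_id, default)


def accuracy(model, labeled):
    """Fraction of records whose predicted label matches the true label."""
    hits = sum(model.predict(pid) == label for pid, label in labeled.items())
    return hits / len(labeled)


model = Memorizer().fit(known)
train_accuracy = accuracy(model, known)    # perfect, because every id was memorized
test_accuracy = accuracy(model, unknown)   # depends only on the fixed guess
print(train_accuracy, test_accuracy)
```

The lookup table is perfect on the passengers it has seen, yet it has learned nothing that carries over to new passengers. A column that identifies each row gives a real algorithm the same shortcut.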

We haven’t even started building our algorithm, yet the features we choose already affect whether it can memorize data, because our data contains potential identifiers.

  • When looking at the first five entries of the dataset, three columns appear suspicious: the “PassengerId,” the “Name,” and the “Ticket”.

  • The “PassengerId” is a consecutive number. Therefore, there should be no connection between how big the number is and whether a passenger survived.

  • A passenger’s name or the number on a ticket shouldn’t be a decisive factor for survival. Instead, these are data identifying single passengers. Let’s validate this assumption by looking at how many unique values these columns contain.
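One way to count the unique values per column is sketched below. It uses only the Python standard library and a small hypothetical sample in the shape of the passenger table. The rows, the extra “Sex” column used for contrast, and the helper name `count_unique` are assumptions for illustration, not the real data or the course’s own code.

```python
import csv
import io

# Hypothetical sample in the shape of the passenger table; all values are made up.
SAMPLE_CSV = """PassengerId,Name,Ticket,Sex
1,"Doe, Mr. John",A 100,male
2,"Doe, Mrs. Jane",A 100,female
3,"Roe, Miss. Ann",B 200,female
4,"Poe, Mr. Max",C 300,male
"""


def count_unique(rows, columns):
    """Return {column: number of distinct values} for the given columns."""
    seen = {col: set() for col in columns}
    for row in rows:
        for col in columns:
            seen[col].add(row[col])
    return {col: len(values) for col, values in seen.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
unique_counts = count_unique(rows, ["PassengerId", "Name", "Ticket", "Sex"])

for col, n in unique_counts.items():
    # A count equal to the number of rows means the column identifies each row.
    print(f"{col}: {n} unique values in {len(rows)} rows")
```

A column whose unique count equals, or comes close to, the number of rows behaves like an identifier, while a column with only a few distinct values does not. If the data is loaded into a pandas DataFrame, `DataFrame.nunique()` reports the same per-column counts.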
