To better illustrate how dangerous reidentification is, we examine a relevant example in the financial context. We’ll take the recent Experian data breaches as inspiration.
Setup
Imagine we have three datasets—the Netflix ratings dataset (made public for research/competitions), the IMDb ratings dataset (always public), and credit data from Experian (obtained and released through a major data breach or leak).
Summary
Here’s a quick summary of the steps we’ll use to illustrate reidentification:
Build the Netflix dataset (an individual
user
gives amovie
arating
), the IMDb dataset (an individualuser
with anemail
gives amovie
arating
), and the Experian dataset (an individualuser
with anemail
and aname
has acredit_score
and anannual_income
).Match the Netflix table with the IMDb dataset to see if we can find users with near-identical ratings across the two different sites.
Match similar users’ emails with those in the Experian dataset to retrieve their credit scores and incomes with high fidelity.
This information can then be used for predatory advertising or simply sold on the dark web.
Get hands-on with 1400+ tech skills courses.