Anonymizing and Encrypting Using Python

Learn about anonymizing and encrypting sensitive data as part of the transform stage in an ETL pipeline.

When dealing with sensitive data such as passwords, financial data, medical records, or confidential business information, we often need to protect it before it moves further through the pipeline. During the transform stage of the ETL pipeline, we can do this with data anonymization or data encryption methods.

Data anonymization

During data anonymization, we remove or obscure Personally Identifiable Information (PII) from a dataset to protect the privacy of users and clients.

There are several methods of anonymizing data, including:

  • Masking: Replacing sensitive information with characters such as asterisks.

  • Perturbation: Adding random noise or error to the data to obscure specific values. For example, a dataset of GPS locations of users used for a statistical analysis might be perturbed by adding some random, normally distributed noise to keep the exact coordinates hidden while still allowing the analysts to perform statistical analysis on the overall distribution of the dataset.

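  • Hashing: Using a hash function to transform the original data into a fixed-size string of characters that cannot practically be reversed. The same input always produces the same hash, and with a cryptographic hash function it is infeasible in practice to find two different inputs that produce the same hash. For example, when storing password data, we store only the hash of each password, and during authentication we hash the submitted password and compare the result to the stored hash value.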
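The three methods above can be sketched with Python's standard library alone. This is a minimal illustration rather than a production recipe: the function names (`mask`, `perturb`, `hash_password`, `verify_password`), the noise scale `sigma`, the random salt, and the iteration count are all assumptions chosen for the example, not values taken from this lesson.

```python
import hashlib
import hmac
import random
import secrets


def mask(value, visible=4, mask_char="*"):
    """Masking: replace all but the last `visible` characters."""
    if visible <= 0:
        return mask_char * len(value)
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[hidden:]


def perturb(coords, sigma=0.001, rng=None):
    """Perturbation: add zero-mean Gaussian noise to each (lat, lon) pair."""
    rng = rng or random.Random()
    return [
        (lat + rng.gauss(0, sigma), lon + rng.gauss(0, sigma))
        for lat, lon in coords
    ]


def hash_password(password, salt=None, iterations=200_000):
    """Hashing: derive a fixed-size digest from a password and a salt.

    The salt and iteration count are illustrative assumptions.
    """
    if salt is None:
        salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, digest


def verify_password(password, salt, expected, iterations=200_000):
    """Re-hash the candidate password and compare digests in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)


if __name__ == "__main__":
    print(mask("4111111111111111"))  # ************1111
    print(perturb([(52.5200, 13.4050)]))  # slightly shifted coordinates
    salt, stored = hash_password("correct horse")
    print(verify_password("correct horse", salt, stored))  # True
```

In this sketch, masking and hashing are deterministic for a given input (and salt), while perturbation returns different values on every call unless we pass in a seeded `random.Random` instance. How much noise is acceptable depends on the analysis the data is meant to support.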

Data encryption

Data encryption is the process of converting sensitive data into an encrypted form to protect it from unauthorized access. Only those who hold the correct decryption key can decipher and access it.

There are two types of encryption methods:

  1. Symmetric encryption: In symmetric encryption, we use the same secret key to both encrypt the data and decrypt it. This means that symmetric encryption is only secure as long as the key is kept confidential and protected from unauthorized access.
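To make the shared-key idea concrete, here is a toy sketch that uses only the standard library. It XORs the data with a random key of the same length, so applying the same key a second time restores the original bytes. The names `generate_key`, `encrypt`, and `decrypt` are hypothetical, and the scheme is an assumption chosen for illustration: it is only safe if each key is used once and kept secret, so a real pipeline would more likely rely on a vetted library such as the `cryptography` package.

```python
import secrets


def generate_key(length):
    """Create a random secret key with as many bytes as the message."""
    return secrets.token_bytes(length)


def _xor(data, key):
    # XOR is its own inverse, which is why one key serves both directions.
    if len(key) != len(data):
        raise ValueError("key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


def encrypt(plaintext, key):
    """Encrypt bytes with the shared secret key (toy example)."""
    return _xor(plaintext, key)


def decrypt(ciphertext, key):
    """Decrypt bytes with the same shared secret key."""
    return _xor(ciphertext, key)


if __name__ == "__main__":
    message = b"account=12345;balance=100"
    key = generate_key(len(message))
    ciphertext = encrypt(message, key)
    print(decrypt(ciphertext, key) == message)  # True
```

Anyone who obtains `key` can run `decrypt`, which is exactly why the key has to be stored and shared as carefully as the data itself.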
