Ensuring Data Privacy in Practice
Learn about industry solutions for ensuring data privacy, including how synthetic data and federated learning can be used to protect sensitive information.
Theoretical approaches carry value, but this lesson will cover some of the more common techniques and tools used in the real world to ensure data privacy and minimize reidentification and leakage risks.
Synthetic twins
Synthetic data can be used to create high-fidelity, fake “copies” of a dataset that don’t contain any of the personally identifiable information (PII) of the original set. Recall that in earlier lessons, we discussed sourcing data synthetically. Here, we generate a new synthetic dataset from an existing one, aiming to retain the statistical properties of the original while removing all of the PII.
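To make the idea concrete, here is a minimal sketch of a synthetic twin in Python with NumPy. It is only an illustration under stated assumptions: the function name `make_synthetic_twin`, the two columns (age and blood pressure), and the generated "real" table are all hypothetical, and a simple fitted Gaussian stands in for the far more sophisticated generators used by commercial tools. The point is that new rows are sampled from a model of the data rather than copied from real records.

```python
import numpy as np


def make_synthetic_twin(real, n_rows, seed=0):
    """Sample synthetic rows from a multivariate Gaussian fitted to `real`.

    Hypothetical helper for illustration: `real` is a 2-D array of numeric
    columns from which direct identifiers (names, IDs) were already dropped.
    """
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)            # per-column means
    cov = np.cov(real, rowvar=False)    # covariance between columns
    return rng.multivariate_normal(mean, cov, size=n_rows)


# Stand-in for a real table (assumed data): age and a correlated
# blood pressure reading for 2,000 patients.
rng = np.random.default_rng(42)
age = rng.normal(50, 12, size=2000)
blood_pressure = 90 + 0.6 * age + rng.normal(0, 8, size=2000)
real = np.column_stack([age, blood_pressure])

synthetic = make_synthetic_twin(real, n_rows=2000)

# Compare a property we want the twin to preserve: the age/pressure correlation.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real correlation:      {real_corr:.3f}")
print(f"synthetic correlation: {synth_corr:.3f}")
```

The two correlations should come out close to each other, even though no synthetic row corresponds to an actual patient. A Gaussian only captures means and linear relationships, which is one reason real-world tools rely on richer generative models.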
Many solutions in the healthcare industry attempt to address concerns around HIPAA (the Health Insurance Portability and Accountability Act, a US healthcare data compliance law) by creating synthetic twin datasets. Synthetic data is usually generated via an adversarial algorithm. Recall that we spoke about this approach when we considered data bias. Essentially, one algorithm tries to identify whether there’s a major difference between the real and synthetic datasets, while the other iteratively creates new versions of the synthetic data to try to beat the identifier. Some companies that offer this functionality are MDClone and Octopize. Of course, in ...