How to remove PII using Amazon Comprehend

What are PII entities?

Personally identifiable information (PII) entities are pieces of information that can be used to identify an individual. They are sensitive data elements that, alone or when linked together, can reveal a person’s identity. Examples include a full name, social security number, date of birth, address, and phone number.

Protecting PII is crucial for privacy and security reasons. Organizations that handle PII must implement strict security measures and comply with data protection regulations to prevent unauthorized access, use, or disclosure of this sensitive information.

Amazon Comprehend

Amazon Comprehend is a natural language processing service from Amazon Web Services (AWS) that uses machine learning to analyze and extract insights from text. Its synchronous PII detection call, detect_pii_entities(), does not return redacted text. It returns only the type, confidence score, and character offsets of each PII entity it finds. (Comprehend also offers asynchronous PII detection jobs that can write redacted output for documents stored in Amazon S3, but those are outside the scope of this article.) We can therefore use the detection results to locate PII entities within the text and then redact or remove those entities ourselves.
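To make the identify-then-redact idea concrete, here is a minimal sketch of how entity offsets map back to the text. It makes no AWS call: sample_text and sample_response are hypothetical, hand-written values in the shape that detect_pii_entities() returns, and describe_entities() is an illustrative helper, not part of boto3.

```python
# Hypothetical sample text and a hand-written response in the shape that
# detect_pii_entities returns; the scores and offsets are illustrative only.
sample_text = "Call Jane Doe at 555-0100."
sample_response = {
    "Entities": [
        {"Score": 0.99, "Type": "NAME", "BeginOffset": 5, "EndOffset": 13},
        {"Score": 0.98, "Type": "PHONE", "BeginOffset": 17, "EndOffset": 25},
    ]
}


def describe_entities(text, response):
    """Return (type, matched substring) pairs for each detected entity."""
    return [
        (entity["Type"], text[entity["BeginOffset"]:entity["EndOffset"]])
        for entity in response["Entities"]
    ]


for entity_type, value in describe_entities(sample_text, sample_response):
    print(entity_type, "->", value)
```

Each entity carries only offsets into the original string, so slicing the text with BeginOffset and EndOffset recovers the exact span to redact.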

Next, we will look at some code to see how to use Amazon Comprehend to remove PII entities from our text files.

Code

Below is an example of how we can remove PII entities from our text using Amazon Comprehend:

main.py (reads personal-info.txt from the same directory)

import boto3

# Placeholders: replace with your own values before running.
aws_access_key = "<ENTER YOUR AWS ACCESS KEY HERE>"
aws_secret_key = "<ENTER YOUR AWS SECRET KEY HERE>"
aws_region = "<ENTER YOUR REGION HERE>"

# Create a client for the Amazon Comprehend service.
comprehend = boto3.client(
    'comprehend',
    aws_access_key_id=aws_access_key,
    aws_secret_access_key=aws_secret_key,
    region_name=aws_region,
)


def remove_pii(text):
    # Ask Comprehend for the PII entities in the text.
    response = comprehend.detect_pii_entities(
        Text=text,
        LanguageCode='en'
    )
    pii_entities = response['Entities']

    redacted_text = text
    for entity in pii_entities:
        start_offset = entity['BeginOffset']
        end_offset = entity['EndOffset']
        # Overwrite only the detected span. The mask has the same length
        # as the span, so the offsets of the other entities stay valid.
        mask = "*" * (end_offset - start_offset)
        redacted_text = redacted_text[:start_offset] + mask + redacted_text[end_offset:]
    return redacted_text


# Read the input file; the with block closes it automatically.
with open("./personal-info.txt", "r") as file:
    document_text = file.read()

print("Original text: ", document_text)
filtered_text = remove_pii(document_text)
print("New Text: ", filtered_text)

Note: Replace the values of aws_access_key, aws_secret_key, and aws_region with your own credentials and region for the code above to work properly.
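Hardcoding secrets in a source file makes them easy to leak, so one alternative is to read them from environment variables. The sketch below assumes the standard variable names AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION are set; comprehend_client_kwargs() is a hypothetical helper introduced here for illustration, not a boto3 function.

```python
import os


def comprehend_client_kwargs(env=os.environ):
    """Build keyword arguments for boto3.client('comprehend', ...) from env vars."""
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of a later authentication error.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        "region_name": env["AWS_DEFAULT_REGION"],
    }


# Usage (requires boto3 and real credentials):
#   import boto3
#   comprehend = boto3.client("comprehend", **comprehend_client_kwargs())
```

boto3 also looks up these same environment variables on its own when no explicit credentials are passed to boto3.client(), so the helper mainly serves to report a missing variable early.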

Explanation

  • Credentials and client setup: We set up the AWS credentials and use them to create a client for the Amazon Comprehend service.

  • The remove_pii() function: We call comprehend.detect_pii_entities() with our text. It returns a dictionary that contains information about the detected PII entities. We use the beginning and ending offsets of each detected entity to replace it with * characters.

  • File reading and function call: We read the text from the provided file and pass it to remove_pii() to redact the PII entities.
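Masking with * keeps the text length unchanged, which is why the offsets stay valid from one entity to the next. If we instead want to replace each entity with a label such as [NAME], the replacement changes the length of the text, so the spans have to be processed from the end backwards. Below is a minimal sketch of that variation. redact_with_labels() and its min_score parameter are hypothetical names introduced here, the sample entities are hand-written, and the spans are assumed not to overlap.

```python
def redact_with_labels(text, entities, min_score=0.0):
    """Replace each entity span with a [TYPE] tag, e.g. [NAME].

    `entities` is a list of dicts with Type, Score, BeginOffset, and EndOffset
    keys, the shape of response['Entities'] from detect_pii_entities.
    """
    redacted = text
    # Work from the end of the text backwards so that changing the length of
    # one span does not shift the offsets of spans still to be processed.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity["Score"] < min_score:
            continue  # skip low-confidence detections
        tag = "[" + entity["Type"] + "]"
        redacted = redacted[:entity["BeginOffset"]] + tag + redacted[entity["EndOffset"]:]
    return redacted


# Hypothetical entities, hand-written for illustration.
text = "Call Jane Doe at 555-0100."
entities = [
    {"Score": 0.99, "Type": "NAME", "BeginOffset": 5, "EndOffset": 13},
    {"Score": 0.60, "Type": "PHONE", "BeginOffset": 17, "EndOffset": 25},
]
print(redact_with_labels(text, entities))
print(redact_with_labels(text, entities, min_score=0.9))
```

The min_score filter trades recall for precision: a higher threshold leaves uncertain detections untouched, which may be undesirable when missing a PII entity is costlier than over-redacting.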

Conclusion

To conclude, we saw how to use Amazon Comprehend to detect PII entities in a given text and how to use the offsets it returns to redact those entities from our text.

Copyright ©2024 Educative, Inc. All rights reserved