Amazon Textract is used to extract text from scanned images. Amazon Comprehend, on the other hand, is used to extract insights from text.
Key takeaways:
AWS Comprehend is a text analysis service that supports sentiment analysis, entity recognition, key phrase detection, language detection, and topic modeling.
Comprehend uses prebuilt machine learning models to extract insights from text; however, we can also train custom models tailored to our specific use case.
AWS Comprehend can identify and remove Personally Identifiable Information (PII) from datasets.
Amazon Comprehend is a text analysis service provided by AWS that uses multiple ML models to extract insights from text. In this Answer, we’ll examine how to use a few of the built-in models.
AWS Comprehend’s built-in models are pretrained on vast amounts of text, so we do not have to train a model before using the service. However, we can also build custom AWS Comprehend models to perform analysis tailored to our requirements.
AWS Comprehend is extensively used for multiple text-related tasks. Let’s explore these tasks and the commands we can use to analyze text with Comprehend:
Comprehend uses language identifiers from RFC 5646 and can detect multiple languages. Additionally, it can break a sentence down to identify its different parts of speech.
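For example, the detect-syntax operation tags each word in the text with its part of speech:

aws comprehend detect-syntax --text "Hello, how are you?" --language-code "en"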
The command given below is used to detect the language in the sentence:
aws comprehend detect-dominant-language --text "Hello, how are you?"
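A successful call returns the detected language codes along with confidence scores; the exact score will vary, but the response looks like this:

{
    "Languages": [
        {
            "LanguageCode": "en",
            "Score": 0.99
        }
    ]
}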
We can use Amazon Comprehend to find out the sentiment of the text. It groups sentiments into the following categories:
Positive
Negative
Mixed
Neutral
Let’s execute the command below in the terminal at the bottom of the page. It uses Amazon Comprehend to determine the sentiment of the following text:
aws comprehend detect-sentiment --text "The weather is lovely today." --language-code "en"
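The response contains an overall sentiment label along with a confidence score for each of the four categories; an illustrative response for this text:

{
    "Sentiment": "POSITIVE",
    "SentimentScore": {
        "Positive": 0.98,
        "Negative": 0.0,
        "Neutral": 0.01,
        "Mixed": 0.01
    }
}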
We can use Amazon Comprehend to get information about entities detected in the provided text, such as people, organizations, locations, and dates. Amazon Comprehend can detect the following entities:
COMMERCIAL_ITEM
DATE
EVENT
LOCATION
ORGANIZATION
OTHER
PERSON
QUANTITY
TITLE
The command below uses this feature to detect entities in the given text:
aws comprehend detect-entities --text "Mexico is located to the South of the US." --language-code "en"
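For this sentence, the response lists each detected entity along with its type, character offsets, and a confidence score; the values below are illustrative rather than exact:

{
    "Entities": [
        {
            "Score": 0.99,
            "Type": "LOCATION",
            "Text": "Mexico",
            "BeginOffset": 0,
            "EndOffset": 6
        },
        {
            "Score": 0.98,
            "Type": "LOCATION",
            "Text": "US",
            "BeginOffset": 38,
            "EndOffset": 40
        }
    ]
}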
When we run the command, the response includes two important fields for each detected entity: Score and Type. Type identifies the kind of entity detected; for example, Mexico is an entity of type LOCATION in the example above. Score represents the degree of assurance Amazon Comprehend has in the accuracy of the entity type detection, and we can use it to filter out inaccurate detections.
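For example, to keep only high-confidence detections, we can filter the response with the AWS CLI’s global --query option (a JMESPath expression; the 0.9 threshold here is an arbitrary choice):

aws comprehend detect-entities --text "Mexico is located to the South of the US." --language-code "en" --query 'Entities[?Score > `0.9`]'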
We can leverage Amazon Comprehend’s entity recognition capabilities to identify PII entities within the text and then implement a process to redact or remove those entities from the file.
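Comprehend also provides a dedicated operation, detect-pii-entities, which returns the type and character offsets of each PII entity so that we can redact them; the sample text below is made up:

aws comprehend detect-pii-entities --text "My name is John Doe, and my email is john@example.com." --language-code "en"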
Topic modeling categorizes documents by topic. For example, we can use it to group news articles into categories such as nature, politics, and medicine. AWS Comprehend analyzes the words in each document; a set of words that frequently occur together in a particular context makes up a topic.
On AWS Comprehend, topic modeling is an asynchronous process. We provide a list of documents stored in an S3 bucket to the StartTopicsDetectionJob operation, which writes its results to an output S3 bucket.
Let’s see how we can create a Comprehend topic detection job using AWS CLI. First, set up input and output S3 buckets and add documents to the input S3 bucket. Also, create an IAM role that gives our topic detection job permission to read and write to the S3 buckets.
aws comprehend start-topics-detection-job --number-of-topics <topics to return> --job-name "<job name>" --region <region> --cli-input-json file://<path to JSON input file>
Here, the cli-input-json file contains the JSON configuration of the input and output S3 buckets, as well as the role with permission to access them:
{"InputDataConfig": {"S3Uri": "s3://input bucket/input path","InputFormat": "ONE_DOC_PER_FILE"},"OutputDataConfig": {"S3Uri": "s3://output bucket/output path"},"DataAccessRoleArn": "arn:aws:iam::account ID:role/data access role"}
We’ll pass the ARN of the IAM role with read-and-write access as the DataAccessRoleArn in this configuration file.
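Because the job runs asynchronously, start-topics-detection-job returns a JobId rather than the results themselves. We can poll the job with the command below (the job ID is a placeholder) and, once its JobStatus is COMPLETED, read the results from the output S3 bucket:

aws comprehend describe-topics-detection-job --job-id "<job ID returned by start-topics-detection-job>"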
Enter your AWS access_key_id and secret_access_key in the widget below before running any commands. If you don’t have these keys, follow the steps in this documentation to generate them.
Note: The IAM user whose credentials are used must have permission to perform all the required actions.
After the successful configuration of AWS, try out the above “language detection,” “sentiment analysis,” and “entity detection” commands in the terminal below: