How to analyze text using Amazon Comprehend

Key takeaways:

  • Amazon Comprehend is an AWS text analysis service that supports sentiment analysis, entity recognition, key phrase detection, language detection, and topic modeling.

  • Comprehend uses prebuilt machine learning algorithms to extract insights from data; however, we can also use custom models tailored to our specific use case.

  • Amazon Comprehend can detect Personally Identifiable Information (PII) in text so that it can be redacted from datasets.

Amazon Comprehend is a text analysis service provided by AWS that uses multiple ML models to extract insights from text. In this Answer, we’ll examine how to use a few of the built-in models.

How does Amazon Comprehend work?

Amazon Comprehend's built-in models are pretrained on large amounts of text, so we do not have to train a model before using the service. However, we can also train custom Amazon Comprehend models to perform analysis tailored to our requirements.
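Because the built-in models are pretrained, a call needs only the input text. The sketch below illustrates this by chaining two requests: it detects the dominant language and passes that code to sentiment detection. It assumes a client object exposing the boto3-style methods detect_dominant_language and detect_sentiment. FakeComprehendClient is a hypothetical stand-in with invented values so the example runs without AWS credentials.

```python
def detect_language_then_sentiment(client, text):
    """Detect the dominant language, then request sentiment in that language."""
    languages = client.detect_dominant_language(Text=text)["Languages"]
    code = max(languages, key=lambda lang: lang["Score"])["LanguageCode"]
    # Sentiment detection supports only some languages, so a real call
    # can fail if the detected language is not one of them.
    sentiment = client.detect_sentiment(Text=text, LanguageCode=code)
    return code, sentiment["Sentiment"]


class FakeComprehendClient:
    """Hypothetical offline stand-in that returns canned, invented responses."""

    def detect_dominant_language(self, Text):
        return {"Languages": [{"LanguageCode": "en", "Score": 0.99}]}

    def detect_sentiment(self, Text, LanguageCode):
        return {"Sentiment": "POSITIVE"}


print(detect_language_then_sentiment(FakeComprehendClient(), "The weather is lovely today."))
```

With boto3 installed and credentials configured, the same function should accept boto3.client("comprehend") in place of the fake client.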

What are the uses of Amazon Comprehend?

Amazon Comprehend is used for a range of text-related tasks. Let’s explore several of them and learn which command we can use for each:

1. Language detection

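Comprehend identifies the dominant language of a text and reports it using language identifiers from RFC 5646. It can recognize a large number of languages. Breaking a sentence down into parts of speech is a separate feature, syntax analysis, which is available through the detect-syntax command.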

The command given below is used to detect the language in the sentence:

aws comprehend detect-dominant-language \
--text "Hello, how are you?"
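The command returns a JSON document with a Languages list, where each entry has a LanguageCode and a Score. As a minimal sketch, the dominant language can be picked out as shown below. The sample response is illustrative and its scores are invented.

```python
import json

# Illustrative response in the shape returned by detect-dominant-language.
# The scores are made up for this example.
sample_response = json.loads("""
{
  "Languages": [
    {"LanguageCode": "en", "Score": 0.98},
    {"LanguageCode": "fr", "Score": 0.02}
  ]
}
""")


def dominant_language(response):
    """Return (language_code, score) for the highest-scoring language."""
    best = max(response["Languages"], key=lambda lang: lang["Score"])
    return best["LanguageCode"], best["Score"]


code, score = dominant_language(sample_response)
print(f"{code} ({score:.2f})")
```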

2. Sentiment analysis

We can use Amazon Comprehend to find out the sentiment of the text. It groups sentiments into the following categories:

  • Positive

  • Negative

  • Mixed

  • Neutral

Let’s run the command below in a terminal to use this feature. It asks Amazon Comprehend to determine the sentiment of the given text:

aws comprehend detect-sentiment \
--text "The weather is lovely today." \
--language-code "en"
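The response contains a Sentiment label and a SentimentScore object with one confidence value per category. The sketch below reads the confidence for the reported label and flags low-confidence results. The sample values are invented, and min_confidence is a hypothetical threshold chosen for illustration.

```python
# Illustrative detect-sentiment response; the numbers are invented.
sample_response = {
    "Sentiment": "POSITIVE",
    "SentimentScore": {
        "Positive": 0.95,
        "Negative": 0.01,
        "Neutral": 0.03,
        "Mixed": 0.01,
    },
}


def summarize_sentiment(response, min_confidence=0.6):
    """Return the label, its confidence, and whether it meets the threshold."""
    label = response["Sentiment"]  # e.g. "POSITIVE"
    # SentimentScore keys are capitalized, e.g. "Positive".
    confidence = response["SentimentScore"][label.capitalize()]
    return {
        "label": label,
        "confidence": confidence,
        "confident": confidence >= min_confidence,
    }


print(summarize_sentiment(sample_response))
```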

3. Entity detection

We can use Amazon Comprehend to detect named entities in the provided text, such as people, organizations, locations, and dates. The entity types it reports include the following (the service also defines QUANTITY and OTHER):

  • COMMERCIAL_ITEM

  • DATE

  • EVENT

  • LOCATION

  • ORGANIZATION

  • PERSON

  • TITLE

The command below uses this feature to detect entities in the text:

aws comprehend detect-entities \
--text "Mexico is located to the South of the US." \
--language-code "en"

When we run the command, each detected entity in the response carries two important fields: Score and Type. Type is the category of the detected entity. For example, Mexico is an entity of type LOCATION in the example above. Score is the confidence, between 0 and 1, that Amazon Comprehend has in the accuracy of the detection. We can use this score to filter out low-confidence detections.
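Filtering on Score can be sketched as follows. The sample response mimics the shape of detect-entities output for the sentence above, but its scores are invented. The low-scoring "South" entry is made up to show the filter at work, and the 0.8 threshold is an arbitrary assumption.

```python
# Illustrative detect-entities response; scores are invented and the
# low-confidence "South" entity exists only to demonstrate filtering.
sample_response = {
    "Entities": [
        {"Text": "Mexico", "Type": "LOCATION", "Score": 0.99, "BeginOffset": 0, "EndOffset": 6},
        {"Text": "South", "Type": "LOCATION", "Score": 0.41, "BeginOffset": 25, "EndOffset": 30},
        {"Text": "US", "Type": "LOCATION", "Score": 0.97, "BeginOffset": 38, "EndOffset": 40},
    ]
}


def confident_entities(response, threshold=0.8):
    """Keep only entities whose Score is at or above the threshold."""
    return [
        (entity["Text"], entity["Type"])
        for entity in response["Entities"]
        if entity["Score"] >= threshold
    ]


print(confident_entities(sample_response))
```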

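Amazon Comprehend can also identify PII entities within text. The AWS CLI provides a dedicated detect-pii-entities command for this, which returns the type, score, and character offsets of each PII entity. We can then implement a process that uses those offsets to redact or remove the entities from the file.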
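One possible redaction process is sketched below. It assumes each detected PII entity carries Type, Score, BeginOffset, and EndOffset fields, and that the offsets index characters of the original string. The sample text, entities, and the min_score threshold are all invented for illustration.

```python
def redact(text, pii_entities, min_score=0.5):
    """Replace each sufficiently confident PII span with a [TYPE] tag."""
    redacted = text
    # Work from the end of the text so earlier offsets stay valid
    # after each replacement changes the string length.
    for entity in sorted(pii_entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity["Score"] < min_score:
            continue
        redacted = (
            redacted[: entity["BeginOffset"]]
            + f"[{entity['Type']}]"
            + redacted[entity["EndOffset"]:]
        )
    return redacted


# Invented sample input and detections, not real service output.
sample_text = "Contact Jane Doe at jane@example.com."
sample_entities = [
    {"Type": "NAME", "Score": 0.99, "BeginOffset": 8, "EndOffset": 16},
    {"Type": "EMAIL", "Score": 0.99, "BeginOffset": 20, "EndOffset": 36},
]

print(redact(sample_text, sample_entities))
```

Replacing spans from last to first avoids having to recompute offsets after every substitution.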

4. Topic modeling

Topic modeling groups a collection of documents by topic. For example, we can use it to categorize news articles into nature, politics, medicine, etc. Amazon Comprehend examines the words in each document, and a set of words that frequently occur together in the same context makes up a topic.

In Amazon Comprehend, topic modeling is an asynchronous process. We pass the S3 location of the input documents to the StartTopicsDetectionJob operation, and the job writes its results to an output S3 location.

Figure: Topic modeling in Amazon Comprehend

Let’s see how we can create a Comprehend topic detection job using the AWS CLI. First, set up the input and output S3 buckets and add documents to the input bucket. Also, create an IAM role that gives the topic detection job permission to read from and write to the S3 buckets.

aws comprehend start-topics-detection-job \
--number-of-topics <number-of-topics> \
--job-name "<job-name>" \
--region <region> \
--cli-input-json file://<path-to-json-input-file>

Here the cli-input-json file will contain the JSON configurations of the input and output S3 buckets as well as the role with permissions to access the buckets:

{
    "InputDataConfig": {
        "S3Uri": "s3://<input-bucket>/<input-path>",
        "InputFormat": "ONE_DOC_PER_FILE"
    },
    "OutputDataConfig": {
        "S3Uri": "s3://<output-bucket>/<output-path>"
    },
    "DataAccessRoleArn": "arn:aws:iam::<account-id>:role/<data-access-role>"
}

We’ll pass the ARN of the IAM role with read-and-write access as the DataAccessRoleArn in this configuration file.
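As a minimal sketch, the configuration file can also be generated with a small script, which helps avoid hand-editing mistakes. The helper name build_topics_job_config and the bucket names, paths, account ID, and role name in the example are placeholders invented for illustration.

```python
import json


def build_topics_job_config(input_s3_uri, output_s3_uri, role_arn,
                            input_format="ONE_DOC_PER_FILE"):
    """Build the JSON structure passed via --cli-input-json."""
    if input_format not in ("ONE_DOC_PER_FILE", "ONE_DOC_PER_LINE"):
        raise ValueError(f"Unsupported input format: {input_format}")
    for uri in (input_s3_uri, output_s3_uri):
        if not uri.startswith("s3://"):
            raise ValueError(f"Not an S3 URI: {uri}")
    return {
        "InputDataConfig": {"S3Uri": input_s3_uri, "InputFormat": input_format},
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "DataAccessRoleArn": role_arn,
    }


# Placeholder values; substitute your own buckets and role.
config = build_topics_job_config(
    "s3://example-input-bucket/docs/",
    "s3://example-output-bucket/results/",
    "arn:aws:iam::123456789012:role/example-comprehend-role",
)
print(json.dumps(config, indent=4))
```

Redirecting the printed output to a file gives a document that can be referenced with the file:// prefix in the command above.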

Hands-on exercise

Configure the AWS CLI with your access_key_id and secret_access_key (for example, by running aws configure) before running any commands. If you don’t have these keys, follow the steps in the AWS documentation to generate them.

Note: The IAM user whose credentials are used must have permission to perform all the required actions.

After configuring the AWS CLI, try out the “language detection,” “sentiment analysis,” and “entity detection” commands above in a terminal.

Frequently asked questions


What is the difference between Amazon Comprehend and Amazon Textract?

Amazon Textract extracts text and data from scanned documents and images. Amazon Comprehend, on the other hand, analyzes text to extract insights from it.


Which algorithm does Amazon Comprehend use?

For topic modeling, Amazon Comprehend uses a machine learning model based on Latent Dirichlet Allocation (LDA).


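Can Amazon Comprehend process text in real time?

Yes. Amazon Comprehend offers synchronous operations, such as detect-sentiment and detect-entities, that return results in real time, as well as asynchronous jobs, such as topic detection, for batch processing.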


Copyright ©2024 Educative, Inc. All rights reserved