Amazon Textract is used to extract text from scanned images. Amazon Comprehend, on the other hand, is used to extract insights from text.
Key takeaways:
AWS Comprehend is a text analysis service that supports sentiment analysis, entity recognition, key phrase detection, language detection, and topic modeling.
Comprehend uses prebuilt machine learning models to extract insights from text; however, we can also train custom models tailored to our specific use case.
AWS Comprehend can identify and remove Personally Identifiable Information (PII) from datasets.
Amazon Comprehend is a text analysis service provided by AWS that uses multiple ML models to extract insights from text. In this Answer, we’ll examine how to use a few of the built-in models.
AWS Comprehend’s built-in models are pretrained on vast amounts of text, so we do not have to train a model before using the service. However, we can also build custom AWS Comprehend models to perform analysis tailored to our requirements.
AWS Comprehend is extensively used for multiple text-related tasks. Let’s explore these tasks and the commands we can use to analyze text with Comprehend:
Comprehend uses language identifiers from RFC 5646 and can detect multiple languages. Additionally, it can break a sentence down to identify its different parts of speech.
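For example, the detect-syntax operation tags each word in the text with its part of speech:

aws comprehend detect-syntax --text "Hello, how are you?" --language-code "en"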
The command given below is used to detect the language in the sentence:
aws comprehend detect-dominant-language --text "Hello, how are you?"
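A successful call returns the detected language codes along with confidence scores; the exact score will vary, but the response looks like this:

{
    "Languages": [
        {
            "LanguageCode": "en",
            "Score": 0.99
        }
    ]
}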
We can use Amazon Comprehend to find out the sentiment of the text. It groups sentiments into the following categories:
Positive
Negative
Mixed
Neutral
Let’s execute the command below in the terminal at the bottom of the page. It uses Amazon Comprehend to determine the sentiment of the following text:
aws comprehend detect-sentiment --text "The weather is lovely today." --language-code "en"
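The response contains an overall sentiment label along with a confidence score for each of the four categories; an illustrative response for this text:

{
    "Sentiment": "POSITIVE",
    "SentimentScore": {
        "Positive": 0.98,
        "Negative": 0.0,
        "Neutral": 0.01,
        "Mixed": 0.01
    }
}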
We can use Amazon Comprehend to get information about entities detected in the provided text, such as people, organizations, locations, and dates. Amazon Comprehend can detect the following entities:
COMMERCIAL_ITEM
DATE
EVENT
LOCATION
ORGANIZATION
OTHER
PERSON
QUANTITY
TITLE
The command below uses this feature to detect entities in the given text:
aws comprehend detect-entities --text "Mexico is located to the South of the US." --language-code "en"
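For this sentence, the response lists each detected entity along with its type, character offsets, and a confidence score; the values below are illustrative rather than exact:

{
    "Entities": [
        {
            "Score": 0.99,
            "Type": "LOCATION",
            "Text": "Mexico",
            "BeginOffset": 0,
            "EndOffset": 6
        },
        {
            "Score": 0.98,
            "Type": "LOCATION",
            "Text": "US",
            "BeginOffset": 38,
            "EndOffset": 40
        }
    ]
}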
When we run the command, the response includes two important fields for each detected entity: Score and Type. Type identifies the kind of entity detected; for example, Mexico is an entity of type LOCATION in the example above. Score represents the degree of assurance Amazon Comprehend has in the accuracy of the entity type detection, and we can use it to filter out inaccurate detections.
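For example, to keep only high-confidence detections, we can filter the response with the AWS CLI’s global --query option (a JMESPath expression; the 0.9 threshold here is an arbitrary choice):

aws comprehend detect-entities --text "Mexico is located to the South of the US." --language-code "en" --query 'Entities[?Score > `0.9`]'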
We can leverage Amazon Comprehend’s entity recognition capabilities to identify PII entities within the text and then implement a process to redact or remove those entities from the file.
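Comprehend also provides a dedicated operation, detect-pii-entities, which returns the type and character offsets of each PII entity so that we can redact them; the sample text below is made up:

aws comprehend detect-pii-entities --text "My name is John Doe, and my email is john@example.com." --language-code "en"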
Topic modeling categorizes documents by topic. For example, we can use it to group news articles into categories such as nature, politics, and medicine. AWS Comprehend analyzes the words in each document; a set of words that frequently occur together in a particular context makes up a topic.
On AWS Comprehend, topic modeling is an asynchronous process. We provide a list of documents stored in an S3 bucket to the StartTopicsDetectionJob operation, which writes its results to an output S3 bucket.
Let’s see how we can create a Comprehend topic detection job using AWS CLI. First, set up input and output S3 buckets and add documents to the input S3 bucket. Also, create an IAM role that gives our topic detection job permission to read and write to the S3 buckets.
aws comprehend start-topics-detection-job --number-of-topics <topics to return> --job-name "<job name>" --region <region> --cli-input-json file://<path to JSON input file>
Here, the cli-input-json file contains the JSON configuration of the input and output S3 buckets, as well as the role with permission to access them:
{"InputDataConfig": {"S3Uri": "s3://input bucket/input path","InputFormat": "ONE_DOC_PER_FILE"},"OutputDataConfig": {"S3Uri": "s3://output bucket/output path"},"DataAccessRoleArn": "arn:aws:iam::account ID:role/data access role"}
We’ll pass the ARN of the IAM role with read-and-write access as the DataAccessRoleArn in this configuration file.
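Because the job runs asynchronously, start-topics-detection-job returns a JobId rather than the results themselves. We can poll the job with the command below (the job ID is a placeholder) and, once its JobStatus is COMPLETED, read the results from the output S3 bucket:

aws comprehend describe-topics-detection-job --job-id "<job ID returned by start-topics-detection-job>"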
Enter your AWS access_key_id and secret_access_key in the widget below before running any commands. If you don’t have these keys, follow the steps in this documentation to generate them.
Note: The IAM user whose credentials are used must have permission to perform all the required actions.
After the successful configuration of AWS, try out the above “language detection,” “sentiment analysis,” and “entity detection” commands in the terminal below: