What is AWS Glue Crawler?

Key takeaways:

  • AWS Glue Crawler automates metadata extraction. It scans data sources, infers schema, and organizes metadata in the AWS Glue Data Catalog.

  • AWS Glue Crawler supports various data stores. It works with multiple data stores, including Amazon S3, DynamoDB, MongoDB, and Delta Lake.

  • Proper IAM roles are required for access. The crawler needs IAM role permissions to access and process data within AWS services.

  • AWS Glue Crawler enables efficient querying and data analysis. By storing metadata in a structured format, it simplifies querying, access control, and data transformation.

  • AWS Glue Crawler detects changes in data structure. When run again, it identifies and updates any changes in schema or partitions.

Amazon Web Services (AWS) offers a powerful ETL (Extract, Transform, Load) tool called AWS Glue, designed to streamline the process of preparing and loading data into various AWS services. Whether you’re managing data lakes, performing analytics, or building machine learning pipelines, AWS Glue simplifies data integration by automating key tasks. One of its standout features is the AWS Glue Crawler, which discovers and organizes metadata about your data, making it easier to query, analyze, and manage.

In this Answer, we’ll explore what AWS Glue is, dive deep into how its crawler works with an S3 bucket, and walk through a practical example using a dataset of Netflix movies and TV shows. By the end, you’ll understand how to leverage this tool to unlock the full potential of your data in AWS.

What is AWS Glue?

AWS Glue is a fully managed ETL service that integrates seamlessly with other AWS offerings like Amazon S3, Redshift, and Athena. It handles three core functions:

  • Extract: Pulls data from various sources (e.g., S3, DynamoDB, MongoDB).

  • Transform: Cleans, enriches, or restructures data for downstream use.

  • Load: Deposits the processed data into a target AWS service.

Beyond ETL, AWS Glue catalogs your data by collecting and storing metadata—information about the data, such as its structure, datatypes, partitions, and schema. This metadata is stored in the AWS Glue Data Catalog, a centralized repository that acts as a metadata hub, enabling tools like Amazon Athena to query data efficiently.

Understanding the AWS Glue Crawler

The AWS Glue Crawler is a key component that automates metadata discovery. It scans your data sources, infers their structure, and populates the Data Catalog with organized tables. This eliminates the need to manually define schemas, saving time and reducing errors.

How does the crawler work?

  • Scanning: The crawler explores data in sources like S3 buckets, Delta Lakes, or DynamoDB. It navigates folder structures, identifies files, and reads their contents without altering them. For example, it can scan s3://my-bucket/movies/ to find partitioned CSVs.

  • Inference: It analyzes files to determine their format (e.g., CSV, JSON), partitions (e.g., year=2006), and column data types (e.g., title: string). By sampling data, it builds a schema automatically, adapting to variations like missing headers.

  • Storage: The crawler saves its findings as tables in the AWS Glue Data Catalog, detailing schema and locations. It creates new tables or updates existing ones, ensuring metadata like s3://my-bucket/movies/ is query-ready.

Example: Using AWS Glue Crawler with an S3 bucket

Let’s walk through a hands-on example of setting up an AWS Glue Crawler to catalog metadata from an S3 bucket. Our dataset consists of CSV files containing Netflix movies and TV shows, partitioned by release year.

1. Dataset

We will use a dataset that contains several CSVs of Netflix movies and TV shows, partitioned according to their year of release.

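As a rough, hypothetical sketch (the file names here are illustrative), the local dataset folder is organized so that each release year has its own year=<YYYY> subfolder, which the crawler later picks up as a partition:

```bash
# Hypothetical local layout of the dataset (file names are illustrative).
#
#   movies/
#   ├── year=2006/netflix_titles.csv
#   ├── year=2007/netflix_titles.csv
#   └── ... (one subfolder per release year)

# List the partition folders locally to confirm the layout.
ls movies/
```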

The following steps show how we use the crawler on our dataset.

2. Uploading data to S3 bucket

We first create an S3 bucket with a folder to which we upload our dataset. We can do this using the two commands given below:

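A minimal sketch of these two commands with the AWS CLI might look like this:

```bash
# Create the S3 bucket (bucket names are globally unique, so replace
# educative-3213 with a name of your own when following along).
aws s3 mb s3://educative-3213

# Create an empty Movies/ folder (prefix) inside the bucket.
aws s3api put-object --bucket educative-3213 --key Movies/
```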

The first command creates an S3 bucket, called educative-3213, while the second command creates a Movies folder within educative-3213.

Next, we will upload our dataset to the Movies folder in the S3 bucket using the following command:

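With the AWS CLI, the upload might look like this, assuming the dataset sits in a local folder named movies:

```bash
# Recursively copy the local movies folder (including all year=<YYYY>
# subfolders) into the Movies folder of the bucket.
aws s3 cp ./movies s3://educative-3213/Movies/ --recursive
```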

The recursive flag makes the command apply to all files and folders within the specified directory, which, in our case, means all the files and folders inside our local movies folder.

After running the commands above, we are able to see the S3 bucket, containing a Movies folder with all our data.

3. Creating a database in Glue

The crawler requires a database to use as its output location; the metadata it extracts is stored in a table inside this database.

In AWS Glue, we create a database, naming it crawler-metadata-educative, using the following command:

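A sketch of this command with the AWS CLI might look like the following; the LocationUri entry is optional and simply points the database at our Movies folder:

```bash
# Create the Glue database that will hold the crawler's output tables.
# LocationUri is optional, informational metadata pointing at our data.
aws glue create-database --database-input '{
  "Name": "crawler-metadata-educative",
  "LocationUri": "s3://educative-3213/Movies/"
}'
```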

After running the command above, we can see a new, empty database on the “AWS Glue > Data Catalog > Databases” page, which we can reach by going to the AWS Glue homepage and clicking “Databases” in the sidebar. This database is pointed toward the Movies folder in the bucket we created earlier, primarily for monitoring purposes.

4. Creating an IAM role

The crawler needs several permissions to access the S3 bucket. We use an IAM Role for this.

An AWS Identity and Access Management (IAM) role grants selective, temporary permissions to AWS services: the permission policies attached to the role define what it is allowed to do, while its trust policy defines which AWS services are allowed to assume it.

Every IAM role requires a trust policy, which specifies who is allowed to assume the role. We use the following trust policy for our role.

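A sketch of what trust.json might contain, written here with a shell heredoc for convenience:

```bash
# Write the trust policy to trust.json: only the AWS Glue service is
# allowed to assume this role.
cat > trust.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "glue.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
```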

In the policy above, we specify that the sts:AssumeRole action can only be performed by the Glue service (glue.amazonaws.com), i.e., only AWS Glue can assume this role.

The role also needs permission policies attached to it so that it has the necessary access to resources.

The following three commands are used for the complete creation of our required IAM role.

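With the AWS CLI, the three commands might look like this, assuming the standard ARNs of the two AWS-managed policies:

```bash
# 1. Create the role with the trust policy from trust.json.
aws iam create-role \
  --role-name AWSGlueServiceRoleEduc \
  --assume-role-policy-document file://trust.json

# 2. Attach the AWS-managed policy that grants the permissions Glue needs.
aws iam attach-role-policy \
  --role-name AWSGlueServiceRoleEduc \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# 3. Attach S3 access so the crawler can read our bucket.
aws iam attach-role-policy \
  --role-name AWSGlueServiceRoleEduc \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
```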

The first command creates an IAM role named AWSGlueServiceRoleEduc with the trust policy written in the trust.json file. The second command attaches the “AWSGlueServiceRole” permissions policy to the role, giving it access to several required services, while the third command attaches “AmazonS3FullAccess,” giving the role further access to S3 buckets.

After running the commands above, we can find our AWSGlueServiceRoleEduc listed on the “IAM > Roles” page. To access it, go to the IAM homepage and click “Roles” in the sidebar.

5. Creating a crawler

After the steps above, we now create the crawler we will be using. We can do this using the following command.

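A sketch of this command with the AWS CLI:

```bash
# Create the crawler: the S3 target tells it what to scan, and the
# database name tells it where to store the resulting metadata tables.
aws glue create-crawler \
  --name movies-crawler-educative \
  --role AWSGlueServiceRoleEduc \
  --database-name crawler-metadata-educative \
  --targets '{"S3Targets": [{"Path": "s3://educative-3213/Movies/"}]}'
```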

With the command above, we create a crawler named movies-crawler-educative. We give it the location of our Movies folder in the S3 bucket as the data source, which tells the crawler which data to extract metadata from. We also specify crawler-metadata-educative as the output database.

After running the above command, we find our crawler on the “AWS Glue > Crawlers” page, with its state being “Ready.” We can get to this page by going to the AWS Glue homepage and clicking on “Crawlers” from the sidebar.

6. Running the crawler

With our setup complete, we finally run our crawler using the following command.

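With the AWS CLI, the run can be started like this:

```bash
# Start a crawler run; the command returns immediately while the crawl
# continues in the background.
aws glue start-crawler --name movies-crawler-educative
```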

When this command is run, the “AWS Glue > Crawlers” page shows the crawler movies-crawler-educative in a “Running” state. After some time, it changes to a “Stopping” state. Under “Table changes,” it should show “1 created,” meaning that the crawler created a table during this run.

The crawler’s final state will be “Ready,” with the “Last run” showing a “Succeeded” sign.

7. Metadata table

By opening the movies-crawler-educative page, we see, under “Table changes,” that the crawler has created 1 new table and identified 13 different partitions.

The crawler has saved the metadata in the database we created and specified for it. A new table, named movies, has been created by the crawler within the database crawler-metadata-educative. The number of partitions in this table can be checked using the following command.

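One way to do this from the AWS CLI is to list the table’s partitions and count them:

```bash
# Count the partitions the crawler registered for the movies table.
# --query applies a JMESPath expression to return just the count.
aws glue get-partitions \
  --database-name crawler-metadata-educative \
  --table-name movies \
  --query 'length(Partitions)'
```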

The table contains detailed information about our Movies data. It identifies all partitions, along with other details about our data, which can be seen on the “AWS Glue > Data Catalog > Databases > Tables > movies” page. To get there, go to the AWS Glue homepage, click “Tables” in the sidebar, and then choose the movies table.

However, if we run the crawler again, no new table will be produced, because our data’s structure, along with other metadata components, remains unchanged.

Practice

Enter your AWS AccessKeyID and AWS SecretAccessKey, and then run the commands given above in a terminal. If you don’t have these keys, follow the steps in this documentation under the “Managing access keys (console)” heading to generate the keys.

Note: Kindly remember the following instructions.

  • In the commands above, you should change the name of the bucket to make it globally unique. Every command using the bucket's name should reflect this change.

  • After running the command to run the crawler, wait for the state of the crawler to change to "Ready" before running the last command. This usually takes up to 2-3 minutes.


Get hands-on experience with “Building ETL Pipelines on AWS” Cloud Lab and master the art of creating efficient ETL data pipelines with AWS Glue. Start now and transform raw data into actionable insights!

Benefits of using an AWS Glue Crawler

Here are the benefits of using an AWS Glue Crawler:

  • Automates metadata discovery: Scans and infers schemas/partitions, saving time.

  • Simplifies integration: Populates the Data Catalog for easy use with Athena or ETL tools.

  • Boosts query speed: Identifies partitions for faster, cost-effective queries.

  • Enhances governance: Enables secure, role-based access control.

  • Cuts costs: Reduces manual effort and resource usage.

Conclusion

The AWS Glue Crawler is a useful tool for extracting and storing the metadata of a dataset. It stores the required information in an organized manner and can detect changes to the structure and partitions of the data when it’s run again.

