What is AWS Glue Crawler?

Key takeaways:

  • AWS Glue Crawler automates metadata extraction. It scans data sources, infers schema, and organizes metadata in the AWS Glue Data Catalog.

  • AWS Glue Crawler supports various data stores. It works with multiple data stores, including Amazon S3, DynamoDB, MongoDB, and Delta Lake.

  • Proper IAM roles are required for access. The crawler needs IAM role permissions to access and process data within AWS services.

  • AWS Glue Crawler enables efficient querying and data analysis. By storing metadata in a structured format, it simplifies querying, access control, and data transformation.

  • AWS Glue Crawler detects changes in data structure. When run again, it identifies and updates any changes in schema or partitions.

Amazon Web Services (AWS) offers a powerful ETL (Extract, Transform, Load) tool called AWS Glue, designed to streamline the process of preparing and loading data into various AWS services. Whether you’re managing data lakes, performing analytics, or building machine learning pipelines, AWS Glue simplifies data integration by automating key tasks. One of its standout features is the AWS Glue Crawler, which discovers and organizes metadata about your data, making it easier to query, analyze, and manage.

In this Answer, we’ll explore what AWS Glue is, dive deep into how its crawler works with an S3 bucket, and walk through a practical example using a dataset of Netflix movies and TV shows. By the end, you’ll understand how to leverage this tool to unlock the full potential of your data in AWS.

What is AWS Glue?

AWS Glue is a fully managed ETL service that integrates seamlessly with other AWS offerings like Amazon S3, Redshift, and Athena. It handles three core functions:

  • Extract: Pulls data from various sources (e.g., S3, DynamoDB, MongoDB).

  • Transform: Cleans, enriches, or restructures data for downstream use.

  • Load: Deposits the processed data into a target AWS service.

Beyond ETL, AWS Glue catalogs your data by collecting and storing metadata—information about the data, such as its structure, datatypes, partitions, and schema. This metadata is stored in the AWS Glue Data Catalog, a centralized repository that acts as a metadata hub, enabling tools like Amazon Athena to query data efficiently.

Understanding the AWS Glue Crawler

The AWS Glue Crawler is a key component that automates metadata discovery. It scans your data sources, infers their structure, and populates the Data Catalog with organized tables. This eliminates the need to manually define schemas, saving time and reducing errors.

How does the crawler work?

  • Scanning: The crawler explores data in sources like S3 buckets, Delta Lakes, or DynamoDB. It navigates folder structures, identifies files, and reads their contents without altering them. For example, it can scan s3://my-bucket/movies/ to find partitioned CSVs.

  • Inference: It analyzes files to determine their format (e.g., CSV, JSON), partitions (e.g., year=2006), and column data types (e.g., title: string). By sampling data, it builds a schema automatically, adapting to variations like missing headers.

  • Storage: The crawler saves its findings as tables in the AWS Glue Data Catalog, detailing schema and locations. It creates new tables or updates existing ones, ensuring metadata like s3://my-bucket/movies/ is query-ready.

Example: Using AWS Glue Crawler with an S3 bucket

Let’s walk through a hands-on example of setting up an AWS Glue Crawler to catalog metadata from an S3 bucket. Our dataset consists of CSV files containing Netflix movies and TV shows, partitioned by release year.

1. Dataset

We will use a dataset that contains several CSVs of Netflix movies and TV shows, partitioned according to their year of release.

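As a rough, hypothetical sketch (the file names here are illustrative), the local dataset folder is organized so that each release year has its own year=<YYYY> subfolder, which the crawler later picks up as a partition:

```bash
# Hypothetical local layout of the dataset (file names are illustrative).
#
#   movies/
#   ├── year=2006/netflix_titles.csv
#   ├── year=2007/netflix_titles.csv
#   └── ... (one subfolder per release year)

# List the partition folders locally to confirm the layout.
ls movies/
```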

The following steps show how we use the crawler on our dataset.

2. Uploading data to S3 bucket

We first create an S3 bucket with a folder to which we upload our dataset. We can do this using the two commands given below:

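A minimal sketch of these two commands with the AWS CLI might look like this:

```bash
# Create the S3 bucket (bucket names are globally unique, so replace
# educative-3213 with a name of your own when following along).
aws s3 mb s3://educative-3213

# Create an empty Movies/ folder (prefix) inside the bucket.
aws s3api put-object --bucket educative-3213 --key Movies/
```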

The first command creates an S3 bucket, called educative-3213, while the second command creates a Movies folder within educative-3213.

Next, we will upload our dataset to the Movies folder in the S3 bucket using the following command:

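With the AWS CLI, the upload might look like this, assuming the dataset sits in a local folder named movies:

```bash
# Recursively copy the local movies folder (including all year=<YYYY>
# subfolders) into the Movies folder of the bucket.
aws s3 cp ./movies s3://educative-3213/Movies/ --recursive
```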

The recursive flag makes the command apply to all files and folders within the specified directory, which, in our case, means all the files and folders inside our local movies folder.

After running the commands above, we are able to see the S3 bucket, containing a Movies folder with all our data.

3. Creating a database in Glue

The crawler requires a database to use as its output location; the metadata it extracts is stored in a table inside this database.

In AWS Glue, we create a database, naming it crawler-metadata-educative, using the following command:

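A sketch of this command with the AWS CLI might look like the following; the LocationUri entry is optional and simply points the database at our Movies folder:

```bash
# Create the Glue database that will hold the crawler's output tables.
# LocationUri is optional, informational metadata pointing at our data.
aws glue create-database --database-input '{
  "Name": "crawler-metadata-educative",
  "LocationUri": "s3://educative-3213/Movies/"
}'
```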

After running the command above, we can see a new, empty database on the “AWS Glue > Data Catalog > Databases” page, which we can reach by going to the AWS Glue homepage and clicking “Databases” in the sidebar. This database is pointed toward the Movies folder in the bucket we created earlier, primarily for monitoring purposes.

4. Creating an IAM role

The crawler needs several permissions to access the S3 bucket. We use an IAM Role for this.

An AWS Identity and Access Management (IAM) role grants selective, temporary permissions to AWS services: the permission policies attached to the role define what it is allowed to do, while its trust policy defines which AWS services are allowed to assume it.

Every IAM role requires a trust policy, which specifies who is allowed to assume the role. We use the following trust policy for our role.

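A sketch of what trust.json might contain, written here with a shell heredoc for convenience:

```bash
# Write the trust policy to trust.json: only the AWS Glue service is
# allowed to assume this role.
cat > trust.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "glue.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
```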

In the policy above, we specify that the sts:AssumeRole action can only be performed by the Glue service (glue.amazonaws.com), i.e., only AWS Glue can assume this role.

The role also needs permission policies attached to it so that it has the necessary access to resources.

The following three commands are used for the complete creation of our required IAM role.

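With the AWS CLI, the three commands might look like this, assuming the standard ARNs of the two AWS-managed policies:

```bash
# 1. Create the role with the trust policy from trust.json.
aws iam create-role \
  --role-name AWSGlueServiceRoleEduc \
  --assume-role-policy-document file://trust.json

# 2. Attach the AWS-managed policy that grants the permissions Glue needs.
aws iam attach-role-policy \
  --role-name AWSGlueServiceRoleEduc \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# 3. Attach S3 access so the crawler can read our bucket.
aws iam attach-role-policy \
  --role-name AWSGlueServiceRoleEduc \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
```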

The first command creates an IAM role named AWSGlueServiceRoleEduc with the trust policy written in the trust.json file. The second command attaches the “AWSGlueServiceRole” permissions policy to the role, giving it access to several required services, while the third command attaches “AmazonS3FullAccess,” giving the role further access to S3 buckets.

After running the commands above, we can find our AWSGlueServiceRoleEduc listed on the “IAM > Roles” page. To access it, go to the IAM homepage and click “Roles” in the sidebar.

5. Creating a crawler

After the steps above, we now create the crawler we will be using. We can do this using the following command.

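A sketch of this command with the AWS CLI:

```bash
# Create the crawler: the S3 target tells it what to scan, and the
# database name tells it where to store the resulting metadata tables.
aws glue create-crawler \
  --name movies-crawler-educative \
  --role AWSGlueServiceRoleEduc \
  --database-name crawler-metadata-educative \
  --targets '{"S3Targets": [{"Path": "s3://educative-3213/Movies/"}]}'
```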

With the command above, we create a crawler named movies-crawler-educative. We give it the location of our Movies folder in the S3 bucket as the data source, which tells the crawler which data to extract metadata from. We also specify crawler-metadata-educative as the output database.

After running the above command, we find our crawler on the “AWS Glue > Crawlers” page, with its state being “Ready.” We can get to this page by going to the AWS Glue homepage and clicking on “Crawlers” from the sidebar.

6. Running the crawler

With our setup complete, we finally run our crawler using the following command.

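With the AWS CLI, the run can be started like this:

```bash
# Start a crawler run; the command returns immediately while the crawl
# continues in the background.
aws glue start-crawler --name movies-crawler-educative
```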

When this command is run, the “AWS Glue > Crawlers” page shows the crawler movies-crawler-educative in a “Running” state. After some time, it changes to a “Stopping” state. Under “Table changes,” it should show “1 created,” meaning that the crawler created a table during this run.

The crawler’s final state will be “Ready,” with the “Last run” showing a “Succeeded” sign.

7. Metadata table

By opening the movies-crawler-educative page, we see, under “Table changes,” that the crawler has created 1 new table and identified 13 different partitions.

The crawler has saved the metadata in the database we created and specified for it. A new table, named movies, has been created by the crawler within the database crawler-metadata-educative. The number of partitions in this table can be checked using the following command.

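One way to do this from the AWS CLI is to list the table’s partitions and count them:

```bash
# Count the partitions the crawler registered for the movies table.
# --query applies a JMESPath expression to return just the count.
aws glue get-partitions \
  --database-name crawler-metadata-educative \
  --table-name movies \
  --query 'length(Partitions)'
```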

The table contains detailed information about our Movies data. It identifies all partitions, along with other details about our data, which can be seen on the “AWS Glue > Data Catalog > Databases > Tables > movies” page. To get there, go to the AWS Glue homepage, click “Tables” in the sidebar, and then choose the movies table.

However, if we run the crawler again, no new table will be produced, because our data’s structure, along with other metadata components, remains unchanged.

Practice

Enter your AWS AccessKeyID and AWS SecretAccessKey, and then run the commands given above in a terminal. If you don’t have these keys, follow the steps in this documentation under the “Managing access keys (console)” heading to generate the keys.

Note: Kindly remember the following instructions.

  • In the commands above, you should change the name of the bucket to make it globally unique. Every command using the bucket's name should reflect this change.

  • After running the command to run the crawler, wait for the state of the crawler to change to "Ready" before running the last command. This usually takes up to 2-3 minutes.


Get hands-on experience with “Building ETL Pipelines on AWS” Cloud Lab and master the art of creating efficient ETL data pipelines with AWS Glue. Start now and transform raw data into actionable insights!

Benefits of using an AWS Glue Crawler

Here are the benefits of using an AWS Glue Crawler:

  • Automates metadata discovery: Scans and infers schemas/partitions, saving time.

  • Simplifies integration: Populates the Data Catalog for easy use with Athena or ETL tools.

  • Boosts query speed: Identifies partitions for faster, cost-effective queries.

  • Enhances governance: Enables secure, role-based access control.

  • Cuts costs: Reduces manual effort and resource usage.

Conclusion

The AWS Glue Crawler is a useful tool for extracting and storing the metadata of a dataset. It stores the required information in an organized manner and can detect changes to the structure and partitions of the data when it’s run again.

