What is AWS Glue Crawler?

AWS Glue is an ETL (extract, transform, load) service offered by Amazon Web Services. It helps you take data from different sources, transform it, and load it into other AWS services. It also discovers and organizes metadata about your data to make the data easier to understand and manage. Metadata is information about the data, not the data itself; it covers details such as data types, structure, and partitions. The AWS Glue Data Catalog is the repository that stores this metadata.

A crawler is a Glue feature used to collect and store metadata. It scans the given data, infers its structure and other details without external help, and stores the metadata in an organized manner in a newly created table. This can be useful for several tasks, such as writing queries, controlling access, or analyzing data. Crawlers can be used with several data stores, such as Amazon S3 buckets, Amazon DynamoDB, MongoDB, and Delta Lake tables.

In this Answer, we will see how a crawler can be used to infer and store the metadata of an S3 bucket in AWS.

Dataset

We will use the following dataset, which contains several CSV files of movies and TV shows from Netflix, partitioned according to the year of release.

,show_id,type,title,director,country,date_added,release_year,rating,duration,listed_in
307,s5618,Movie,Happy New Year,Farah Khan,India,2/1/2017,2006,TV-14,179 min,"Action & Adventure, Comedies, Dramas"
306,s5617,Movie,Dilwale,Rohit Shetty,India,2/1/2017,2006,TV-PG,154 min,"Action & Adventure, Dramas, International Movies"
305,s5616,Movie,Imperial Dreams,Malik Vitthal,United States,2/3/2017,2006,TV-MA,86 min,Dramas
304,s5615,Movie,Daniel Sosa: Sosafado,"Raúl Campos, Jan Suter",Mexico,2/3/2017,2006,TV-MA,78 min,Stand-Up Comedy
303,s5614,Movie,"Michael Bolton's Big, Sexy Valentine's Day Special","Scott Aukerman, Akiva Schaffer",United States,2/7/2017,2006,TV-MA,54 min,"Comedies, Music & Musicals, Romantic Movies"
302,s5613,Movie,Hitler - A Career,"Joachim Fest, Christian Herrendoerfer",West Germany,2/10/2017,2006,TV-MA,150 min,"Documentaries, International Movies"
301,s5611,Movie,David Brent: Life on the Road,Ricky Gervais,United Kingdom,2/10/2017,2006,TV-MA,97 min,"Comedies, International Movies, Music & Musicals"
300,s5608,Movie,Katherine Ryan: In Trouble,Colin Dench,United Kingdom,2/14/2017,2006,TV-MA,64 min,Stand-Up Comedy
299,s5607,Movie,Girlfriend's Day,Michael Paul Stephenson,United States,2/14/2017,2006,TV-MA,71 min,"Comedies, Independent Movies"
298,s5606,Movie,The Memory of Water,Matías Bize,Chile,2/15/2017,2006,TV-MA,88 min,"Dramas, International Movies"
297,s5605,Movie,The Fury of a Patient Man,Raúl Arévalo,Spain,2/15/2017,2006,TV-MA,92 min,"International Movies, Thrillers"
296,s5603,Movie,Rush: Beyond the Lighted Stage,"Sam Dunn, Scot McFadyen",Canada,2/15/2017,2006,TV-MA,107 min,"Documentaries, Music & Musicals"
295,s5601,Movie,A Heavy Heart,Thomas Stuber,Germany,2/15/2017,2006,TV-MA,109 min,"Dramas, Independent Movies, International Movies"
294,s5600,Movie,Rocky Handsome,Nishikant Kamat,India,2/17/2017,2006,TV-MA,119 min,"Action & Adventure, International Movies"
293,s5598,Movie,Tini: The New Life of Violetta,Juan Pablo Buscarini,Spain,2/19/2017,2006,G,99 min,"Children & Family Movies, Music & Musicals"
292,s5597,Movie,Growing Up Wild,Keith Scholey,United States,2/19/2017,2006,G,78 min,"Children & Family Movies, Documentaries"
291,s5596,Movie,Boy Missing,Mar Targarona,Spain,2/19/2017,2006,TV-MA,105 min,"International Movies, Thrillers"
290,s5595,Movie,Trevor Noah: Afraid of the Dark,David Paul Meyer,United States,2/21/2017,2006,TV-19,67 min,Stand-Up Comedy
289,s5591,Movie,I Don't Feel at Home in This World Anymore,Macon Blair,United States,2/24/2017,2006,TV-MA,97 min,"Dramas, Independent Movies, Thrillers"
288,s5590,Movie,Operações Especiais,Tomas Portella,Brazil,2/25/2017,2006,TV-MA,99 min,"Action & Adventure, International Movies"
287,s5589,Movie,Jonas,Lô Politi,Brazil,2/26/2017,2006,TV-MA,97 min,"Dramas, International Movies"
286,s5588,Movie,Force 7,Abhinay Deo,India,2/27/2017,2006,TV-19,123 min,"Action & Adventure, International Movies"
285,s5587,Movie,Mike Birbiglia: Thank God for Jokes,"Seth Barrish, Mike Birbiglia",United States,2/28/2017,2006,TV-MA,71 min,Stand-Up Comedy
284,s5585,Movie,Nila,Selvamani Selvaraj,India,3/1/2017,2006,TV-MA,94 min,"Dramas, International Movies, Romantic Movies"
283,s5583,Movie,Amy Schumer: The Leather Special,Amy Schumer,United States,3/7/2017,2006,TV-MA,57 min,Stand-Up Comedy
282,s5582,Movie,The Butterfly's Dream,Yılmaz Erdoğan,Turkey,3/10/2017,2006,TV-PG,118 min,"Dramas, International Movies, Romantic Movies"
281,s5827,Movie,Real Crime: Diamond Geezers,Tom Whitter,United Kingdom,8/1/2016,2006,TV-18,46 min,Documentaries
280,s5825,Movie,Interview with a Serial Killer,Christopher Martin,United States,8/1/2016,2006,TV-MA,45 min,Documentaries
279,s5822,Movie,Children of God,John Smithson,United Kingdom,8/1/2016,2006,TV-MA,63 min,Documentaries
278,s5821,Movie,Lavell Crawford: Can a Brother Get Some Love?,Michael Drumm,United States,8/2/2016,2006,TV-MA,81 min,Stand-Up Comedy
277,s5820,Movie,David Cross: Making America Great Again!,Alex Coletti,United States,8/5/2016,2006,TV-MA,73 min,Stand-Up Comedy
276,s5819,Movie,Jim Gaffigan: Obsessed,Jay Chapman,United States,8/11/2016,2006,TV-14,70 min,Stand-Up Comedy
275,s5818,Movie,Jim Gaffigan: Mr. Universe,Jay Karas,United States,8/11/2016,2006,TV-14,77 min,Stand-Up Comedy
274,s5817,Movie,Jim Gaffigan: King Baby,Troy Miller,United States,8/11/2016,2006,TV-PG,71 min,Stand-Up Comedy
273,s5816,Movie,Jim Gaffigan: Beyond the Pale,Michael Drumm,United States,8/11/2016,2006,TV-18,72 min,Stand-Up Comedy
272,s5811,Movie,I'll Sleep When I'm Dead,Justin Krook,United States,8/19/2016,2006,TV-MA,80 min,"Documentaries, Music & Musicals"
271,s5810,Movie,XOXO,Christopher Louie,United States,8/26/2016,2006,TV-MA,92 min,"Dramas, Music & Musicals"
270,s5809,Movie,Jeff Foxworthy and Larry the Cable Guy: We’ve Been Thinking...,Jay Karas,United States,8/26/2016,2006,TV-18,75 min,Stand-Up Comedy
269,s5798,Movie,Extremis,Dan Krauss,United States,9/13/2016,2006,TV-PG,25 min,Documentaries
268,s5797,Movie,Sample This,Dan Forrer,United States,9/15/2016,2006,TV-18,83 min,"Documentaries, Music & Musicals"
267,s5796,Movie,The White Helmets,Orlando von Einsiedel,United Kingdom,9/16/2016,2006,TV-PG,41 min,Documentaries
266,s5794,Movie,Cedric the Entertainer: Live from the Ville,Troy Miller,United States,9/16/2016,2006,TV-MA,60 min,Stand-Up Comedy
265,s5793,Movie,ARQ,Tony Elliott,Canada,9/16/2016,2006,TV-MA,89 min,"International Movies, Sci-Fi & Fantasy, Thrillers"
264,s5785,Movie,Iliza Shlesinger: Confirmed Kills,Bobcat Goldthwait,United States,9/23/2016,2006,TV-MA,78 min,Stand-Up Comedy
263,s5784,Movie,Audrie & Daisy,"Bonni Cohen, Jon Shenk",United States,9/23/2016,2006,TV-MA,99 min,Documentaries
262,s5783,Movie,Amanda Knox,"Rod Blackhurst, Brian McGinn",Denmark,9/30/2016,2006,TV-MA,92 min,Documentaries
261,s5781,Movie,Welcome Mr. President,Riccardo Milani,Italy,10/1/2016,2006,TV-MA,99 min,"Comedies, International Movies"
260,s5780,Movie,Unchained: The Untold Story of Freestyle Motocross,"Paul Taublieb, Jon Freeman",United States,10/1/2016,2006,TV-MA,92 min,"Documentaries, Sports Movies"
259,s5779,Movie,Umrika,Prashant Nair,India,10/1/2016,2006,TV-MA,96 min,"Dramas, Independent Movies, International Movies"
258,s5777,Movie,Riphagen - The Untouchable,Pieter Kuijpers,Netherlands,10/1/2016,2006,TV-18,132 min,"Dramas, International Movies"
257,s5775,TV Show,Old Money,David Schalko,United States,10/1/2016,2006,TV-MA,5 Season,"International TV Shows, TV Comedies, TV Dramas"
256,s5774,Movie,My Little Pony Equestria Girls: Legend of Everfree,Ishi Rudell,United States,10/1/2016,2006,TV-Y11,73 min,"Children & Family Movies, Comedies"
255,s5773,Movie,My Big Night,Álex de la Iglesia,Spain,10/1/2016,2006,TV-MA,97 min,"Comedies, International Movies, Music & Musicals"
254,s5771,Movie,Much Ado About Nothing,Alejandro Fernández Almendras,Chile,10/1/2016,2006,TV-MA,96 min,"Dramas, Independent Movies, International Movies"
253,s5768,Movie,Harud,Aamir Bashir,India,10/1/2016,2006,TV-MA,100 min,"Dramas, International Movies"
252,s5766,Movie,Chatô: The King of Brazil,Guilherme Fontes,Brazil,10/1/2016,2006,TV-MA,105 min,"Dramas, International Movies"
251,s5765,Movie,Bombshell,Riccardo Pilizzeri,New Zealand,10/1/2016,2006,TV-MA,86 min,Dramas
250,s5762,Movie,LEGO Jurassic World: The Indominus Escape,Michael D. Black,United States,10/4/2016,2006,TV-Y11,25 min,"Children & Family Movies, Comedies"
249,s5761,Movie,The Siege of Jadotville,Richie Smyth,Ireland,10/7/2016,2006,TV-MA,108 min,"Action & Adventure, Dramas, International Movies"
248,s5757,Movie,Justin Timberlake + the Tennessee Kids,Jonathan Demme,United States,10/12/2016,2006,TV-MA,90 min,Music & Musicals
247,s5756,Movie,Mascots,Christopher Guest,United States,10/13/2016,2006,TV-MA,95 min,Comedies
246,s5755,Movie,Sky Ladder: The Art of Cai Guo-Qiang,Kevin MacDonald,United States,10/14/2016,2006,TV-MA,80 min,Documentaries
245,s5751,Movie,Blind Date,Clovis Cornillac,France,10/15/2016,2006,TV-14,91 min,"Comedies, International Movies, Music & Musicals"
244,s5750,Movie,Bleach the Movie: Hell Verse,Noriyuki Abe,Japan,10/15/2016,2006,TV-14,94 min,"Action & Adventure, Anime Features, Sci-Fi & Fantasy"
243,s5749,Movie,Bleach The Movie: Fade to Black,Noriyuki Abe,Japan,10/15/2016,2006,TV-PG,94 min,"Action & Adventure, Anime Features, Sci-Fi & Fantasy"
242,s5748,Movie,Berserk: The Golden Age Arc I - The Egg of the King,Toshiyuki Kubooka,Japan,10/15/2016,2006,TV-MA,77 min,"Action & Adventure, Anime Features, International Movies"
241,s5747,Movie,A Mighty Team,Thomas Sorriaux,France,10/15/2016,2006,TV-MA,97 min,"Comedies, International Movies, Sports Movies"
240,s5746,Movie,Joe Rogan: Triggered,Anthony Giordano,United States,10/21/2016,2006,TV-MA,64 min,Stand-Up Comedy
239,s5744,Movie,11 años,Roger Gual,Spain,10/27/2016,2006,TV-MA,77 min,"Dramas, International Movies"
238,s5743,Movie,West Coast,Benjamin Weill,France,10/28/2016,2006,TV-MA,81 min,"Comedies, Dramas, International Movies"
237,s5741,Movie,They Are Everywhere,Yvan Attal,France,10/28/2016,2006,TV-MA,110 min,"Comedies, International Movies"
236,s5740,Movie,The African Doctor,Julien Rambaldi,France,10/28/2016,2006,TV-18,94 min,"Comedies, Dramas, International Movies"
235,s5739,Movie,Into the Inferno,Werner Herzog,United Kingdom,10/28/2016,2006,TV-PG,107 min,Documentaries
234,s5738,Movie,I Am the Pretty Thing That Lives in the House,Osgood Perkins,Canada,10/28/2016,2006,TV-18,89 min,"Horror Movies, International Movies, Thrillers"
233,s5737,Movie,Pup Star,Robert Vince,Canada,10/29/2016,2006,G,92 min,"Children & Family Movies, Comedies"
232,s5736,Movie,Spanish Affair 6,Emilio Martínez Lázaro,Spain,11/1/2016,2006,TV-MA,107 min,"Comedies, International Movies, Romantic Movies"
231,s5734,Movie,Norman Lear: Just Another Version of You,"Heidi Ewing, Rachel Grady",United States,11/1/2016,2006,TV-MA,91 min,Documentaries
230,s5727,Movie,A Grand Night In: The Story of Aardman,Richard Mears,United Kingdom,11/1/2016,2006,TV-PG,59 min,Documentaries
229,s5726,Movie,The Ivory Game,"Kief Davidson, Richard Ladkani",Austria,11/4/2016,2006,TV-18,112 min,Documentaries
228,s5725,Movie,"Dana Carvey: Straight White Male, 64",Marcus Raboy,United States,11/4/2016,2006,TV-MA,64 min,Stand-Up Comedy
227,s5722,Movie,Kathleen Madigan: Bothering Jesus,Lorene Machado,United States,11/10/2016,2006,TV-MA,71 min,Stand-Up Comedy
226,s5735,Movie,Santa Pac's Merry Berry Day,Moto Sakakibara,Not Given,11/1/2016,2006,TV-Y,44 min,Movies
225,s5720,Movie,True Memoirs of an International Assassin,Jeff Wadlow,United States,11/11/2016,2006,TV-18,98 min,"Action & Adventure, Comedies"
224,s5719,Movie,Mumbai Cha Raja,Manjeet Singh,India,11/15/2016,2006,TV-MA,77 min,"Dramas, Independent Movies, International Movies"
223,s5715,Movie,Divines,Houda Benyamina,France,11/18/2016,2006,TV-MA,107 min,"Dramas, Independent Movies, International Movies"
222,s5714,Movie,Colin Quinn: The New York Story,Jerry Seinfeld,United States,11/18/2016,2006,TV-MA,62 min,Stand-Up Comedy
Dataset
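A crawler infers partitions from the folder structure under the S3 path it scans, so the local movies folder is expected to hold one subfolder per release year. The sketch below builds such a layout with placeholder files. The `year=` folder prefix, the years used, and the `data.csv` file name are assumptions for illustration, not the names used in the original dataset. With Hive-style `key=value` folder names, a crawler typically names the partition column after the key; with plain folder names it falls back to generic names such as partition_0.

```shell
#!/usr/bin/env bash
# Build a small, hypothetical partitioned layout: one folder per release year.
# The "year=" prefix and "data.csv" file name are illustrative assumptions.
root="$(mktemp -d)/movies"

for year in 2005 2006 2007; do
  mkdir -p "$root/year=$year"
  # Each partition folder holds CSV files that share the same header.
  printf 'show_id,type,title,release_year\n' > "$root/year=$year/data.csv"
done

# List the resulting files, one per partition folder.
find "$root" -type f | sort
```

The folder key is deliberately named `year` rather than `release_year`, so that the partition column does not clash with the column of the same name inside the CSV files.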

AWS

The following steps show how we use the crawler on our dataset.

Uploading data to S3 bucket

We first create an S3 bucket with a folder into which we will upload our dataset. We can do this using the two commands given below:

aws s3api create-bucket --bucket educative-3213
aws s3api put-object --bucket educative-3213 --key Movies/ --content-length 0
Command to create bucket and a folder in it

The first command creates an S3 bucket called educative-3213, while the second command creates a Movies folder within it.
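The create-bucket command as written works in the us-east-1 region; in other regions, S3 expects a location constraint that matches the region. The sketch below only prints the command that would fit a given region, so it can be reviewed before running. The helper name `bucket_create_cmd` and the example region eu-west-1 are assumptions, not part of the original walkthrough.

```shell
#!/usr/bin/env bash
# Print (without running) a create-bucket command suited to the given region.
# us-east-1 takes no location constraint; other regions need one.
bucket_create_cmd() {
  local bucket="$1" region="$2"
  if [ "$region" = "us-east-1" ]; then
    printf 'aws s3api create-bucket --bucket %s --region %s\n' "$bucket" "$region"
  else
    printf 'aws s3api create-bucket --bucket %s --region %s --create-bucket-configuration LocationConstraint=%s\n' \
      "$bucket" "$region" "$region"
  fi
}

bucket_create_cmd educative-3213 us-east-1
bucket_create_cmd educative-3213 eu-west-1
```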

Next, we will upload our dataset to the Movies folder in the S3 bucket using the following command:

aws s3 cp movies s3://educative-3213/Movies/ --recursive
Command to upload dataset in the S3 bucket

The --recursive flag makes the command apply to all files and folders within the given directory, which, in our case, is our local movies folder.

After running the commands above, we can see the S3 bucket containing a Movies folder with all our data.

Creating database in Glue

The crawler requires a database to write its output to; the metadata it collects is stored in a table inside this database.

In AWS Glue, we create a database, naming it crawler-metadata-educative, using the following command:

aws glue create-database --database-input "{\"Name\":\"crawler-metadata-educative\", \"LocationUri\":\"s3://educative-3213/Movies/\"}"
Command to create database

After running the command above, we can see a new, empty database on the “AWS Glue > Data Catalog > Databases” page, which we can reach from the AWS Glue homepage by clicking “Databases” in the sidebar. The database's location points to the Movies folder in the bucket we created earlier, primarily for monitoring purposes.

Creating an IAM role

For the crawler to be able to access the S3 bucket, it needs several permissions. For this, we use an IAM role.

An AWS Identity and Access Management (IAM) role is an identity that carries selective permissions to resources. AWS services can assume a role to temporarily gain the permissions defined by the permissions policies attached to it. Which services can assume the role is defined by the trust policy attached to it.

Every IAM role requires a trust policy, which specifies the principals that are allowed to assume the role. We use the following trust policy for our role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Trust policy for IAM role

In the policy above, we specify that only the AWS Glue service (glue.amazonaws.com) is allowed to perform the sts:AssumeRole action.

The role also needs permissions policies attached to it so that it can access the resources it requires.

The following three commands create the IAM role and attach the required policies.
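The role-creation command reads the trust policy from a local file named trust.json. One way to produce that file, sketched here, is to write the policy shown above with a here-document in the directory where the commands will be run. The final `grep` check is only an illustrative sanity test, not a full JSON validation.

```shell
#!/usr/bin/env bash
# Write the trust policy shown above to trust.json in the current directory.
cat > trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "glue.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Quick sanity check that the Glue service principal is present.
grep -q '"Service": "glue.amazonaws.com"' trust.json && echo "trust.json written"
```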

aws iam create-role --role-name AWSGlueServiceRoleEduc --assume-role-policy-document file://trust.json
aws iam attach-role-policy \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole \
--role-name AWSGlueServiceRoleEduc
aws iam attach-role-policy \
--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess \
--role-name AWSGlueServiceRoleEduc
Commands to create IAM role

The first command creates an IAM role named AWSGlueServiceRoleEduc, with the trust policy written in the trust.json file. The second command attaches the “AWSGlueServiceRole” permissions policy to that role, which gives the role access to several required services, while the third command attaches the “AmazonS3FullAccess” policy, which gives the role further access to S3 buckets.

After running the commands above, we can find AWSGlueServiceRoleEduc listed on the “IAM > Roles” page, which we can reach from the IAM homepage by clicking “Roles” in the sidebar.
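To confirm from the command line that both policies were attached, the attached managed policies of the role can be listed. The sketch below wraps that call in a helper; the function name `list_role_policies` is an assumption for illustration, and the call needs configured AWS credentials to run.

```shell
#!/usr/bin/env bash
# List the names of the managed policies attached to a role.
list_role_policies() {
  local role="$1"
  aws iam list-attached-role-policies \
    --role-name "$role" \
    --query 'AttachedPolicies[].PolicyName' \
    --output text
}

# Usage (requires configured AWS credentials):
#   list_role_policies AWSGlueServiceRoleEduc
```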

Creating a crawler

After the steps above, we now create the crawler we will be using. We can do this by using the following command.

aws glue create-crawler \
--name movies-crawler-educative --role AWSGlueServiceRoleEduc \
--targets '{"S3Targets": [{"Path": "s3://educative-3213/Movies/"}]}' \
--database-name crawler-metadata-educative
Command to create a crawler

With the command above, we create a crawler named movies-crawler-educative. We give it the location of our Movies folder in the S3 bucket as the data source; this tells the crawler which data to collect metadata for. We also specify crawler-metadata-educative as the database to write the output to.

After running the above command, we find our crawler on the “AWS Glue > Crawlers” page, with its state shown as “Ready”. We can reach this page from the AWS Glue homepage by clicking “Crawlers” in the sidebar.

Running the crawler

Once the setup is complete, we run our crawler using the following command.

aws glue start-crawler --name movies-crawler-educative
Command to run the crawler

When this command is run, the “AWS Glue > Crawlers” page shows the crawler movies-crawler-educative in the “Running” state. After some time, it changes to the “Stopping” state. Under “Table changes”, it should show “1 created”, meaning that the crawler created one table during this run.

The final state of the crawler will be “Ready”, with the “Last run” showing a “Succeeded” sign.
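The same state transitions can be followed from the command line by polling the crawler's state until it returns to READY. The sketch below is one way to do that; the function names `get_crawler_state` and `wait_for_crawler`, the 15-second interval, and the limit of 40 polls are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Fetch the current crawler state (READY, RUNNING, or STOPPING).
get_crawler_state() {
  aws glue get-crawler --name "$1" --query 'Crawler.State' --output text
}

# Poll until the crawler reports READY, or give up after max_polls attempts.
# Arguments: crawler name, seconds between polls, maximum number of polls.
wait_for_crawler() {
  local name="$1"
  local interval="${2:-15}"
  local max_polls="${3:-40}"
  local i=0
  local state
  while [ "$i" -lt "$max_polls" ]; do
    state="$(get_crawler_state "$name")"
    echo "crawler $name: $state"
    if [ "$state" = "READY" ]; then
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}

# Usage (requires configured AWS credentials), right after start-crawler:
#   wait_for_crawler movies-crawler-educative
```

Keeping the state lookup in its own function makes it easy to swap in a stub when trying the loop without an AWS account.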

Metadata table

By opening the movies-crawler-educative page, we see under “Table changes” that the crawler has created 1 new table and has also identified 13 different partitions.

The crawler has saved the metadata in the database we created and specified for it. A new table named movies has been created by the crawler within the database crawler-metadata-educative. The number of partitions in this table can be checked using the following command.

aws glue get-partitions \
--database-name crawler-metadata-educative --table-name movies \
--query 'length(Partitions[])'
Checking number of partitions

The table holds several pieces of information about our Movies data. It records all the partitions, along with other details about our data, which can be seen on the “AWS Glue > Data Catalog > Databases > Tables > movies” page. We can reach this page from the AWS Glue homepage by clicking “Tables” in the sidebar and then choosing the movies table.

If we run the crawler again without changing the data, no new table is created, because the data's structure and other metadata remain unchanged.
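The inferred schema can also be read from the command line instead of the console. The sketch below fetches the table definition and keeps only the column and partition-key names and types; the helper name `show_table_schema` and the shape of the `--query` projection are assumptions for illustration, and the call needs configured AWS credentials to run.

```shell
#!/usr/bin/env bash
# Print the columns and partition keys that the crawler inferred for a table.
show_table_schema() {
  local database="$1" table="$2"
  aws glue get-table \
    --database-name "$database" \
    --name "$table" \
    --query '{Columns: Table.StorageDescriptor.Columns[].[Name, Type], PartitionKeys: Table.PartitionKeys[].[Name, Type]}' \
    --output json
}

# Usage (requires configured AWS credentials):
#   show_table_schema crawler-metadata-educative movies
```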

Practice

Configure your AWS access key ID and secret access key (for example, with aws configure), and then run the commands given above in a terminal. If you don’t have these keys, follow the steps under the “Managing access keys (console)” heading of the AWS IAM documentation to generate them.

Note: Kindly remember the following instructions.

  • In the commands above, you should change the name of the bucket to make it globally unique. Every command using the bucket's name should reflect this change.

  • After running the command to run the crawler, wait for the state of the crawler to change to "Ready" before running the last command. This usually takes 2-3 minutes.

Conclusion

An AWS Glue crawler is a useful tool for extracting and storing the metadata of a dataset. It stores the required information in an organized manner and can also detect changes to the structure and partitions of the data when it is run again.

Copyright ©2024 Educative, Inc. All rights reserved