AWS Glue is an ETL (extract, transform, load) service offered by Amazon Web Services. It helps you extract data from different sources, transform it, and load it into other AWS services. It also discovers and organizes metadata about your data to make the data easier to understand and manage. Metadata is information about the data rather than the data itself; it covers details such as data types, structure, and partitions. The AWS Glue Data Catalog is the repository that stores this metadata.
A crawler is a Glue feature that collects and stores metadata. It scans the data, infers its structure and other details automatically, and stores the metadata in an organized manner in a newly created table. This is useful for tasks such as writing queries, controlling access, and analyzing data. Crawlers work with several data stores, such as Amazon S3, DynamoDB, MongoDB, and Delta Lake tables.
In this Answer, we will see how a crawler can be used to infer and store the metadata of an S3 bucket in AWS.
We will use the following dataset, containing several CSVs of movies and TV shows from Netflix, partitioned according to the year of release.
,show_id,type,title,director,country,date_added,release_year,rating,duration,listed_in
307,s5618,Movie,Happy New Year,Farah Khan,India,2/1/2017,2006,TV-14,179 min,"Action & Adventure, Comedies, Dramas"
306,s5617,Movie,Dilwale,Rohit Shetty,India,2/1/2017,2006,TV-PG,154 min,"Action & Adventure, Dramas, International Movies"
305,s5616,Movie,Imperial Dreams,Malik Vitthal,United States,2/3/2017,2006,TV-MA,86 min,Dramas
304,s5615,Movie,Daniel Sosa: Sosafado,"Raúl Campos, Jan Suter",Mexico,2/3/2017,2006,TV-MA,78 min,Stand-Up Comedy
303,s5614,Movie,"Michael Bolton's Big, Sexy Valentine's Day Special","Scott Aukerman, Akiva Schaffer",United States,2/7/2017,2006,TV-MA,54 min,"Comedies, Music & Musicals, Romantic Movies"
302,s5613,Movie,Hitler - A Career,"Joachim Fest, Christian Herrendoerfer",West Germany,2/10/2017,2006,TV-MA,150 min,"Documentaries, International Movies"
301,s5611,Movie,David Brent: Life on the Road,Ricky Gervais,United Kingdom,2/10/2017,2006,TV-MA,97 min,"Comedies, International Movies, Music & Musicals"
300,s5608,Movie,Katherine Ryan: In Trouble,Colin Dench,United Kingdom,2/14/2017,2006,TV-MA,64 min,Stand-Up Comedy
299,s5607,Movie,Girlfriend's Day,Michael Paul Stephenson,United States,2/14/2017,2006,TV-MA,71 min,"Comedies, Independent Movies"
298,s5606,Movie,The Memory of Water,Matías Bize,Chile,2/15/2017,2006,TV-MA,88 min,"Dramas, International Movies"
297,s5605,Movie,The Fury of a Patient Man,Raúl Arévalo,Spain,2/15/2017,2006,TV-MA,92 min,"International Movies, Thrillers"
296,s5603,Movie,Rush: Beyond the Lighted Stage,"Sam Dunn, Scot McFadyen",Canada,2/15/2017,2006,TV-MA,107 min,"Documentaries, Music & Musicals"
295,s5601,Movie,A Heavy Heart,Thomas Stuber,Germany,2/15/2017,2006,TV-MA,109 min,"Dramas, Independent Movies, International Movies"
294,s5600,Movie,Rocky Handsome,Nishikant Kamat,India,2/17/2017,2006,TV-MA,119 min,"Action & Adventure, International Movies"
293,s5598,Movie,Tini: The New Life of Violetta,Juan Pablo Buscarini,Spain,2/19/2017,2006,G,99 min,"Children & Family Movies, Music & Musicals"
292,s5597,Movie,Growing Up Wild,Keith Scholey,United States,2/19/2017,2006,G,78 min,"Children & Family Movies, Documentaries"
291,s5596,Movie,Boy Missing,Mar Targarona,Spain,2/19/2017,2006,TV-MA,105 min,"International Movies, Thrillers"
290,s5595,Movie,Trevor Noah: Afraid of the Dark,David Paul Meyer,United States,2/21/2017,2006,TV-19,67 min,Stand-Up Comedy
289,s5591,Movie,I Don't Feel at Home in This World Anymore,Macon Blair,United States,2/24/2017,2006,TV-MA,97 min,"Dramas, Independent Movies, Thrillers"
288,s5590,Movie,Operações Especiais,Tomas Portella,Brazil,2/25/2017,2006,TV-MA,99 min,"Action & Adventure, International Movies"
287,s5589,Movie,Jonas,Lô Politi,Brazil,2/26/2017,2006,TV-MA,97 min,"Dramas, International Movies"
286,s5588,Movie,Force 7,Abhinay Deo,India,2/27/2017,2006,TV-19,123 min,"Action & Adventure, International Movies"
285,s5587,Movie,Mike Birbiglia: Thank God for Jokes,"Seth Barrish, Mike Birbiglia",United States,2/28/2017,2006,TV-MA,71 min,Stand-Up Comedy
284,s5585,Movie,Nila,Selvamani Selvaraj,India,3/1/2017,2006,TV-MA,94 min,"Dramas, International Movies, Romantic Movies"
283,s5583,Movie,Amy Schumer: The Leather Special,Amy Schumer,United States,3/7/2017,2006,TV-MA,57 min,Stand-Up Comedy
282,s5582,Movie,The Butterfly's Dream,Yılmaz Erdoğan,Turkey,3/10/2017,2006,TV-PG,118 min,"Dramas, International Movies, Romantic Movies"
281,s5827,Movie,Real Crime: Diamond Geezers,Tom Whitter,United Kingdom,8/1/2016,2006,TV-18,46 min,Documentaries
280,s5825,Movie,Interview with a Serial Killer,Christopher Martin,United States,8/1/2016,2006,TV-MA,45 min,Documentaries
279,s5822,Movie,Children of God,John Smithson,United Kingdom,8/1/2016,2006,TV-MA,63 min,Documentaries
278,s5821,Movie,Lavell Crawford: Can a Brother Get Some Love?,Michael Drumm,United States,8/2/2016,2006,TV-MA,81 min,Stand-Up Comedy
277,s5820,Movie,David Cross: Making America Great Again!,Alex Coletti,United States,8/5/2016,2006,TV-MA,73 min,Stand-Up Comedy
276,s5819,Movie,Jim Gaffigan: Obsessed,Jay Chapman,United States,8/11/2016,2006,TV-14,70 min,Stand-Up Comedy
275,s5818,Movie,Jim Gaffigan: Mr. Universe,Jay Karas,United States,8/11/2016,2006,TV-14,77 min,Stand-Up Comedy
274,s5817,Movie,Jim Gaffigan: King Baby,Troy Miller,United States,8/11/2016,2006,TV-PG,71 min,Stand-Up Comedy
273,s5816,Movie,Jim Gaffigan: Beyond the Pale,Michael Drumm,United States,8/11/2016,2006,TV-18,72 min,Stand-Up Comedy
272,s5811,Movie,I'll Sleep When I'm Dead,Justin Krook,United States,8/19/2016,2006,TV-MA,80 min,"Documentaries, Music & Musicals"
271,s5810,Movie,XOXO,Christopher Louie,United States,8/26/2016,2006,TV-MA,92 min,"Dramas, Music & Musicals"
270,s5809,Movie,Jeff Foxworthy and Larry the Cable Guy: We’ve Been Thinking...,Jay Karas,United States,8/26/2016,2006,TV-18,75 min,Stand-Up Comedy
269,s5798,Movie,Extremis,Dan Krauss,United States,9/13/2016,2006,TV-PG,25 min,Documentaries
268,s5797,Movie,Sample This,Dan Forrer,United States,9/15/2016,2006,TV-18,83 min,"Documentaries, Music & Musicals"
267,s5796,Movie,The White Helmets,Orlando von Einsiedel,United Kingdom,9/16/2016,2006,TV-PG,41 min,Documentaries
266,s5794,Movie,Cedric the Entertainer: Live from the Ville,Troy Miller,United States,9/16/2016,2006,TV-MA,60 min,Stand-Up Comedy
265,s5793,Movie,ARQ,Tony Elliott,Canada,9/16/2016,2006,TV-MA,89 min,"International Movies, Sci-Fi & Fantasy, Thrillers"
264,s5785,Movie,Iliza Shlesinger: Confirmed Kills,Bobcat Goldthwait,United States,9/23/2016,2006,TV-MA,78 min,Stand-Up Comedy
263,s5784,Movie,Audrie & Daisy,"Bonni Cohen, Jon Shenk",United States,9/23/2016,2006,TV-MA,99 min,Documentaries
262,s5783,Movie,Amanda Knox,"Rod Blackhurst, Brian McGinn",Denmark,9/30/2016,2006,TV-MA,92 min,Documentaries
261,s5781,Movie,Welcome Mr. President,Riccardo Milani,Italy,10/1/2016,2006,TV-MA,99 min,"Comedies, International Movies"
260,s5780,Movie,Unchained: The Untold Story of Freestyle Motocross,"Paul Taublieb, Jon Freeman",United States,10/1/2016,2006,TV-MA,92 min,"Documentaries, Sports Movies"
259,s5779,Movie,Umrika,Prashant Nair,India,10/1/2016,2006,TV-MA,96 min,"Dramas, Independent Movies, International Movies"
258,s5777,Movie,Riphagen - The Untouchable,Pieter Kuijpers,Netherlands,10/1/2016,2006,TV-18,132 min,"Dramas, International Movies"
257,s5775,TV Show,Old Money,David Schalko,United States,10/1/2016,2006,TV-MA,5 Season,"International TV Shows, TV Comedies, TV Dramas"
256,s5774,Movie,My Little Pony Equestria Girls: Legend of Everfree,Ishi Rudell,United States,10/1/2016,2006,TV-Y11,73 min,"Children & Family Movies, Comedies"
255,s5773,Movie,My Big Night,Álex de la Iglesia,Spain,10/1/2016,2006,TV-MA,97 min,"Comedies, International Movies, Music & Musicals"
254,s5771,Movie,Much Ado About Nothing,Alejandro Fernández Almendras,Chile,10/1/2016,2006,TV-MA,96 min,"Dramas, Independent Movies, International Movies"
253,s5768,Movie,Harud,Aamir Bashir,India,10/1/2016,2006,TV-MA,100 min,"Dramas, International Movies"
252,s5766,Movie,Chatô: The King of Brazil,Guilherme Fontes,Brazil,10/1/2016,2006,TV-MA,105 min,"Dramas, International Movies"
251,s5765,Movie,Bombshell,Riccardo Pilizzeri,New Zealand,10/1/2016,2006,TV-MA,86 min,Dramas
250,s5762,Movie,LEGO Jurassic World: The Indominus Escape,Michael D. Black,United States,10/4/2016,2006,TV-Y11,25 min,"Children & Family Movies, Comedies"
249,s5761,Movie,The Siege of Jadotville,Richie Smyth,Ireland,10/7/2016,2006,TV-MA,108 min,"Action & Adventure, Dramas, International Movies"
248,s5757,Movie,Justin Timberlake + the Tennessee Kids,Jonathan Demme,United States,10/12/2016,2006,TV-MA,90 min,Music & Musicals
247,s5756,Movie,Mascots,Christopher Guest,United States,10/13/2016,2006,TV-MA,95 min,Comedies
246,s5755,Movie,Sky Ladder: The Art of Cai Guo-Qiang,Kevin MacDonald,United States,10/14/2016,2006,TV-MA,80 min,Documentaries
245,s5751,Movie,Blind Date,Clovis Cornillac,France,10/15/2016,2006,TV-14,91 min,"Comedies, International Movies, Music & Musicals"
244,s5750,Movie,Bleach the Movie: Hell Verse,Noriyuki Abe,Japan,10/15/2016,2006,TV-14,94 min,"Action & Adventure, Anime Features, Sci-Fi & Fantasy"
243,s5749,Movie,Bleach The Movie: Fade to Black,Noriyuki Abe,Japan,10/15/2016,2006,TV-PG,94 min,"Action & Adventure, Anime Features, Sci-Fi & Fantasy"
242,s5748,Movie,Berserk: The Golden Age Arc I - The Egg of the King,Toshiyuki Kubooka,Japan,10/15/2016,2006,TV-MA,77 min,"Action & Adventure, Anime Features, International Movies"
241,s5747,Movie,A Mighty Team,Thomas Sorriaux,France,10/15/2016,2006,TV-MA,97 min,"Comedies, International Movies, Sports Movies"
240,s5746,Movie,Joe Rogan: Triggered,Anthony Giordano,United States,10/21/2016,2006,TV-MA,64 min,Stand-Up Comedy
239,s5744,Movie,11 años,Roger Gual,Spain,10/27/2016,2006,TV-MA,77 min,"Dramas, International Movies"
238,s5743,Movie,West Coast,Benjamin Weill,France,10/28/2016,2006,TV-MA,81 min,"Comedies, Dramas, International Movies"
237,s5741,Movie,They Are Everywhere,Yvan Attal,France,10/28/2016,2006,TV-MA,110 min,"Comedies, International Movies"
236,s5740,Movie,The African Doctor,Julien Rambaldi,France,10/28/2016,2006,TV-18,94 min,"Comedies, Dramas, International Movies"
235,s5739,Movie,Into the Inferno,Werner Herzog,United Kingdom,10/28/2016,2006,TV-PG,107 min,Documentaries
234,s5738,Movie,I Am the Pretty Thing That Lives in the House,Osgood Perkins,Canada,10/28/2016,2006,TV-18,89 min,"Horror Movies, International Movies, Thrillers"
233,s5737,Movie,Pup Star,Robert Vince,Canada,10/29/2016,2006,G,92 min,"Children & Family Movies, Comedies"
232,s5736,Movie,Spanish Affair 6,Emilio Martínez Lázaro,Spain,11/1/2016,2006,TV-MA,107 min,"Comedies, International Movies, Romantic Movies"
231,s5734,Movie,Norman Lear: Just Another Version of You,"Heidi Ewing, Rachel Grady",United States,11/1/2016,2006,TV-MA,91 min,Documentaries
230,s5727,Movie,A Grand Night In: The Story of Aardman,Richard Mears,United Kingdom,11/1/2016,2006,TV-PG,59 min,Documentaries
229,s5726,Movie,The Ivory Game,"Kief Davidson, Richard Ladkani",Austria,11/4/2016,2006,TV-18,112 min,Documentaries
228,s5725,Movie,"Dana Carvey: Straight White Male, 64",Marcus Raboy,United States,11/4/2016,2006,TV-MA,64 min,Stand-Up Comedy
227,s5722,Movie,Kathleen Madigan: Bothering Jesus,Lorene Machado,United States,11/10/2016,2006,TV-MA,71 min,Stand-Up Comedy
226,s5735,Movie,Santa Pac's Merry Berry Day,Moto Sakakibara,Not Given,11/1/2016,2006,TV-Y,44 min,Movies
225,s5720,Movie,True Memoirs of an International Assassin,Jeff Wadlow,United States,11/11/2016,2006,TV-18,98 min,"Action & Adventure, Comedies"
224,s5719,Movie,Mumbai Cha Raja,Manjeet Singh,India,11/15/2016,2006,TV-MA,77 min,"Dramas, Independent Movies, International Movies"
223,s5715,Movie,Divines,Houda Benyamina,France,11/18/2016,2006,TV-MA,107 min,"Dramas, Independent Movies, International Movies"
222,s5714,Movie,Colin Quinn: The New York Story,Jerry Seinfeld,United States,11/18/2016,2006,TV-MA,62 min,Stand-Up Comedy
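Column names are part of the metadata a crawler records, so it can help to look at the header of one file before crawling. The following is a minimal local sketch, not part of the crawler itself: it writes the header and first row of the sample above to a temporary file (the temporary file and the variable names are assumptions made for illustration) and lists the header fields. Note that the first header field is empty, because the file carries an unnamed index column.

```shell
# Write the header and first data row of the sample to a temporary file.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
,show_id,type,title,director,country,date_added,release_year,rating,duration,listed_in
307,s5618,Movie,Happy New Year,Farah Khan,India,2/1/2017,2006,TV-14,179 min,"Action & Adventure, Comedies, Dramas"
EOF

# Count the comma-separated fields in the header line only.
column_count="$(head -n 1 "$sample" | awk -F',' '{ print NF }')"
echo "header fields: $column_count"

# Print one header field per line, labelling the empty first field.
head -n 1 "$sample" | tr ',' '\n' | sed 's/^$/(unnamed index column)/'
```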
The following steps show how we use the crawler on our dataset.
We first create an S3 bucket with a folder into which we will upload our dataset. We can do this using the two commands given below:
aws s3api create-bucket --bucket educative-3213
aws s3api put-object --bucket educative-3213 --key Movies/ --content-length 0
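The first command creates an S3 bucket called educative-3213, while the second command creates a Movies folder within educative-3213. If the AWS CLI is configured for a Region other than us-east-1, the first command also needs the --create-bucket-configuration LocationConstraint=<region> option.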
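Because S3 bucket names are shared across all AWS accounts, the literal name educative-3213 may already be taken. One possible way to get a name that is likely to be free is to derive it from the current time and the shell's process ID. This is only a sketch: the BUCKET variable and the naming scheme are assumptions, and uniqueness is likely rather than guaranteed.

```shell
# Hypothetical naming scheme: fixed prefix, Unix timestamp, and shell process ID.
# Bucket names may only contain lowercase letters, digits, and hyphens here.
BUCKET="educative-$(date +%s)-$$"
echo "$BUCKET"
```

The variable could then replace the literal name in each command, for example aws s3api create-bucket --bucket "$BUCKET".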
Next, we upload our dataset to the Movies folder in the S3 bucket using the following command:
aws s3 cp movies s3://educative-3213/Movies/ --recursive
The --recursive flag makes the command apply to all files and folders within the given directory, which, in our case, are all the files and folders inside our local movies folder.
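Before uploading, it can be useful to check locally what a recursive copy would include. The sketch below builds a small stand-in directory and counts its subfolders and CSV files with find. The layout is an assumption for illustration only: the year folder names and the data.csv file name are hypothetical and are not taken from the real dataset.

```shell
# Build a stand-in for the local movies folder (hypothetical layout:
# one subfolder per release year, each holding one CSV file).
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/movies/2006" "$demo_dir/movies/2007" "$demo_dir/movies/2008"
touch "$demo_dir/movies/2006/data.csv" \
      "$demo_dir/movies/2007/data.csv" \
      "$demo_dir/movies/2008/data.csv"

# Count the immediate subfolders (one per year in this layout).
folder_count="$(find "$demo_dir/movies" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')"

# Count every CSV file at any depth, which is what a recursive copy would pick up.
file_count="$(find "$demo_dir/movies" -type f -name '*.csv' | wc -l | tr -d ' ')"

echo "folders: $folder_count, csv files: $file_count"
```

Running the same two find commands against the real movies folder gives numbers that can later be compared with what appears in the bucket.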
After running the commands above, we can see the S3 bucket containing a Movies folder with all our data.
The crawler requires a database to write its output to; the metadata it collects is stored in a table inside this database.
In AWS Glue, we create a database named crawler-metadata-educative using the following command:
aws glue create-database --database-input "{\"Name\":\"crawler-metadata-educative\", \"LocationUri\":\"s3://educative-3213/Movies/\"}"
After running the command above, we can see a new, empty database on the “AWS Glue > Data Catalog > Databases” page, which we can reach by going to the AWS Glue homepage and clicking “Databases” in the sidebar. The database's location points to the Movies folder in the bucket we created earlier, primarily for monitoring purposes.
For the crawler to access the S3 bucket, it needs several permissions. We grant them through an IAM role.
An AWS Identity and Access Management (IAM) role is an identity that carries a specific set of permissions. AWS services can assume a role to temporarily gain the permissions defined by the permissions policies attached to it. The services that are allowed to assume the role are defined by the trust policy attached to it.
Every IAM role requires a trust policy, which specifies the principals (here, AWS services) that can assume the role. We use the following trust policy for our role.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
In the policy above, we specify that the sts:AssumeRole action is allowed for the Glue service (glue.amazonaws.com), so only Glue can assume this role.
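The role-creation step reads this policy from a file named trust.json. A minimal sketch for creating that file from the shell is shown here, assuming the current directory is writable; the final grep is only a quick sanity check and not a full JSON validation.

```shell
# Write the trust policy to trust.json in the current directory.
# The quoted 'EOF' prevents the shell from expanding anything inside the policy.
cat > trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Confirm the file mentions the Glue service principal.
grep -q '"glue.amazonaws.com"' trust.json && echo "trust.json written"
```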
The role also needs permissions policies attached to it, so that it has access to the resources it requires.
The following three commands create our IAM role and attach the required policies.
aws iam create-role --role-name AWSGlueServiceRoleEduc --assume-role-policy-document file://trust.json
aws iam attach-role-policy \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole \
    --role-name AWSGlueServiceRoleEduc
aws iam attach-role-policy \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess \
    --role-name AWSGlueServiceRoleEduc
The first command creates an IAM role named AWSGlueServiceRoleEduc with the trust policy written in the trust.json file. The second command attaches the “AWSGlueServiceRole” permissions policy to the role, which gives it access to several required services, while the third command attaches the “AmazonS3FullAccess” policy, which gives the role further access to S3 buckets.
After running the commands above, we can find AWSGlueServiceRoleEduc among the roles on the “IAM > Roles” page, which we can reach by going to the IAM homepage and clicking “Roles” in the sidebar.
With the steps above complete, we now create the crawler using the following command.
aws glue create-crawler \
    --name movies-crawler-educative --role AWSGlueServiceRoleEduc \
    --targets '{"S3Targets": [{"Path": "s3://educative-3213/Movies/"}]}' \
    --database-name crawler-metadata-educative
With the command above, we create a crawler named movies-crawler-educative. We give it the location of our Movies folder in the S3 bucket as the data source; this tells the crawler which data to collect metadata for. We also specify crawler-metadata-educative as the output database.
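If the bucket name differs from educative-3213, the targets JSON has to change with it. One way to avoid editing the JSON by hand is to build it from variables with printf. This is a sketch only: BUCKET, PREFIX, and TARGETS are names assumed here for illustration.

```shell
# Values to substitute; change BUCKET to match the bucket actually created.
BUCKET="educative-3213"
PREFIX="Movies/"

# Build the S3 targets JSON expected by the --targets option.
TARGETS="$(printf '{"S3Targets": [{"Path": "s3://%s/%s"}]}' "$BUCKET" "$PREFIX")"
echo "$TARGETS"
```

The result could then be passed to the crawler command as --targets "$TARGETS".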
After running the above command, we find our crawler on the “AWS Glue > Crawlers” page with its state shown as “Ready”. We can reach this page by going to the AWS Glue homepage and clicking “Crawlers” in the sidebar.
Once the setup is complete, we run our crawler using the following command.
aws glue start-crawler --name movies-crawler-educative
When this command is run, the “AWS Glue > Crawlers” page shows the crawler movies-crawler-educative in the “Running” state. After some time, it changes to the “Stopping” state. Under “Table changes”, it should show “1 created”, meaning that the crawler created a table during this run.
The final state of the crawler will be “Ready”, with “Last run” showing “Succeeded”.
On the movies-crawler-educative page, we see under “Table changes” that the crawler has created 1 new table and has also identified 13 partitions.
The crawler has saved the metadata in the database we created and passed to it. It has created a new table named movies within the crawler-metadata-educative database. The number of partitions in this table can be checked using the following command.
aws glue get-partitions \
    --database-name crawler-metadata-educative --table-name movies \
    --query 'length(Partitions[])'
The table holds several pieces of information about our Movies data. It lists all the identified partitions, along with other details about our data, on the “AWS Glue > Data Catalog > Databases > Tables > movies” page, which we can reach by going to the AWS Glue homepage, clicking “Tables” in the sidebar, and then choosing the movies table.
However, if we run the crawler again, no new table is created, because the structure of our data and its other metadata remain unchanged.
To run the commands above, first configure the AWS CLI with your AWS access key ID and secret access key (for example, with aws configure). If you don’t have these keys, follow the steps under the “Managing access keys (console)” heading in the AWS IAM documentation to generate them.
Note: Keep the following in mind.
In the commands above, change the name of the bucket to make it globally unique. Every command that uses the bucket's name should reflect this change.
After starting the crawler, wait for its state to change to “Ready” before running the last command. This usually takes 2 to 3 minutes.
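Instead of watching the console, the wait can be scripted by polling the crawler's state with aws glue get-crawler. The function below is a minimal sketch rather than an official helper: wait_for_crawler, its retry count, and its delay are assumptions chosen for illustration, and the function only reports whether the crawler became READY, not whether its last run succeeded.

```shell
# Poll a crawler until its state is READY.
# Arguments: crawler name, maximum number of checks (default 30),
#            seconds to sleep between checks (default 10).
# Returns 0 once the state is READY, or 1 if the checks run out.
wait_for_crawler() {
  name="$1"
  max_tries="${2:-30}"
  delay="${3:-10}"
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    # Crawler.State is one of READY, RUNNING, or STOPPING.
    state="$(aws glue get-crawler --name "$name" --query 'Crawler.State' --output text)"
    echo "crawler state: $state"
    if [ "$state" = "READY" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

After starting the crawler, calling wait_for_crawler movies-crawler-educative blocks until the crawler is ready or the checks are exhausted, at which point the partition count command can be run.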
The AWS Glue crawler is a useful tool for extracting and storing the metadata of a dataset. It stores the required information in an organized manner and, when run again, can detect changes to the structure and partitions of the data.