Amazon Athena
Understand the basics of Amazon Athena and how to integrate it with AWS Glue to query data in Amazon S3.
Amazon Athena is a SQL query service for data stored in Amazon S3. Launched in 2016, Athena is based on Presto, an open-source SQL query engine.
Athena doesn’t require loading data outside of S3, though there is some schema setup required to properly query the data (similar to Redshift Spectrum).
Using Amazon Athena with AWS Glue
One of the faster ways to use Amazon Athena is through its integration with AWS Glue. Specifically, Athena can query databases and tables that have schemas (metadata definitions) stored in the AWS Glue Data Catalog.
Note: If you already have tables in the AWS Glue Data Catalog, you can jump ahead to the section Opening Athena from AWS Glue.
Setting up AWS Glue in our S3 account
Below is our dwarf_activities.csv
example file that we’ll upload to our S3 account.
full_name,current_activity,height_in_cmGrumpy Dwarf,watching television,89Sleepy Dwarf,sleeping in a hammock,82Sneezy Dwarf,drinking soup,84Doc Dwarf,eating an apple,77Bashful Dwarf,hiding at home,81Dopey Dwarf,reading a magazine,79Happy Dwarf,walking outside,88
We want to upload this CSV file to a uniquely named Amazon S3 bucket.
Our AWS account has a bucket “demo-s3-data-lake-bucket.” Within this bucket, we create a folder “dwarf_activities.” We upload the dwarf_activities.csv
file into this folder.
We then navigate to the “AWS Glue” area of the AWS Console and click the “Crawlers” page (under the “Data Catalog” section of the left-side menu). Click the “Create crawler” button to create a new crawler for a data source.
Here are our example crawler configurations:
Step 1: Set crawler properties.
Name: “dwarf_activities”
Step 2: ...