
Amazon Athena

Explore how to use Amazon Athena as a serverless SQL query service to analyze data stored in Amazon S3. Understand how to integrate Athena with AWS Glue Data Catalog for schema management and data crawling. Learn to run queries, manage query results, and utilize advanced features like Python notebooks, workgroups, and federated queries to extend Athena's capabilities for data analytics.

Amazon Athena is a SQL query service for data stored in Amazon S3. Launched in 2016, Athena is based on Presto, an open-source SQL query engine.

Athena doesn’t require loading data out of S3; it queries the data in place, though some schema setup is required before the data can be queried (similar to Redshift Spectrum).
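In practice, that schema setup means defining an external table whose columns map onto the files in S3. A minimal sketch of such a DDL statement, using hypothetical bucket, table, and column names (not the ones used later in this article):

```sql
-- Hypothetical example: define an external table over CSV files in S3.
-- Athena reads the files in place; no data is loaded or copied.
CREATE EXTERNAL TABLE logs (
    event_time string,
    user_id    string,
    action     string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/logs/'
TBLPROPERTIES ('skip.header.line.count' = '1');
```

Writing DDL like this by hand is one option; as shown next, AWS Glue can generate these table definitions automatically.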

Using Amazon Athena with AWS Glue

One of the fastest ways to get started with Amazon Athena is through its integration with AWS Glue. Specifically, Athena can query databases and tables whose schemas (metadata definitions) are stored in the AWS Glue Data Catalog.
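Once a table's schema is registered in the Glue Data Catalog, it can be queried from Athena with standard SQL by referencing the Glue database and table names. A sketch, assuming a hypothetical Glue database `my_database` containing a table `my_table` with an `action` column:

```sql
-- Query a Glue Data Catalog table from Athena.
-- "my_database", "my_table", and "action" are placeholder names.
SELECT action, COUNT(*) AS event_count
FROM my_database.my_table
GROUP BY action
ORDER BY event_count DESC;
```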

Note: If you already have tables in the AWS Glue Data Catalog, you can jump ahead to the section Opening Athena from AWS Glue.

Setting up AWS Glue in our AWS account

Below is our dwarf_activities.csv example file that we’ll upload to our S3 account.


We want to upload this CSV file to a uniquely named Amazon S3 bucket.

Our AWS account has a bucket named “demo-s3-data-lake-bucket.” Within this bucket, we create a folder named “dwarf_activities” and upload the dwarf_activities.csv file into it.

S3 folder where we uploaded the example CSV file

We then navigate to the “AWS Glue” area of the AWS Console and open the “Crawlers” page (under the “Data Catalog” section of the left-side menu). Click the “Create crawler” button to create a new crawler for a data source.

The AWS Glue page with a button to create a new crawler for a data source

Here are our example crawler configurations:

  • Step 1: Set crawler properties.

    • Name: “dwarf_activities”

  • Step 2: Choose data sources and classifiers.

    • Data source: S3

    • S3 path: ...