Amazon EMR

Learn how to develop and run an Apache Spark application on Amazon EMR Serverless and understand other ways of using Amazon EMR.

We'll cover the following...

Running Apache Spark on EMR Serverless
Other ways of using Amazon EMR
Cleaning up

Amazon EMR (formerly known as Elastic MapReduce) is designed to run open-source big-data tools, including Apache Spark, Apache Hive, Presto, and more.

Launched in 2009, EMR initially supported running Apache Hadoop on Amazon EC2 and Amazon S3. For those familiar with open-source big-data tools, EMR is Amazon’s solution for running those tools on the AWS platform. In addition, EMR supports open-source ML frameworks, including TensorFlow, Apache MXNet, and Apache Spark MLlib.

Running Apache Spark on EMR Serverless

Launched in 2022, EMR Serverless makes it simpler to run big-data applications on Amazon EMR because there are no clusters to configure or manage. Let’s see how we can run an Apache Spark application on EMR Serverless.
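
As a rough preview, submitting a PySpark script to an EMR Serverless application could be sketched with the boto3 library as shown below. This is a minimal sketch under stated assumptions, not the lesson's own method: it assumes an EMR Serverless application and an IAM execution role already exist, and the application ID, role ARN, region, and the script name spark_demo.py are hypothetical placeholders. Only the bucket path s3://demo-emr-bucket/dwarf_activities.csv comes from the application developed in the next section.

```python
def build_job_run_request(application_id, execution_role_arn, script_uri, csv_uri):
    """Assemble the keyword arguments for the EMR Serverless StartJobRun call."""
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_uri,          # PySpark script stored in S3
                "entryPointArguments": [csv_uri],  # passed to the script as sys.argv[1]
            }
        },
    }


def submit_job(request, region_name):
    """Send the request with boto3 (needs `pip3 install boto3` and AWS credentials)."""
    import boto3  # imported lazily so the request can be built without boto3

    client = boto3.client("emr-serverless", region_name=region_name)
    response = client.start_job_run(**request)
    return response["jobRunId"]


# Placeholder values for illustration only (assumptions, not real resources)
request = build_job_run_request(
    application_id="00example1234567",
    execution_role_arn="arn:aws:iam::123456789012:role/example-emr-serverless-role",
    script_uri="s3://demo-emr-bucket/spark_demo.py",
    csv_uri="s3://demo-emr-bucket/dwarf_activities.csv",
)
print(request["jobDriver"]["sparkSubmit"]["entryPointArguments"])
```

Building the request separately from sending it lets us inspect the parameters without touching AWS. Calling submit_job(request, region_name) with a real region and real resource identifiers would start the job run and return its ID.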

Note: While this lesson is about Amazon EMR, open-source tools such as Apache Spark also run outside of the Amazon Web Services ecosystem.

Developing an Apache Spark application

The Python code below is an Apache Spark application that can also run outside of Amazon EMR, for example on a local machine with the pyspark library installed. The code accepts the path to a CSV file as an optional argument and defaults to dwarf_activities.csv. Using pyspark, it creates a Spark session and then counts both the total number of dwarfs in the CSV file and the number of dwarfs taller than a height threshold (80 cm).


import sys
from pyspark.sql import SparkSession # pip3 install pyspark
# get the path to the CSV file (which could be a parameter to the script)
if len(sys.argv) > 1:
  csvFile = sys.argv[1] # for example: s3://demo-emr-bucket/dwarf_activities.csv
else:
  csvFile = "dwarf_activities.csv"
# read and process the CSV file using PySpark (the Python API for Apache Spark)
spark = SparkSession.builder.appName("spark-demo").getOrCreate()
data = spark.read.csv(csvFile, header=True)
countTotalDwarfs = data.count()
# use SQL to count the taller dwarfs
HEIGHT_THRESHOLD_CM = 80
data.createOrReplaceTempView("dwarf_activities") # create in-memory view to query
sqlResults = spark.sql("""SELECT count(*) AS tall_dwarfs_count FROM dwarf_activities 
                          WHERE height_in_cm > """ + str(HEIGHT_THRESHOLD_CM))
countTallDwarfs = sqlResults.first()['tall_dwarfs_count']
# Non-SQL version:
#
# from pyspark.sql.functions import col
# countTallDwarfs = data.filter(col('height_in_cm') > HEIGHT_THRESHOLD_CM).count()
print(f"Out of {countTotalDwarfs} dwarfs, there are {countTallDwarfs} dwarfs taller than {HEIGHT_THRESHOLD_CM} cm")
spark.stop()
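
To try the application locally, we need an input file. The sketch below writes a small dwarf_activities.csv using only the Python standard library and cross-checks the two counts without Spark. The rows are made-up sample data, and the name and activity columns are assumptions; the Spark application above only relies on the height_in_cm column.

```python
import csv

HEIGHT_THRESHOLD_CM = 80

# Made-up sample rows (assumption): only height_in_cm is used by the Spark application
sample_rows = [
    {"name": "dwarf_a", "activity": "mining", "height_in_cm": 75},
    {"name": "dwarf_b", "activity": "singing", "height_in_cm": 82},
    {"name": "dwarf_c", "activity": "cooking", "height_in_cm": 80},
    {"name": "dwarf_d", "activity": "mining", "height_in_cm": 91},
]


def write_sample_csv(path, rows):
    """Write the sample rows to a CSV file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "activity", "height_in_cm"])
        writer.writeheader()
        writer.writerows(rows)


def count_dwarfs(path, threshold_cm):
    """Return (total rows, rows with height_in_cm above the threshold)."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    tall = sum(1 for row in rows if float(row["height_in_cm"]) > threshold_cm)
    return len(rows), tall


write_sample_csv("dwarf_activities.csv", sample_rows)
total, tall = count_dwarfs("dwarf_activities.csv", HEIGHT_THRESHOLD_CM)
print(f"Out of {total} dwarfs, there are {tall} dwarfs taller than {HEIGHT_THRESHOLD_CM} cm")
# prints: Out of 4 dwarfs, there are 2 dwarfs taller than 80 cm
```

Running the Spark application against the same file should report the same two counts, which gives us a quick sanity check before moving the CSV file to S3.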
