Amazon EMR

Learn how to develop and run an Apache Spark application on Amazon EMR Serverless and understand other ways of using Amazon EMR.

Amazon EMR (formerly known as Elastic MapReduce) is designed to run open-source big-data tools, including Apache Spark, Apache Hive, Presto, and more.

Launched in 2009, EMR initially supported running Apache Hadoop on Amazon EC2 and Amazon S3. For those familiar with open-source big-data tools, EMR is Amazon’s solution for running those tools on the AWS platform. In addition, EMR supports open-source ML frameworks including TensorFlow, Apache MXNet, and Apache Spark MLlib.

Running Apache Spark on EMR Serverless

Launched in 2022, EMR Serverless is designed to make it simpler to run various big data applications on Amazon EMR. Let’s see how we can run an Apache Spark application on EMR Serverless.

Note: While this lesson is about Amazon EMR, open-source tools such as Apache Spark also run outside of the Amazon Web Services ecosystem.

Developing an Apache Spark application

The Python code below is an Apache Spark application that runs outside of Amazon EMR. It takes a CSV file, dwarf_activities.csv, as input. Using the pyspark library, the code creates a new Spark session and then counts both the total number of dwarfs in the CSV file and the number of tall dwarfs.
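As a rough illustration of the application described above, here is a minimal sketch rather than the lesson's exact listing. The layout of dwarf_activities.csv is not shown in this lesson, so the `height_cm` column, the 120 cm cutoff for "tall", and the `count_dwarfs` helper are all assumptions made for the example. Only standard pyspark calls are used (`SparkSession.builder`, `spark.read.csv`, `DataFrame.filter`, `DataFrame.count`).

```python
import sys


def count_dwarfs(df, tall_threshold_cm=120):
    """Return (total rows, rows taller than the threshold).

    Assumes one row per dwarf and a numeric column named height_cm;
    both the column name and the default threshold are hypothetical.
    """
    total = df.count()
    # DataFrame.filter accepts a SQL expression string.
    tall = df.filter(f"height_cm > {tall_threshold_cm}").count()
    return total, tall


def main(csv_path):
    # Imported here so count_dwarfs can be used without a Spark install.
    from pyspark.sql import SparkSession

    # Create (or reuse) a Spark session for this application.
    spark = SparkSession.builder.appName("dwarf-count").getOrCreate()
    try:
        # Read the CSV, using its first line as the header and
        # letting Spark infer column types.
        df = spark.read.csv(csv_path, header=True, inferSchema=True)
        total, tall = count_dwarfs(df)
        print(f"Total dwarfs: {total}")
        print(f"Tall dwarfs: {tall}")
    finally:
        spark.stop()


# Run only when exactly one argument (the CSV path) is passed.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Assuming the script is saved under a name such as dwarf_count.py (also hypothetical), it could be run locally with `python dwarf_count.py dwarf_activities.csv` once pyspark is installed. Keeping the counting logic in its own function leaves the session setup separate from the part that would stay the same when the job later runs on EMR Serverless.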
