Staging Data
Getting data from Kaggle to Spark clusters.
PySpark workflows
Data is essential for PySpark workflows. Spark supports a variety of ways to read datasets, from connecting to data lakes and data warehouses to loading sample datasets bundled with libraries, such as the Boston housing dataset. Since the theme of this course is building scalable pipelines, we’ll focus on data layers that work with distributed workflows.
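For contrast, the short sketch below shows the library route mentioned above, which works for small local experiments but not for distributed pipelines. It is only a sketch: it assumes a local SparkSession with pyspark and scikit-learn installed, and it uses the California housing data because the Boston dataset has been removed from recent scikit-learn releases.

# Sketch of the "sample dataset from a library" route (not the focus of
# this course). Assumes pyspark and scikit-learn are installed locally.
from pyspark.sql import SparkSession
from sklearn.datasets import fetch_california_housing

spark = SparkSession.builder.appName("sample-data").getOrCreate()

# Load a small in-memory dataset with scikit-learn; California housing is
# used here because the Boston dataset is no longer shipped with scikit-learn.
housing = fetch_california_housing(as_frame=True)

# Hand the pandas dataframe to Spark so it can be used in a pipeline.
housing_df = spark.createDataFrame(housing.frame)
housing_df.printSchema()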
To get started with PySpark, we’ll stage input data for a model pipeline on S3 and then read in the dataset as a Spark dataframe.
This lesson demonstrates how to stage data to S3, set up credentials for accessing the data from Spark, and fetch the data from S3 into a Spark dataframe.
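As a preview of where this lesson ends up, here is a minimal sketch of reading a staged file from S3 into a Spark dataframe. The bucket name matches the one created in the next step, but the object key (csv/games.csv) and the environment-variable credentials are assumptions for illustration; a managed cluster with an IAM instance role would not need the credential settings, and the s3a connector (hadoop-aws) must be available on the cluster.

import os
from pyspark.sql import SparkSession

# Build a SparkSession that can read from S3 via the s3a:// connector.
# Credentials are pulled from environment variables here; on a managed
# cluster an IAM instance role usually makes these two settings unnecessary.
spark = (
    SparkSession.builder
    .appName("staging-data")
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    .getOrCreate()
)

# Read the staged CSV into a Spark dataframe; the key is a placeholder
# for wherever the Kaggle file is uploaded.
games_df = spark.read.csv("s3a://dsp-ch6/csv/games.csv", header=True, inferSchema=True)
games_df.show(5)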
Setting up an S3 bucket
The first step is to set up an S3 bucket to store the dataset we want to load. To perform this step, run the following commands from the command line.
aws s3api create-bucket --bucket dsp-ch6 --region us-east-1
aws s3 ls
...
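With the bucket in place, the downloaded Kaggle file can be copied into it from the command line. This is a sketch: the local filename and the csv/ prefix are placeholders for however the download is named and organized.

# Upload the downloaded Kaggle CSV to the new bucket (paths are placeholders)
aws s3 cp games.csv s3://dsp-ch6/csv/games.csv

# Confirm the object landed where we expect
aws s3 ls s3://dsp-ch6/csv/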