MLlib Batch Pipeline

Learn how to use PySpark's machine learning libraries to build predictive models.

Now that we’ve covered loading and transforming data with PySpark, we can use its machine learning libraries to build a predictive model.

MLlib

The core library for building predictive models in PySpark is called MLlib. This library provides a suite of supervised and unsupervised algorithms.

While MLlib does not cover every algorithm available in sklearn, it provides functionality for the majority of operations needed in data science workflows. In this section, we’ll show how to apply MLlib to a classification problem and save the outputs of the model application to a data lake.
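Before walking through each step, here is a rough sketch of what such an MLlib classification pipeline can look like end to end: assemble feature columns into a vector, fit a classifier, score a held-out DataFrame, and write the results to object storage. This is an illustration only; the column names, the train_df and test_df DataFrames, and the output path are placeholders rather than part of this lesson’s dataset.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assumes train_df and test_df are existing DataFrames with numeric feature
# columns and a binary "label" column (placeholder names, not the real schema)
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"],
                            outputCol="features")

train_vec = assembler.transform(train_df)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train_vec)

# Score the holdout set and persist the predictions to a data lake location
predictions = model.transform(assembler.transform(test_df))
predictions.write.mode("overwrite").parquet("s3://example-bucket/predictions/")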

# Load the games dataset from S3, inferring column types from the CSV header
games_df = spark.read.csv("s3://dsp-ch6/csv/games-expand.csv",
                          header=True, inferSchema=True)
games_df.createOrReplaceTempView("games_df")

# Assign a random user_id to each row and flag roughly 30% of rows as a holdout set
games_df = spark.sql("""
  select *, row_number() over (order by rand()) as user_id
       , case when rand() > 0.7 then 1 else 0 end as test
  from games_df
""")

Loading the data

The first step in the pipeline is loading the dataset that we want to use for model training. ...