Model Pipeline

In this section, we'll use GCP components with PySpark to build a model pipeline.

To read in the natality dataset, we can use the spark.read function with the Avro format. Because Spark evaluates dataframes lazily, the data is not actually retrieved until the display command is used to sample the dataset, as shown in the snippet below:

```python
# Load the natality dataset from Cloud Storage in Avro format
natality_path = "gs://dsp_model_store/natality/avro"
natality_df = spark.read.format("avro").load(natality_path)
display(natality_df)
```
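Note that display is a Databricks notebook helper; in a standalone PySpark session, natality_df.show() provides a similar preview of the dataframe.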

Data Transformation

Before we can use MLlib to build a regression model, we need to perform a few transformations on the dataset: select a subset of the features, cast data types, and split records into training and test groups. We'll also use the fillna function, as shown in the sketch below, to replace any null values in the dataframe.
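A minimal sketch of these steps follows. The column names (weight_pounds, mother_age, father_age, gestation_weeks, plurality, apgar_5min) are assumptions based on the public BigQuery natality schema, and the 80/20 split ratio is illustrative rather than prescribed:

```python
from pyspark.sql.functions import col

# Select a subset of features and cast them to numeric types.
# These column names are assumed from the BigQuery natality schema.
features_df = natality_df.select(
    col("weight_pounds").cast("double"),
    col("mother_age").cast("double"),
    col("father_age").cast("double"),
    col("gestation_weeks").cast("double"),
    col("plurality").cast("double"),
    col("apgar_5min").cast("double"),
)

# Replace any null values with zero before training.
features_df = features_df.fillna(0)

# Split records into training and test groups (illustrative 80/20 split).
train_df, test_df = features_df.randomSplit([0.8, 0.2], seed=42)
```

The fillna(0) call replaces nulls across all of the numeric columns in a single pass, and the resulting train_df can then be passed to an MLlib estimator.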