- Sklearn Streaming 2
In this second part on streaming with sklearn, we build a streaming pipeline that applies a sklearn model to records as they arrive.
To implement the streaming model pipeline, we’ll use PySpark with a Python UDF to apply model predictions as new elements arrive.
Example
A Python UDF operates on a single row, while a Pandas UDF operates on a partition of rows. The code for this pipeline is shown in the PySpark snippet below, which first trains a model on the driver node, sets up a Kafka stream as a data source, defines a UDF for applying the ML model, and then publishes the scores to a new topic as the pipeline output.
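The row-versus-partition distinction can be illustrated without Spark. In the sketch below, `score_row` and `score_batch` are hypothetical stand-ins: the first mirrors how a Python UDF is invoked (one Python call per record), while the second mirrors a Pandas UDF, which receives a whole batch of rows as a pandas Series and returns a Series.

```python
import numpy as np
import pandas as pd

# Row-at-a-time scoring: called once per record,
# analogous to a Python UDF (hypothetical example).
def score_row(x):
    return 1.0 / (1.0 + np.exp(-x))

# Vectorized scoring: called once per batch of rows,
# analogous to a Pandas UDF (hypothetical example).
def score_batch(xs: pd.Series) -> pd.Series:
    return 1.0 / (1.0 + np.exp(-xs))

values = pd.Series([-2.0, 0.0, 2.0])
row_results = values.apply(score_row)   # N Python-level calls
batch_results = score_batch(values)     # 1 vectorized call
```

Both produce identical scores; the Pandas-style version simply amortizes the Python call overhead across the whole partition, which is why Pandas UDFs tend to scale better.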
Replace `{external_ip}` with the public IP of your machine or EC2 instance.
```python
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
import json
import pandas as pd
from sklearn.linear_model import LogisticRegression

# build a logistic regression model
gamesDF = pd.read_csv("https://github.com/bgweber/Twitch/raw/master/Recommendations/games-expand.csv")
model = LogisticRegression()
model.fit(gamesDF.iloc[:, 0:10], gamesDF['label'])

# define the UDF for scoring users
def score(row):
    d = json.loads(row)
    p = pd.DataFrame.from_dict(d, orient="index").transpose()
    pred = model.predict_proba(p.iloc[:, 0:10])[0][0]
    result = {'User_ID': d['User_ID'], 'pred': pred}
    return str(json.dumps(result))

# read from Kafka
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "{external_ip}:9092") \
    .option("subscribe", "dsp").load()

# select the value field and apply the UDF
df = df.selectExpr("CAST(value AS STRING)")
score_udf = udf(score, StringType())
df = df.select(score_udf("value").alias("value"))

# write the results to Kafka
query = df.writeStream.format("kafka") \
    .option("kafka.bootstrap.servers", "{external_ip}:9092") \
    .option("topic", "preds") \
    .option("checkpointLocation", "/temp").start()
```
The script first trains a logistic regression model using data fetched from GitHub. The model object is created on the driver node, but is copied to the worker nodes when used by the UDF.
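The scoring logic inside the UDF can be checked locally before wiring it into Spark and Kafka. The sketch below trains the model on synthetic data (standing in for the games CSV, so it runs offline) and applies the same parse-then-score steps to a single JSON record; the feature names `G1`–`G10` are assumed placeholders, while `User_ID` matches the field used in the pipeline.

```python
import json
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the games dataset: 10 binary feature
# columns and a label that depends on the first two features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))
y = X[:, 0] | X[:, 1]
model = LogisticRegression()
model.fit(X, y)

# Same shape of logic as the streaming UDF: parse a JSON row,
# build a one-row DataFrame, and score the first 10 columns.
def score(row):
    d = json.loads(row)
    p = pd.DataFrame.from_dict(d, orient="index").transpose()
    pred = model.predict_proba(p.iloc[:, 0:10])[0][0]  # P(class 0)
    result = {'User_ID': d['User_ID'], 'pred': float(pred)}
    return json.dumps(result)

# A sample record with all-zero features (assumed column names).
sample_dict = {f"G{i}": 0 for i in range(1, 11)}
sample_dict["User_ID"] = 123
sample = json.dumps(sample_dict)
```

Because the score function closes over `model`, Spark serializes the fitted model with the UDF and ships it to each worker, which is exactly how the driver-trained model ends up available during streaming.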
Defining UDF
The next step is to ...