Solution: Customer Churn Analysis Using PySpark
This lesson presents the solution to the customer churn analysis and prediction exercise using PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, count, corr, avg
spark = SparkSession.builder.getOrCreate()

# Task 1: Loading customer data into a PySpark DataFrame
print("Reading 'churn.csv' into telco_df:")
telco_df = spark.read.csv("./churn.csv", header=True, inferSchema=True)
print("First 2 rows of telco_df:")
telco_df.show(2, truncate=False, vertical=True)
print("Schema of telco_df:")
telco_df.printSchema()

# Task 2: Preprocessing and transformation of data
churn_count = telco_df.filter(col("Churn Value") == 1).count()
print("Counting the number of churned customers:", churn_count)
print("Computing the average monthly charges by gender:")
telco_df.groupBy("Gender").avg("Monthly Charges").show()
print("Creating a new column 'Total Charges' by multiplying Monthly Charges and Tenure Months:")
telco_df = telco_df.withColumn("Total Charges", col("Monthly Charges") * col("Tenure Months"))
telco_df.show(2, vertical=True)
print("Computing the correlation between Monthly Charges and Total Charges:")
telco_df.select(corr(col("Monthly Charges"), col("Total Charges"))).show()

# Task 3: Exploratory data analysis (EDA)
print("Calculating the churn rates by contract type:")
telco_df.groupBy("Contract").agg((sum("Churn Value") / count("Churn Value")).alias("Churn Rate")).show()
print("Calculating the average tenure by churn value:")
telco_df.groupBy("Churn Value").agg(avg("Tenure Months").alias("Average Tenure")).show()
print("Calculating the churn rates by payment method:")
telco_df.groupBy("Payment Method").agg((sum("Churn Value") / count("Churn Value")).alias("Churn Rate")).show()
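To see what the Task 3 aggregation is doing, here is a small self-contained sketch. The toy DataFrame below is made up for illustration (it is not the churn.csv data); it shows how summing a 0/1 churn flag and dividing by the group size yields a per-group churn rate:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, count

spark = SparkSession.builder.getOrCreate()

# Toy data (hypothetical, not from churn.csv): contract type and a 0/1 churn flag
toy_df = spark.createDataFrame(
    [("Month-to-month", 1), ("Month-to-month", 0), ("Two year", 0), ("Two year", 0)],
    ["Contract", "Churn Value"],
)

# Same pattern as Task 3: summing a 0/1 flag and dividing by the group count
# gives the fraction of churned customers within each contract type
toy_df.groupBy("Contract").agg(
    (sum("Churn Value") / count("Churn Value")).alias("Churn Rate")
).show()
# Month-to-month -> 0.5, Two year -> 0.0

Because the churn flag is 0 or 1, the sum counts churned customers, so dividing by the group's row count gives the rate within that group.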
Here’s a breakdown of what’s happening:
Task 1: Loading customer data into a PySpark DataFrame
- Lines 1–3: Import the necessary libraries and create a SparkSession using the builder pattern.
- Line 7: Read the CSV file "churn.csv" into a DataFrame named telco_df using the read.csv() method of the spark object. The header parameter is set to True to treat the first row of the file as column names, and inferSchema is set to True so that Spark infers each column's data type.
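As a side note, inferSchema=True makes Spark scan the data to guess column types, which costs an extra pass over the file. When the types are known up front, an explicit schema can be passed to read.csv() instead. A minimal sketch, assuming (for illustration only) a hypothetical file containing just four of the columns used in the solution; the real churn.csv has more columns, and a CSV schema is matched positionally, so in practice every column must be listed:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Assumed column types for a hypothetical four-column file
schema = StructType([
    StructField("Gender", StringType(), True),
    StructField("Tenure Months", IntegerType(), True),
    StructField("Monthly Charges", DoubleType(), True),
    StructField("Churn Value", IntegerType(), True),
])

# header=True still skips the first row; no schema-inference pass is needed
df = spark.read.csv("./four_columns.csv", header=True, schema=schema)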