You will learn to:
Use PySpark to build distributed computing projects.
Implement the Apriori algorithm for mining frequent itemsets.
Skills
Data Science
Distributed Architecture
Data Mining
Prerequisites
Intermediate Python coding skills
Familiarity with distributed computing concepts
Basic working knowledge of PySpark
Technologies
Python
PySpark
Project Description
Suppose we run a grocery store and have collected a large volume of point-of-sale data. We want items that are frequently bought together to sit on nearby shelves, which can boost sales and make shopping more convenient. To find these sets, we can use the Apriori algorithm. It’s much faster than brute-force enumeration of every item combination because it only extends itemsets that are already known to be frequent, and it can be implemented in a distributed computing setting.
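To make the idea concrete, here is a minimal single-machine sketch of Apriori in plain Python. It is not the project's code: the function name `apriori`, the `min_support` parameter (an absolute transaction count), and the toy baskets are all assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions.

    Hypothetical helper for illustration; min_support is an absolute count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for basket in transactions:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(sub) in current for sub in combinations(union, k - 1)):
                candidates.add(union)

        # Count step: keep candidates that meet the support threshold.
        current = {}
        for cand in candidates:
            n = sum(1 for basket in transactions if cand <= basket)
            if n >= min_support:
                current[cand] = n
        frequent.update(current)
        k += 1
    return frequent


# Toy point-of-sale data (made up for this sketch).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "milk", "eggs"},
]
result = apriori(baskets, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With this toy data, the triple of bread, butter, and milk is never counted against the baskets: the pair of milk and butter is infrequent, so the prune step discards the triple first. That pruning is where the savings over brute force come from.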
We’ll first write the Python code that processes dataset partitions in parallel at the worker nodes. We’ll then write the final, central itemset frequency check performed by the master node. The resulting code can be run on a compute cluster for the full distributed computing experience.
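The worker-then-master flow described here resembles a two-phase partitioned scheme, sketched below in plain Python with ordinary lists standing in for cluster partitions. Whether the project follows exactly this scheme is an assumption. The names `local_frequent_itemsets`, `two_phase_frequent_itemsets`, `support_fraction`, and `max_size` are hypothetical, and brute-force counting stands in for Apriori candidate generation on the workers to keep the sketch short.

```python
import math
from itertools import combinations


def local_frequent_itemsets(partition, min_count, max_size=3):
    """Worker-side step (hypothetical): itemsets frequent within one partition.

    Uses brute-force counting up to max_size items for brevity.
    """
    counts = {}
    for basket in partition:
        items = sorted(basket)
        for k in range(1, min(max_size, len(items)) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s for s, c in counts.items() if c >= min_count}


def two_phase_frequent_itemsets(partitions, support_fraction):
    """Master-side driver (hypothetical): union local candidates, then recount globally."""
    total = sum(len(p) for p in partitions)

    # Phase 1: each partition proposes candidates using a threshold scaled to its size.
    # An itemset that is frequent overall must be frequent in at least one partition.
    candidates = set()
    for p in partitions:
        local_min = max(1, math.ceil(support_fraction * len(p)))
        candidates |= local_frequent_itemsets(p, local_min)

    # Phase 2: count every candidate over the full dataset and apply the global threshold.
    global_min = math.ceil(support_fraction * total)
    result = {}
    for cand in candidates:
        n = sum(1 for p in partitions for basket in p if cand <= set(basket))
        if n >= global_min:
            result[cand] = n
    return result


# Two made-up partitions of the same toy point-of-sale data.
partitions = [
    [{"bread", "milk"}, {"bread", "butter", "milk"}, {"bread", "butter"}],
    [{"milk", "eggs"}, {"bread", "butter", "milk", "eggs"}],
]
result = two_phase_frequent_itemsets(partitions, support_fraction=0.5)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Phase 1 can over-propose (the small second partition nominates almost everything), but the global recount in phase 2 removes those false positives. In PySpark, phase 1 would plausibly map onto a per-partition transformation such as `RDD.mapPartitions`, with phase 2 as a second pass over the data; that mapping is an assumption, not something this page specifies.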
Project Tasks
1. Getting Started
Task 0: Introduction
Task 1: Import the Libraries and Set Up the Environment
2. Distributed Combination Generation
Task 2: Generate Combinations—Parent Intersection Property
Task 3: Generate Combinations—Subset Frequency Property
Task 4: Count Check
Task 5: Generate k-Size Combinations
Task 6: Generate Singles
Task 7: The Worker Partition Mapper
3. Filtering at the Master Node
Task 8: Load Data and Preprocess
Task 9: The Distributed Transform
Task 10: Auxiliary Function to Check Presence
Task 11: Count Check at Master
Congratulations!
Relevant Courses
Use the following content to review prerequisites or explore specific concepts in detail.