Feature Transformation Techniques in PySpark MLlib
Learn different feature transformation techniques in PySpark MLlib.
Feature transformation is a critical step in the ML process that involves converting features from one representation to another to make them suitable for ML algorithms. PySpark MLlib
provides a wide range of feature transformation techniques to assist in this process.
Here is an overview of some key feature transformation methods:
Tokenization
Tokenization is a feature transformation method that splits a text column into individual terms (usually words). We used this in the previous lesson, so we won’t spend much time on it here. In this example, the input column is specified as “Review,” and the output column is specified as “words.” We apply the transformation to the DataFrame using the `transform` method and store the result in the `words` DataFrame.
The example below shows how to split sentences into sequences of words using the Amazon product reviews dataset.
# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Read the dataset
print("Loading the Amazon Product Reviews dataset")
df = spark.read.csv("Amazon_Reviews.csv", header=True, inferSchema=True)

# Tokenize the text
print("Tokenizing the text")
tokenizer = Tokenizer(inputCol="Review", outputCol="words")
words = tokenizer.transform(df)

# Loop through the words DataFrame and print tokenized words
print("Tokenized words:")
for row in words.collect():
    print(row["words"])
Here’s a breakdown of the code:
- Lines 1–3: Import the required libraries, including `SparkSession` and `Tokenizer`.
- Line 6: Create a `SparkSession` using the `builder` pattern.
- Line 10: Read the dataset from the “Amazon_Reviews.csv” file.
- Lines 14–15: Tokenize the text using the `Tokenizer` Transformer.
- Lines 18–20: Print the tokenized text.
Scaling and normalization
Data normalization, or feature normalization, is a crucial technique in feature transformation. It’s employed to bring all values of a particular feature or features onto the same scale. This standardization is especially important for certain machine learning algorithms that perform optimally when features share a similar scale. It prevents a single feature from dominating the learning process due to its larger magnitude.
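To make the scale problem concrete, consider a hypothetical pair of rows with an “age” feature in the tens and an “income” feature in the tens of thousands. The plain-Python arithmetic below shows how the larger feature dominates a distance computation between the two rows.

# Hypothetical toy values: age varies over tens, income over tens of thousands
age_a, income_a = 25.0, 40000.0
age_b, income_b = 55.0, 42000.0

# Squared Euclidean distance between the two rows
squared_distance = (age_a - age_b) ** 2 + (income_a - income_b) ** 2
print(squared_distance)                         # 4000900.0
print((age_a - age_b) ** 2 / squared_distance)  # the 30-year age gap contributes only ~0.02%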
Let’s explore some of the most commonly used data normalization methods.
StandardScaler
PySpark’s `StandardScaler` is a powerful tool for transforming a dataset by normalizing each feature to have a mean of zero and a standard deviation of one. This is particularly valuable when dealing with features of varying scales because the performance of machine learning algorithms can be influenced by feature scaling. By scaling features to have a mean of zero and a standard deviation of one, we ensure that all features are treated equally during the learning process. The `StandardScaler` acts as an Estimator: it can be fit on a dataset to compute summary statistics, resulting in a `StandardScalerModel`.
To utilize the `StandardScaler`, we specify the input column containing the features to be scaled using `inputCol`, and we indicate the output column where the scaled features will be stored using `outputCol`. Once fitted, the `StandardScalerModel` can transform a Vector column in a dataset, producing scaled features.
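Before applying this to the review data, here’s a minimal, self-contained sketch of that fit/transform pattern on a hypothetical DataFrame that already has a `features` vector column; with `withMean=False`, each feature is simply divided by its column’s standard deviation.

# A minimal sketch of the Estimator -> Model -> transform pattern (toy data)
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StandardScaler

spark = SparkSession.builder.getOrCreate()

toy_df = spark.createDataFrame(
    [(Vectors.dense([1.0, 200.0]),),
     (Vectors.dense([2.0, 400.0]),),
     (Vectors.dense([3.0, 600.0]),)],
    ["features"],
)

# Fitting the Estimator computes the per-feature summary statistics
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
scalerModel = scaler.fit(toy_df)

# The resulting StandardScalerModel scales each feature to unit standard deviation
scalerModel.transform(toy_df).show(truncate=False)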
Let’s see the `StandardScaler` in action using the Amazon product reviews dataset.
Note: Before we demonstrate the `StandardScaler`, we’ll perform the necessary data preparation steps, which include tokenization and feature transformation using `HashingTF` and `IDF`.
# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StandardScaler

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Read the dataset
print("Loading the Amazon Product Reviews dataset")
df = spark.read.csv("Amazon_Reviews.csv", header=True, inferSchema=True)

# Tokenize the text
print("Tokenizing the text")
tokenizer = Tokenizer(inputCol="Review", outputCol="words")
words = tokenizer.transform(df)

# Apply the `HashingTF` Transformer
print("Applying HashingTF transformation to convert words into features")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(words)

# Apply IDF to rescale the raw TF
print("Applying IDF to rescale the raw features")
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)

# Transform the featurized data
print("Transforming the rescaled featured data")
rescaledData = idfModel.transform(featurizedData)

# StandardScaler
print("StandardScaler normalization of the rescaled features")
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
scalerModel = scaler.fit(rescaledData)
scaledData = scalerModel.transform(rescaledData)

# Show the first row of scaled data
scaledData.show(1)
Here’s a breakdown of the code:
- Lines 2–6: Import the necessary modules and create a SparkSession.
- Line 10: Read the dataset of Amazon reviews from the “Amazon_Reviews.csv” file.
- Lines 14–15: Tokenize the sentences by splitting the review text into individual words using the `Tokenizer` Transformer.