...

Feature Transformation Techniques in PySpark MLlib

Learn different feature transformation techniques in PySpark MLlib.

Feature transformation is a critical step in the ML process that involves converting features from one representation to another to make them suitable for ML algorithms. PySpark MLlib provides a wide range of feature transformation techniques to assist in this process.

Here is an overview of some key feature transformation methods:

Tokenization

Tokenization is the feature transformation method in which we split a text column into individual terms (usually words). We used this in the previous lesson, so we won’t spend much time on it here. The input column is specified as “Review,” and the output column is specified as “words.” We apply the transformation to the DataFrame using the transform method and store the result in the words DataFrame.

The example below shows how to split sentences into sequences of words using the Amazon product reviews dataset.

# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Read the dataset
print("Loading the Amazon Product Reviews dataset")
df = spark.read.csv("Amazon_Reviews.csv", header=True, inferSchema=True)

# Tokenize the text
print("Tokenizing the text")
tokenizer = Tokenizer(inputCol="Review", outputCol="words")
words = tokenizer.transform(df)

# Loop through the words DataFrame and print tokenized words
print("Tokenized words:")
for row in words.collect():
    print(row["words"])

Here’s a breakdown of the code:

  • Lines 1–3: Import the required libraries, including SparkSession and Tokenizer.
  • Line 6: Create a SparkSession using the builder pattern.
  • Line 10: Read the dataset from the “Amazon_Reviews.csv” file.
  • Lines 14–15: Tokenize the text using the Tokenizer Transformer.
  • Lines 18–20: Print the tokenized text.
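
To see what this produces without the CSV file, here is a minimal, self-contained sketch that runs the same Tokenizer on a couple of made-up sentences (the toy_df name and the example reviews are illustrative, not rows from the lesson’s dataset):

# Minimal Tokenizer sketch on hypothetical in-memory data
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer

spark = SparkSession.builder.getOrCreate()

# Two made-up reviews, just to illustrate the output structure
toy_df = spark.createDataFrame(
    [("Great sound quality for the price",),
     ("Battery stopped working after two weeks",)],
    ["Review"],
)

# Tokenizer lowercases the text and splits it on whitespace
toy_words = Tokenizer(inputCol="Review", outputCol="words").transform(toy_df)
toy_words.show(truncate=False)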

Scaling and normalization

Data normalization, or feature normalization, is a crucial technique in feature transformation. It’s employed to bring all values of a particular feature or features onto the same scale. This standardization is especially important for certain machine learning algorithms that perform optimally when features share a similar scale. It prevents a single feature from dominating the learning process due to its larger magnitude.

Let’s explore some of the most commonly used data normalization methods.

StandardScaler

PySpark’s StandardScaler is a powerful tool for transforming a dataset by scaling each feature to unit standard deviation and, optionally, centering it to zero mean. This is particularly valuable when dealing with features of varying scales because the performance of many machine learning algorithms is influenced by feature scaling. Standardizing the features ensures that they are all treated equally during the learning process and prevents any one of them from dominating simply because of its magnitude. The StandardScaler tool acts as an Estimator: it is fit on a dataset to compute summary statistics, resulting in a StandardScalerModel.
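
For example, a feature whose values are 2, 4, and 6 has a mean of 4 and a sample standard deviation of 2, so scaling it with both withMean and withStd enabled maps those values to -1, 0, and 1; with the default withMean=False, only the division by the standard deviation is applied, giving 1, 2, and 3.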

To utilize the StandardScaler tool, we specify the input column containing the features to be scaled using inputCol, and we indicate the output column where the scaled features will be stored using outputCol. Once fitted, the StandardScalerModel can transform a Vector column in a dataset, producing scaled features.
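
As a quick illustration of this fit-then-transform workflow, here is a minimal sketch on a small made-up DataFrame of dense vectors (the toy variable name and the feature values are purely illustrative):

# Minimal StandardScaler sketch on hypothetical vector data
from pyspark.sql import SparkSession
from pyspark.ml.feature import StandardScaler
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Made-up feature vectors whose two dimensions sit on very different scales
toy = spark.createDataFrame(
    [(Vectors.dense([10.0, 0.1]),),
     (Vectors.dense([20.0, 0.3]),),
     (Vectors.dense([30.0, 0.5]),)],
    ["features"],
)

# fit computes the column statistics; transform applies the scaling
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=True)
scalerModel = scaler.fit(toy)
scalerModel.transform(toy).show(truncate=False)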

Now let’s see the StandardScaler tool in action on the Amazon product reviews dataset.

Note: Before we demonstrate the StandardScaler tool, we’ll perform the necessary data preparation steps, which include tokenization and feature transformation using HashingTF and IDF.

# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StandardScaler

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Read the dataset
print("Loading the Amazon Product Reviews dataset")
df = spark.read.csv("Amazon_Reviews.csv", header=True, inferSchema=True)

# Tokenize the text
print("Tokenizing the text")
tokenizer = Tokenizer(inputCol="Review", outputCol="words")
words = tokenizer.transform(df)

# Apply the HashingTF Transformer
print("Applying HashingTF transformation to convert words into features")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(words)

# Apply IDF to rescale the raw TF
print("Applying IDF to rescale the raw features")
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)

# Transform the featurized data
print("Transforming the rescaled featured data")
rescaledData = idfModel.transform(featurizedData)

# StandardScaler
print("StandardScaler normalization of the rescaled features")
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
scalerModel = scaler.fit(rescaledData)
scaledData = scalerModel.transform(rescaledData)

# Show the first row of scaled data
scaledData.show(1)

Here’s a breakdown of the code:

  • Lines 2–6: Import the necessary modules and create a SparkSession.
  • Line 10: Read the dataset of Amazon reviews from the “Amazon_Reviews.csv” file.
  • Lines 14–15: Tokenize the sentences by
...