Feature Transformation Techniques in PySpark MLlib
Learn different feature transformation techniques in PySpark MLlib.
Feature transformation is a critical step in the ML process that involves converting features from one representation to another to make them suitable for ML algorithms. PySpark MLlib
provides a wide range of feature transformation techniques to assist in this process.
Here is an overview of some key feature transformation methods:
Tokenization
Tokenization is a feature transformation method that splits a text column into individual terms (usually words). We used this in the previous lesson, so we won’t spend much time on it here. In this example, the input column is specified as “Review,” and the output column is specified as “words.” We apply the transformation to the DataFrame using the `transform` method and store the result in the `words` DataFrame.
The example below shows how to split sentences into sequences of words using the Amazon product reviews dataset.
# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Read the dataset
print("Loading the Amazon Product Reviews dataset")
df = spark.read.csv("Amazon_Reviews.csv", header=True, inferSchema=True)

# Tokenize the text
print("Tokenizing the text")
tokenizer = Tokenizer(inputCol="Review", outputCol="words")
words = tokenizer.transform(df)

# Loop through the words DataFrame and print tokenized words
print("Tokenized words:")
for row in words.collect():
    print(row["words"])
Here’s a breakdown of the code:
- Lines 1–3: Import the required libraries, including `SparkSession` and `Tokenizer`.
- Line 6: Create a `SparkSession` using the `builder` pattern.
- Line 10: Read the dataset from the “Amazon_Reviews.csv” file.
- Lines 14–15: Tokenize the text using the `Tokenizer` Transformer.
- Lines 18–20: Print the tokenized text.
Scaling and normalization
Data normalization, or feature normalization, is a crucial technique in feature transformation. It’s employed to bring all values of a particular feature or features onto the same scale. This standardization is especially important for certain machine learning algorithms that perform optimally when features share a similar scale. It prevents a single feature from dominating the learning process due to its larger magnitude.
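To make the scale problem concrete, consider a hypothetical pair of rows with an “age” feature in the tens and an “income” feature in the tens of thousands. The plain-Python arithmetic below shows how the larger feature dominates a distance computation between the two rows.

# Hypothetical toy values: age varies over tens, income over tens of thousands
age_a, income_a = 25.0, 40000.0
age_b, income_b = 55.0, 42000.0

# Squared Euclidean distance between the two rows
squared_distance = (age_a - age_b) ** 2 + (income_a - income_b) ** 2
print(squared_distance)                         # 4000900.0
print((age_a - age_b) ** 2 / squared_distance)  # the 30-year age gap contributes only ~0.02%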
Let’s explore some of the most commonly used data normalization methods.
StandardScaler
PySpark’s `StandardScaler` is a powerful tool for transforming a dataset by normalizing each feature to have a mean of zero and a standard deviation of one. This is particularly valuable when dealing with features of varying scales because the performance of machine learning algorithms can be influenced by feature scaling. By scaling features to have a mean of zero and a standard deviation of one, we ensure that all features are treated equally during the learning process. The `StandardScaler` acts as an Estimator: it can be fit on a dataset to compute summary statistics, resulting in a `StandardScalerModel`.
To utilize the `StandardScaler`, we specify the input column containing the features to be scaled using `inputCol`, and we indicate the output column where the scaled features will be stored using `outputCol`. Once fitted, the `StandardScalerModel` can transform a Vector column in a dataset, producing scaled features.
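Before applying this to the review data, here’s a minimal, self-contained sketch of that fit/transform pattern on a hypothetical DataFrame that already has a `features` vector column; with `withMean=False`, each feature is simply divided by its column’s standard deviation.

# A minimal sketch of the Estimator -> Model -> transform pattern (toy data)
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StandardScaler

spark = SparkSession.builder.getOrCreate()

toy_df = spark.createDataFrame(
    [(Vectors.dense([1.0, 200.0]),),
     (Vectors.dense([2.0, 400.0]),),
     (Vectors.dense([3.0, 600.0]),)],
    ["features"],
)

# Fitting the Estimator computes the per-feature summary statistics
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
scalerModel = scaler.fit(toy_df)

# The resulting StandardScalerModel scales each feature to unit standard deviation
scalerModel.transform(toy_df).show(truncate=False)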
Let’s see the `StandardScaler` in action using the Amazon product reviews dataset.
Note: Before we demonstrate the `StandardScaler`, we’ll perform the necessary data preparation steps, which include tokenization and feature transformation using `HashingTF` and `IDF`.
# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StandardScaler

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Read the dataset
print("Loading the Amazon Product Reviews dataset")
df = spark.read.csv("Amazon_Reviews.csv", header=True, inferSchema=True)

# Tokenize the text
print("Tokenizing the text")
tokenizer = Tokenizer(inputCol="Review", outputCol="words")
words = tokenizer.transform(df)

# Apply the `HashingTF` Transformer
print("Applying HashingTF transformation to convert words into features")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(words)

# Apply IDF to rescale the raw TF
print("Applying IDF to rescale the raw features")
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)

# Transform the featurized data
print("Transforming the rescaled featured data")
rescaledData = idfModel.transform(featurizedData)

# StandardScaler
print("StandardScaler normalization of the rescaled features")
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
scalerModel = scaler.fit(rescaledData)
scaledData = scalerModel.transform(rescaledData)

# Show the first row of scaled data
scaledData.show(1)
Here’s a breakdown of the code:
- Lines 2–6: Import the necessary modules and create a SparkSession.
- Line 10: Read the dataset of Amazon reviews from the “Amazon_Reviews.csv” file.
- Lines 14–15: Tokenize the sentences by splitting the review text into individual words using the `Tokenizer` Transformer.