Feature Extraction Techniques in PySpark MLlib
Learn about the functions PySpark MLlib provides for feature extraction.
Feature extraction is a crucial step in machine learning, where we transform raw data into a format that ML algorithms can easily understand and process. It plays a vital role because it turns raw data into meaningful representations that capture the underlying patterns and characteristics of the data. These extracted features can then serve as input to ML models for training and prediction. PySpark MLlib provides a rich set of feature extraction techniques to handle diverse types of data.
Here are some common feature extraction techniques available in PySpark MLlib.
Hashing-TF (Hashing Term Frequency)
Hashing-TF is a technique for feature extraction in natural language processing (NLP) and text analysis. It is particularly useful when we want to convert a collection of text documents into numerical feature vectors efficiently while keeping memory usage low.
In Hashing-TF, the process involves two key steps:
- Tokenization: Tokenization is the process of breaking the text down into its constituent parts, such as words or phrases.
- Hashing: Instead of maintaining a vocabulary of all unique terms in the corpus and assigning a unique index to each term, Hashing-TF uses a hash function to map each term directly to an index in a fixed-length vector (a fixed number of dimensions). Because the mapping is deterministic, the same term always lands on the same index; the hash function hashes terms into a fixed range of indices, and collisions (multiple terms mapping to the same index) are simply accepted as a trade-off. The sketch below illustrates the idea.
The resulting feature vectors are often sparse, meaning that most of the dimensions will be zero, except for the dimensions corresponding to the hashed indices of the terms present in the document. This can be advantageous in cases where memory efficiency is crucial.
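To make the hashing step concrete, here is a minimal sketch of the hashing trick in plain Python. It is an illustration only: it uses Python's built-in hash function (whose output for strings varies between runs), whereas MLlib's HashingTF uses MurmurHash3 under the hood, so the actual indices will differ.

# Illustrative sketch of the hashing trick -- not MLlib's actual
# implementation (HashingTF uses MurmurHash3, not Python's hash()).
num_features = 20

def term_to_index(term):
    # Map a term to a bucket in [0, num_features); within a run,
    # the same term always lands in the same bucket.
    return hash(term) % num_features

tokens = ["great", "product", "great", "price"]
vector = [0.0] * num_features
for token in tokens:
    vector[term_to_index(token)] += 1.0  # accumulate raw term frequency

# Most entries stay zero -- the vector is sparse.
print(vector)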
The example below shows how to convert the tokenized words into feature vectors using the Amazon product reviews dataset.
# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Read the Dataset
print("Loading the Amazon Product Reviews dataset")
df = spark.read.csv("Amazon_Reviews.csv", header=True, inferSchema=True)

# Tokenize the text
print("Tokenizing the text")
tokenizer = Tokenizer(inputCol="Review", outputCol="words")
words = tokenizer.transform(df)

# Apply the `HashingTF` transformer
print("Applying HashingTF transformation to convert words into features")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(words)
featurizedData.show()
Here’s a breakdown of the code:
- Lines 1–3: Import the required libraries, including `SparkSession`, `Tokenizer`, and `HashingTF`.
- Line 6: Create a `SparkSession` using the `builder` pattern.
- Line 10: Read the dataset from the "Amazon_Reviews.csv" file.
- Lines 14–15: Tokenize the text using the `Tokenizer` transformer.
- Lines 19–20: Apply the `HashingTF` transformation to convert the tokenized words into raw features.
- Line 21: Print the extracted "rawFeatures" column along with the other columns.
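If you don't have the Amazon_Reviews.csv file at hand, the following self-contained variant runs the same pipeline on a tiny in-memory DataFrame; the two sample reviews are invented purely for illustration:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF

spark = SparkSession.builder.getOrCreate()

# A tiny, made-up stand-in for the Amazon product reviews dataset
df = spark.createDataFrame(
    [("Great product, great price",),
     ("Terrible quality, would not buy again",)],
    ["Review"],
)

tokenizer = Tokenizer(inputCol="Review", outputCol="words")
words = tokenizer.transform(df)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
hashingTF.transform(words).show(truncate=False)

Each value in "rawFeatures" is a sparse vector of length 20. With `numFeatures` set this low, distinct words will often collide into the same bucket; that is the memory-for-accuracy trade-off `HashingTF` makes, and much larger sizes (MLlib's default is 2^18) keep collisions rare in practice.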
TF-IDF (Term Frequency-Inverse Document Frequency)
Inverse document frequency (IDF) is a metric that quantifies the information content of a term in a document corpus. It is calculated by taking the logarithm of the ratio between the total number of documents in ...
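In PySpark MLlib, `IDF` is an estimator that is fit on a corpus of term-frequency vectors (such as the output of `HashingTF`) and then rescales them by each term's inverse document frequency. Here is a minimal sketch, continuing from the `featurizedData` produced in the example above:

from pyspark.ml.feature import IDF

# Fit the IDF model on the raw term-frequency vectors...
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)

# ...then rescale them so that terms appearing in many documents
# receive a lower weight.
rescaledData = idfModel.transform(featurizedData)
rescaledData.select("rawFeatures", "features").show(truncate=False)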