...

Feature Extraction Techniques in PySpark MLlib

Understand the functions that PySpark MLlib provides for feature extraction.

Feature extraction is a crucial step in machine learning, where we transform raw data into a format that ML algorithms can easily understand and process. It produces meaningful representations that capture the underlying patterns and characteristics of the data, and these extracted features can then be used as input to ML models for training and prediction. PySpark MLlib provides a rich set of feature extraction techniques to handle diverse types of data.


Here are some common feature extraction techniques available in PySpark MLlib.

Hashing-TF (Hashing Term Frequency)

Hashing-TF is a technique used for feature extraction in NLP and text analysis. It is particularly useful when we want to convert a collection of text documents into numerical feature vectors efficiently while minimizing memory usage.

In Hashing-TF, the process involves two key steps:

  • Tokenization: The process of breaking the text down into its constituent parts, such as words or phrases.

  • Hashing: Instead of maintaining a vocabulary of all unique terms in the corpus and assigning a unique index to each term, Hashing-TF applies a hash function to map each term directly to an index in a fixed-length vector (a fixed number of dimensions). The same term always maps to the same index, and collisions (multiple terms hashing to the same index) are simply tolerated: the colliding terms share a dimension, and their counts are combined.

The resulting feature vectors are often sparse, meaning that most of the dimensions will be zero, except for the dimensions corresponding to the hashed indices of the terms present in the document. This can be advantageous in cases where memory efficiency is crucial.
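To make the hashing step concrete, here is a minimal sketch in plain Python (not MLlib's actual implementation, which uses a deterministic MurmurHash3-based hash); the example terms and the 20-dimension vector size are illustrative assumptions:

# Illustrative sketch of the hashing trick; not MLlib's implementation.
# Python's built-in hash() is salted per process, so indices vary
# between runs; MLlib instead uses a deterministic hash (MurmurHash3).
num_features = 20  # mirrors numFeatures=20 in the example below

def hashing_tf(tokens):
    vector = [0.0] * num_features  # most entries stay zero (sparse)
    for token in tokens:
        index = hash(token) % num_features  # same term -> same index
        vector[index] += 1.0  # colliding terms share a dimension
    return vector

print(hashing_tf("great product great price".split()))

Because “great” appears twice, its index accumulates a count of 2.0, while every other dimension remains 0.0.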

The example below shows how to convert the tokenized words into feature vectors using the Amazon product reviews dataset.

# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Read the Dataset
print("Loading the Amazon Product Reviews dataset")
df = spark.read.csv("Amazon_Reviews.csv", header=True, inferSchema=True)

# Tokenize the text
print("Tokenizing the text")
tokenizer = Tokenizer(inputCol="Review", outputCol="words")
words = tokenizer.transform(df)

# Apply the `HashingTF` transformer
print("Applying HashingTF transformation to convert words into features")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(words)
featurizedData.show()

Here’s a breakdown of the code:

  • Lines 1–3: Import the required libraries, including SparkSession, Tokenizer, and HashingTF.
  • Line 6: Create a SparkSession using the builder pattern.
  • Line 10: Read the dataset from the “Amazon_Reviews.csv” file.
  • Lines 14–15: Tokenize the text using the Tokenizer transformer.
  • Lines 19–20: Apply the HashingTF transformation to convert the tokenized words into raw features.
  • Line 21: Display the extracted “rawFeatures” along with the other columns.
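The “rawFeatures” column holds sparse vectors in MLlib's (size, [indices], [values]) notation. As a purely hypothetical illustration (the review text, indices, and counts here are made up, and the real values depend on the dataset and the hash function), a row of the output might look like this:

+--------------------+--------------------+--------------------+
|              Review|               words|         rawFeatures|
+--------------------+--------------------+--------------------+
|Great product, fa...|[great, product,...|(20,[1,5,9],[2.0,...|
+--------------------+--------------------+--------------------+

Here, 20 is the vector size set by numFeatures, the first list holds the hashed indices of the terms present in the review, and the second list holds their counts.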

TF-IDF (term frequency-inverse document frequency)

Inverse document frequency (IDF) is a metric that quantifies the information content of a term in a document corpus. It is calculated by taking the logarithm of the ratio between the total number of documents in ...
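Spark's IDF uses the smoothed formula log((|D| + 1) / (DF(t, D) + 1)), where |D| is the total number of documents and DF(t, D) is the number of documents containing the term t. As a minimal sketch of the standard MLlib usage (reusing featurizedData and the column names from the HashingTF example above):

from pyspark.ml.feature import IDF

# Fit an IDF model on the raw term-frequency vectors from HashingTF
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)

# Rescale each term frequency by its inverse document frequency
rescaledData = idfModel.transform(featurizedData)
rescaledData.select("rawFeatures", "features").show()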