Feature Extraction Techniques in PySpark MLlib
Learn about the functions PySpark MLlib provides for feature extraction.
Feature extraction is a crucial step in machine learning, where we transform raw data into a format that ML algorithms can easily understand and process. It plays a vital role because it turns raw data into meaningful representations that capture the underlying patterns and characteristics of the data. These extracted features can then serve as input to ML models for training and prediction. PySpark MLlib provides a rich set of feature extraction techniques to handle diverse types of data.
Here are some common feature extraction techniques available in PySpark MLlib.
Hashing-TF (Hashing Term Frequency)
Hashing-TF is a technique for feature extraction in natural language processing (NLP) and text analysis. It is particularly useful when we want to convert a collection of text documents into numerical feature vectors efficiently while keeping memory usage low.
In Hashing-TF, the process involves two key steps:
- Tokenization: Tokenization is the process of breaking the text down into its constituent parts, such as words or phrases.
- Hashing: Instead of maintaining a vocabulary of all unique terms in the corpus and assigning a unique index to each term, Hashing-TF uses a hash function to map each term directly to an index in a fixed-length vector (a fixed number of dimensions). Because the mapping is deterministic, the same term always lands on the same index; the hash function hashes terms into a fixed range of indices, and collisions (multiple terms mapping to the same index) are simply accepted as a trade-off. The sketch below illustrates the idea.
The resulting feature vectors are often sparse, meaning that most of the dimensions will be zero, except for the dimensions corresponding to the hashed indices of the terms present in the document. This can be advantageous in cases where memory efficiency is crucial.
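To make the hashing step concrete, here is a minimal sketch of the hashing trick in plain Python. It is an illustration only: it uses Python's built-in hash function (whose output for strings varies between runs), whereas MLlib's HashingTF uses MurmurHash3 under the hood, so the actual indices will differ.

# Illustrative sketch of the hashing trick -- not MLlib's actual
# implementation (HashingTF uses MurmurHash3, not Python's hash()).
num_features = 20

def term_to_index(term):
    # Map a term to a bucket in [0, num_features); within a run,
    # the same term always lands in the same bucket.
    return hash(term) % num_features

tokens = ["great", "product", "great", "price"]
vector = [0.0] * num_features
for token in tokens:
    vector[term_to_index(token)] += 1.0  # accumulate raw term frequency

# Most entries stay zero -- the vector is sparse.
print(vector)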
The example below shows how to convert the tokenized words into feature vectors using the Amazon product reviews dataset.
# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Read the Dataset
print("Loading the Amazon Product Reviews dataset")
df = spark.read.csv("Amazon_Reviews.csv", header=True, inferSchema=True)

# Tokenize the text
print("Tokenizing the text")
tokenizer = Tokenizer(inputCol="Review", outputCol="words")
words = tokenizer.transform(df)

# Apply the `HashingTF` transformer
print("Applying HashingTF transformation to convert words into features")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(words)
featurizedData.show()
Here’s a breakdown of the code:
- Lines 1–3: Import the required libraries, including `SparkSession`, `Tokenizer`, and `HashingTF`.
- Line 6: Create a `SparkSession` using the `builder` pattern.
- Line 10: Read the dataset from the "Amazon_Reviews.csv" file.
- Lines 14–15: Tokenize the text using the `Tokenizer` transformer.
- Lines 19–20: Apply the `HashingTF` transformation to convert the tokenized words into raw features.
- Line 21: Print the extracted "rawFeatures" column along with the other columns.
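If you don't have the Amazon_Reviews.csv file at hand, the following self-contained variant runs the same pipeline on a tiny in-memory DataFrame; the two sample reviews are invented purely for illustration:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF

spark = SparkSession.builder.getOrCreate()

# A tiny, made-up stand-in for the Amazon product reviews dataset
df = spark.createDataFrame(
    [("Great product, great price",),
     ("Terrible quality, would not buy again",)],
    ["Review"],
)

tokenizer = Tokenizer(inputCol="Review", outputCol="words")
words = tokenizer.transform(df)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
hashingTF.transform(words).show(truncate=False)

Each value in "rawFeatures" is a sparse vector of length 20. With `numFeatures` set this low, distinct words will often collide into the same bucket; that is the memory-for-accuracy trade-off `HashingTF` makes, and much larger sizes (MLlib's default is 2^18) keep collisions rare in practice.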
TF-IDF (Term Frequency-Inverse Document Frequency)
Inverse document frequency (IDF) is a metric that quantifies the information content of a term in a document corpus. It is calculated by taking the logarithm of the ratio between the total number of documents in ...
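In PySpark MLlib, `IDF` is an estimator that is fit on a corpus of term-frequency vectors (such as the output of `HashingTF`) and then rescales them by each term's inverse document frequency. Here is a minimal sketch, continuing from the `featurizedData` produced in the example above:

from pyspark.ml.feature import IDF

# Fit the IDF model on the raw term-frequency vectors...
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)

# ...then rescale them so that terms appearing in many documents
# receive a lower weight.
rescaledData = idfModel.transform(featurizedData)
rescaledData.select("rawFeatures", "features").show(truncate=False)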