Feature Extraction
Learn how to extract features from data using scikit-learn.
In ML, features are the variables used for making predictions. Feature extraction involves transforming raw data into a set of features that can be used for training ML models. The scikit-learn library provides several methods for feature extraction, including feature hashing and text feature extraction. Because most ML models provided by scikit-learn need numerical data as input, you need to convert nonnumerical data before training the models.
Feature hashing for categorical data
Feature hashing, also known as the hashing trick, is a method for transforming categorical variables into numerical features. The idea behind feature hashing is to map each categorical value to a column index using a hash function and then use the resulting numerical vectors as input features for ML models. Because the hash maps values into a fixed number of columns, distinct values can occasionally collide, trading a small amount of information for memory efficiency.
In other words, it’s a feature extraction method that converts categorical data into numerical data. This matters because scikit-learn models cannot consume categorical data directly as text: the data must be converted to numbers first.
This method works by applying a hash function to the categorical data and mapping the hash values to a lower-dimensional feature space. The number of features in this lower-dimensional space can be specified, and the output is a sparse matrix.
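A minimal sketch of this in scikit-learn, using FeatureHasher with made-up categorical records (the column names and values are illustrative):

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical categorical records; each dict is one sample.
data = [
    {"color": "red", "size": "small"},
    {"color": "blue", "size": "large"},
    {"color": "green", "size": "small"},
]

# Hash each (feature, value) pair into a fixed 8-dimensional space.
hasher = FeatureHasher(n_features=8, input_type="dict")
X = hasher.transform(data)

print(X.shape)  # (3, 8): one row per sample, n_features columns
print(type(X))  # a scipy sparse matrix
```

Note that n_features is chosen up front: a larger value reduces the chance of hash collisions at the cost of a wider (though still sparse) matrix.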
The advantage of using FeatureHasher is its ability to reduce the dimensionality of datasets with many unique categories. When applied to categorical data, the hash function produces a compact numerical representation, and the resulting sparse matrix can be used as input for ML algorithms. However, it’s worth noting that ...