Feature Space
Learn feature engineering techniques by implementing feature space exploration, subspace analysis, and feature transformation.
In both supervised learning and clustering, we have used the term “data point.” Specifically, each data point $\mathbf{x}$ in the training dataset is represented as a $d$-dimensional vector in $\mathbb{R}^d$ such that $\mathbf{x} = [x_1, x_2, \dots, x_d]^\top$. The elements of $\mathbf{x}$ are referred to as features. Therefore, $\mathbf{x}$ is also referred to as a feature vector.
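As a minimal sketch of this idea, the snippet below stacks a few feature vectors into a NumPy array with one row per data point. The values and the names `X`, `n`, and `d` are illustrative assumptions, not taken from any particular dataset.

```python
import numpy as np

# Three hypothetical data points, each with d = 4 numeric features.
x1 = np.array([5.1, 3.5, 1.4, 0.2])
x2 = np.array([6.2, 2.9, 4.3, 1.3])
x3 = np.array([7.7, 3.0, 6.1, 2.3])

# Stack the feature vectors row-wise: one row per data point.
X = np.vstack([x1, x2, x3])

n, d = X.shape  # n data points, d features per point
print(f"n = {n}, d = {d}")
print("Second feature vector:", X[1])
```

Each row of `X` is one feature vector, and each column holds the values of one feature across all data points.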
Feature space
A feature space is a mathematical space that represents the features or attributes of a given dataset. Each observation in the dataset is represented by a vector in the feature space, where each dimension of the vector corresponds to a specific feature or attribute of the observation.
For example, let’s say we have a dataset of cars, and each car is described by its make, model, year, horsepower, and fuel efficiency. The feature space for this dataset would be a five-dimensional space, where each dimension corresponds to one of these features. By analyzing the patterns and relationships among the feature vectors in the feature space, we can gain insights into the underlying structure and characteristics of the dataset.
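To make the car example concrete, here is a small sketch that turns a few car records into numeric feature vectors. The records are hypothetical, and mapping each make and model to an integer code is only one possible encoding, chosen here as an assumption so that the feature space stays five-dimensional.

```python
import numpy as np

# Hypothetical car records: (make, model, year, horsepower, mpg).
cars = [
    ("Toyota", "Corolla", 2015, 132, 29),
    ("Honda", "Accord", 2020, 252, 33),
    ("Ford", "Mustang", 2010, 315, 22),
]

makes = np.array([c[0] for c in cars])
models = np.array([c[1] for c in cars])

# Assumption: encode each category as an integer code.
# np.unique returns the sorted unique labels and, for each entry,
# the index of its label in that sorted list.
_, make_codes = np.unique(makes, return_inverse=True)
_, model_codes = np.unique(models, return_inverse=True)

# Year, horsepower, and fuel efficiency are already numeric.
numeric = np.array([c[2:] for c in cars], dtype=float)

# One row per car, one column per feature.
X = np.column_stack([make_codes, model_codes, numeric])
print(X.shape)
print(X)
```

Integer codes impose an arbitrary ordering on the categories. One-hot encoding is a common alternative, but it adds one dimension per category, so the feature space would no longer be five-dimensional.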
The feature space, denoted here by $\mathcal{X}$, is the vector space that contains the feature vectors of a dataset. The dataset $D = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ is a sample collected from this feature space, where $\mathbf{x}_i \in \mathcal{X}$. In most cases, the feature space is a subspace of $\mathbb{R}^d$ representing the underlying structure of the dataset.
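One way to see what it means for feature vectors to occupy a lower-dimensional subspace is to compare the number of features with the rank of the data matrix. The sketch below uses a hypothetical dataset constructed so that the third feature is the sum of the first two. The data and the variable names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical dataset: 4 points with 3 features each,
# constructed so that feature 3 = feature 1 + feature 2.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [3.0, 1.0, 4.0],
])

d = X.shape[1]                   # number of features (ambient dimension)
rank = np.linalg.matrix_rank(X)  # dimension of the span of the rows

print(f"ambient dimension d = {d}")
print(f"dimension of the spanned subspace = {rank}")
```

Because one feature is a linear combination of the others, the rank is smaller than the number of features: the points lie on a plane through the origin inside the three-dimensional space.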
Subspace example
Consider a dataset of cars $D = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, where each $\mathbf{x}_i$ is a vector of features that describe a car, such as its make, model, year, horsepower, and fuel efficiency. Let’s say we want to keep only the cars with horsepower greater than 250. We can define a subspace (used here loosely to mean a subset of the feature space, not a linear subspace) that contains only the vectors whose horsepower feature is greater than 250, with the following code:
```python
import numpy as np

# Example feature vectors for cars dataset
# [Make, Model, Year, Horsepower, Fuel efficiency (mpg)]
car1 = np.array(['Toyota', 'Corolla', 2015, 132, 29])
car2 = np.array(['Honda', 'Accord', 2020, 252, 33])
car3 = np.array(['Tesla', 'Model S', 2018, 518, 98])
car4 = np.array(['Ford', 'Mustang', 2010, 315, 22])
car5 = np.array(['Chevrolet', 'Impala', 2012, 300, 23])

# Create array of feature vectors
cars = np.array([car1, car2, car3, car4, car5])

# Define subspace of cars with horsepower greater than 250
horsepower_subspace = cars[cars[:, 3].astype(int) > 250, :]

# Print subspace
print("Subspace of cars with horsepower greater than 250:\n")
print(horsepower_subspace)
```
Here is the explanation for the code above:
- Lines 5–9: We define five example car feature vectors by hand. Because each array mixes strings and numbers, NumPy stores every element as a string.
- Line 12: We stack the vectors into a two-dimensional array in which each row holds one car’s features.
- Line 15: We define a subspace of cars with horsepower greater than 250 by casting the horsepower column to integers and keeping the rows where that value exceeds 250.
- Lines 18–19: We print the subspace.
Note: In real-world datasets, the feature space is a subspace, but its dimension refers to the number of components in the feature vector, say $d$. Therefore, we can think of it as a $d$-dimensional vector space that contains the feature vectors.
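The note above can be checked with a short sketch: selecting a subset of cars reduces the number of data points but leaves the number of components per feature vector unchanged. Keeping only the numeric features (year, horsepower, fuel efficiency) is an assumption made here to avoid casting strings, and the constant `HORSEPOWER` is a hypothetical name.

```python
import numpy as np

# Numeric features only: [year, horsepower, mpg],
# using the same values as the cars example.
X = np.array([
    [2015, 132, 29],
    [2020, 252, 33],
    [2018, 518, 98],
    [2010, 315, 22],
    [2012, 300, 23],
], dtype=float)

HORSEPOWER = 1  # column index of the horsepower feature

mask = X[:, HORSEPOWER] > 250  # boolean mask, one entry per car
subset = X[mask]               # keep only the rows where the mask is True

print("full dataset:", X.shape)
print("filtered subset:", subset.shape)
```

The second entry of each shape, the number of features, is the same for the full dataset and the filtered subset. Only the number of rows changes.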
Feature transformations
Consider the regression dataset with ...