Feature Space

Learn feature engineering techniques by implementing feature space exploration, subspace analysis, and feature transformation.

In both supervised learning and clustering, the term “data point” has been used. Specifically, each data point $\bold x$ within the training dataset is represented as a $d$-dimensional vector, such that $\bold x \in \R^d$. The elements of $\bold x$ are referred to as features. Therefore, $\bold x$ is also referred to as a feature vector.
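As a minimal sketch, a single data point can be stored as a NumPy array whose length equals the number of features $d$ (the specific features and values below are assumptions chosen for illustration):

import numpy as np

# A single data point x with d = 4 numeric features
# (illustrative features: height, weight, age, income)
x = np.array([1.75, 68.0, 29, 42000.0])

d = x.shape[0]  # number of features, i.e., the dimension of the feature vector
print("Feature vector x:", x)
print("x lives in R^d with d =", d)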

Feature space

A feature space is a mathematical space that represents the features or attributes of a given dataset. Each observation in the dataset is represented by a vector in the feature space, where each dimension of the vector corresponds to a specific feature or attribute of the observation.

For example, let’s say we have a dataset of cars, and each car is described by its make, model, year, horsepower, and fuel efficiency. The feature space for this dataset would be a five-dimensional space, where each dimension corresponds to one of these features. By analyzing the patterns and relationships among the feature vectors in the feature space, we can gain insights into the underlying structure and characteristics of the dataset.

The feature space is the vector space that contains the feature vectors of a dataset. The dataset is a sample collected from this feature space, where $\bold x \in \R^d$. In most cases, the feature space is a subspace of $\R^d$ that represents the underlying structure of the dataset.

Subspace example

Consider a dataset of cars $D = \{(\bold x_1, y_1), (\bold x_2, y_2), \dots, (\bold x_n, y_n)\}$, where each $\bold x_i \in \R^d$ is a vector of features that describe a car, such as its make, model, year, horsepower, and fuel efficiency. Let's say we want to define a subspace that only includes cars with horsepower greater than 250. We can build this subspace, which contains only the vectors whose horsepower feature exceeds 250, with the following code:

import numpy as np

# Example feature vectors for cars dataset
# [Make, Model, Year, Horsepower, Fuel efficiency (mpg)]
car1 = np.array(['Toyota', 'Corolla', 2015, 132, 29])
car2 = np.array(['Honda', 'Accord', 2020, 252, 33])
car3 = np.array(['Tesla', 'Model S', 2018, 518, 98])
car4 = np.array(['Ford', 'Mustang', 2010, 315, 22])
car5 = np.array(['Chevrolet', 'Impala', 2012, 300, 23])

# Create array of feature vectors (one row per car)
cars = np.array([car1, car2, car3, car4, car5])

# Define subspace of cars with horsepower greater than 250
horsepower_subspace = cars[cars[:, 3].astype(int) > 250, :]

# Print subspace
print("Subspace of cars with horsepower greater than 250:\n")
print(horsepower_subspace)

Here is the explanation for the code above:

  • Lines 5–9: We define example feature vectors for five cars.
  • Line 12: We stack the feature vectors into a 2D array, where each row represents a car’s features.
  • Line 15: We define a subspace of cars with horsepower greater than 250 by filtering the rows of the array where the horsepower feature (the fourth column) is greater than 250.
  • Lines 18–19: We print the subspace.

Note: In real-world datasets, the feature space is usually a subspace of $\R^d$, but its dimension refers to the number of components in the feature vector, say $d$. Therefore, we can think of it as a $d$-dimensional vector space that contains the feature vectors.
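To make this concrete, here is a small illustrative check with NumPy: the dimension of the feature space equals the number of components in each feature vector, which we can read off the shape of the data array (the numeric columns below are assumptions for illustration):

import numpy as np

# Numeric feature vectors for three cars: [Year, Horsepower, Fuel efficiency (mpg)]
cars_numeric = np.array([
    [2015, 132, 29],
    [2020, 252, 33],
    [2018, 518, 98],
])

n, d = cars_numeric.shape
print("Number of data points n =", n)           # 3
print("Dimension of the feature space d =", d)  # 3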

Feature transformations

Consider the regression dataset with $d = 1$ ...
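As a sketch of what a feature transformation can look like, a one-dimensional input $x$ can be lifted into a higher-dimensional feature space, for example with the polynomial map $\phi(x) = [x, x^2, x^3]$. The specific map and data below are assumptions chosen for illustration, not necessarily the transformation developed next:

import numpy as np

# One-dimensional regression inputs (d = 1); values are made up for illustration
x = np.array([0.5, 1.0, 1.5, 2.0])

# Assumed polynomial feature transformation phi(x) = [x, x^2, x^3],
# mapping each data point from R^1 into R^3
X_transformed = np.column_stack([x, x**2, x**3])

print("Original shape:", x.reshape(-1, 1).shape)  # (4, 1)
print("Transformed shape:", X_transformed.shape)  # (4, 3)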