In this article, we will look at what dimensionality reduction is and why it is important.
To understand dimensionality, we need to understand what a dataset in Machine Learning (ML) is.
A dataset is simply a collection of data. Many ML projects use tabular data, that is, data organized into rows and columns, like a spreadsheet. Dimensionality refers to the number of features, or columns, a dataset has. For example, a dataset with 10 columns has a dimensionality of 10.
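As a quick illustration, here is a minimal sketch (with made-up values, not the dataset used later) showing that the number of columns in a pandas DataFrame is its dimensionality:

import pandas as pd

# A tiny tabular dataset: each column is one feature
people = pd.DataFrame({
    'age': [25, 32, 47],
    'height_m': [1.60, 1.75, 1.68],
    'weight_kg': [60, 80, 72],
})

# The dimensionality is the number of columns
print(people.shape[1])  # 3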
Dimensionality reduction is the process of transforming data from a high-dimensional space to a low-dimensional space. It can also refer to the family of techniques used to reduce the number of input features in a dataset.
A dataset might have 4 features, or even 50, but what happens when the number of features grows to 1,000 or even 1 million?
Analyzing high-dimensional data can be computationally expensive and hard to manage, and the more features a dataset has, the easier it is to run into problems during analysis. Dimensionality reduction is therefore important: it reduces the number of features while retaining the important information needed for the data analysis.
Since this post is an introduction to dimensionality reduction, we won’t go into the details of the methods mentioned below. However, you can read more about them in your spare time.
Some dimensionality reduction methods include:
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- t-distributed Stochastic Neighbor Embedding (t-SNE)
- Feature selection techniques, such as SelectKBest
In this section, we will look at dimensionality reduction in Machine Learning in practice. One common route is feature selection (sometimes grouped with feature engineering): choosing a subset of relevant features for use in building a model. Applying feature selection reduces the dimensionality of the dataset.
For the following examples, the breast cancer dataset from scikit-learn's built-in datasets will be used.
import numpy as np
import pandas as pd
import sklearn.datasets as datasets

# Load the breast cancer dataset and place it in a DataFrame
breast_cancer = datasets.load_breast_cancer()
cancer = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
cancer['target'] = breast_cancer.target

print(cancer.shape)
From the code above, we see that the original dataset has 30 features. The remaining column is the target (that is, what we are trying to predict).
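If you want to see what those columns actually are, the object returned by load_breast_cancer also exposes the feature and class names (a small sketch that assumes the code above has already been run):

# The first few of the 30 feature names and the two target classes
print(breast_cancer.feature_names[:5])
print(breast_cancer.target_names)

# How many samples fall into each class (0 = malignant, 1 = benign)
print(cancer['target'].value_counts())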
We can use Principal Component Analysis (PCA) to reduce the dimensionality of our dataset. The advantage of PCA is that it focuses on the principal components that contribute most to the overall variance of the dataset.
Before we use PCA, we must scale the feature data. With scaling, the different variables are placed on a normalized scale. Scaling is important because it removes the dominating impact one variable might have over another because of its range (e.g., a weight of 60 kg seems much higher in magnitude than a height of 1.6 m).
In the following example, we will use the StandardScaler from sklearn. You can read more about it in the scikit-learn documentation.
The following snippet assumes the code from the previous step has already been run.
from sklearn.preprocessing import StandardScaler

cancer_features = cancer.drop('target', axis=1)

scaler = StandardScaler()
scaler.fit(cancer_features)
scaled_data = scaler.transform(cancer_features)

print(scaled_data)
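To confirm the scaling behaved as expected, you can check that each feature now has a mean of roughly 0 and a standard deviation of roughly 1 (a quick sketch building on the snippet above):

# After standardization, every column should have mean ~0 and std ~1
print(scaled_data.mean(axis=0).round(2))
print(scaled_data.std(axis=0).round(2))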
Now that the data has been scaled, we can perform PCA. With the PCA algorithm, you choose the number of components, and you can change this number as you see fit. For this example, we will arbitrarily choose 5 components.
from sklearn.decomposition import PCA

pca = PCA(n_components=5)
pca.fit(scaled_data)
scaled_pca = pca.transform(scaled_data)

print(scaled_data.shape)
print(scaled_pca.shape)
From the code above, we have successfully reduced the dimensionality of the features from 30 to 5 using PCA.
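If you are wondering whether 5 components is enough, the fitted PCA object exposes explained_variance_ratio_, which reports the share of the total variance captured by each component (a sketch building on the code above):

# Fraction of the overall variance captured by each of the 5 components
print(pca.explained_variance_ratio_)

# Total variance retained by the 5 components combined
print(pca.explained_variance_ratio_.sum())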
The SelectKBest function
In sklearn, there is a class called SelectKBest that allows us to select the features with the k highest scores. It computes a scoring metric of our choice, ranks the features by their scores, and keeps the best ones. You can read more about SelectKBest in the scikit-learn documentation.
For the purposes of this example, we will select the best 6 features. Since we have already scaled the data, we will apply SelectKBest to our scaled features. The scoring function we will use is f_classif, which computes the ANOVA F-value between each feature and the label for classification tasks. You can read more about f_classif in the scikit-learn documentation.
import sklearn.feature_selection as fs

# Select the 6 features with the highest ANOVA F-values
best_k = fs.SelectKBest(fs.f_classif, k=6)
best_k.fit(scaled_data, cancer['target'])
best_k_features = best_k.transform(scaled_data)

print(scaled_data.shape)
print(best_k_features.shape)
From the code above, we have successfully reduced the dimensionality of the features from 30 to 6 using SelectKBest.
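To see which of the 30 features were kept, you can map the get_support() mask of the fitted selector back to the column names and inspect the F-scores (a sketch that assumes the code above has been run):

# Boolean mask marking the 6 selected features
mask = best_k.get_support()

# Names of the selected features (the columns line up with scaled_data)
print(cancer_features.columns[mask])

# ANOVA F-scores for all 30 features (higher means more discriminative)
print(best_k.scores_)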
Dimensionality reduction is commonly used in fields that process and work with high volumes of data, such as bioinformatics and signal processing. It is also used for tasks such as noise reduction and visualization.
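As a small taste of the visualization use case, here is a rough sketch (assuming the PCA code above has been run and matplotlib is installed) that plots the data on its first two principal components, colored by class:

import matplotlib.pyplot as plt

# Scatter plot of the first two principal components, colored by target class
plt.scatter(scaled_pca[:, 0], scaled_pca[:, 1],
            c=cancer['target'], cmap='coolwarm', alpha=0.6)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('Breast cancer dataset projected onto two principal components')
plt.show()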
To conclude, you’ve read about what dimensionality is, as it relates to a dataset, and some issues that might arise when a dataset has many features. You’ve also seen the names of some dimensionality reduction methods and explored the implementation of two of them in Python. Finally, you’ve learned some fields that employ dimensionality reduction.
Hopefully, this article was helpful. Thanks for reading and have a great day.