Dimensionality Reduction

Learn how to reduce the number of dimensions in a dataset.

In the simplest terms, dimensionality reduction is the process of reducing the number of features or variables in a dataset while preserving the most important information. This is particularly useful for datasets with a large number of features: it can simplify the data, make it easier to work with, and provide a form of regularization.

Dimensionality reduction can be helpful for many reasons. By reducing the number of features in a dataset, we can reduce the computational complexity of our models, which can lead to faster training and prediction times. It also helps avoid overfitting, which occurs when a model performs well on the training data but fails to generalize to unseen data because it has captured noise or irrelevant patterns. Overfitting is usually caused either by having too many features or by choosing a model that is too complex.

Reducing dimensions can also help us visualize high-dimensional data by projecting it onto lower dimensions. This can be particularly useful for exploring and understanding complex datasets. By reducing dimensions, we can often discover underlying patterns and relationships that may not be apparent in higher dimensions.

Note: We should always scale our data before performing any of these techniques because they are sensitive to differences in scale between variables.
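As a minimal sketch of what this scaling step might look like (using a small, made-up feature matrix), we can standardize each feature to zero mean and unit variance with scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical raw feature matrix: 5 samples, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0],
              [5.0, 900.0]])

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately [0. 0.]
print(X_scaled.std(axis=0))   # approximately [1. 1.]
```

After scaling, every feature contributes on a comparable footing, so no single variable dominates the dimensionality reduction simply because of its units.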

Let’s go through the main dimensionality reduction methods in scikit-learn.

Principal Component Analysis

Principal Component Analysis (PCA) is one of the most basic techniques for dimensionality reduction, and although the mathematical details are out of the scope of this lesson, it’s important to understand the intuition behind it.

PCA works by identifying a new set of variables, called principal components, which are linear combinations of the original variables. These principal components are chosen so that they capture the maximum amount of variance in the data. Let’s build an intuition for how this works.

Imagine there’s a dataset with numerous variables that are highly correlated with each other. This means that there is a lot of redundancy in the data, and it can be hard to identify the underlying patterns. PCA helps us simplify the dataset by creating new variables that are uncorrelated with each other, allowing us to capture the most important information from the original data. These new variables are ordered by importance so that we can focus on the most significant ones and ignore the rest.

Imagine we have two highly correlated variables, X and Y, and we want to replace them with only one variable, which we will call PC1 (where PC stands for “Principal Component”). The sketch below illustrates this idea.
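As a minimal sketch (using small, synthetic data rather than a real dataset), we can fit scikit-learn's PCA to reduce the two correlated variables to a single principal component:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: Y is strongly correlated with X, plus a little noise
rng = np.random.default_rng(42)
X = rng.normal(size=100)
Y = 2 * X + rng.normal(scale=0.1, size=100)
data = np.column_stack([X, Y])

# Scale first, since PCA is sensitive to differences in scale
data_scaled = StandardScaler().fit_transform(data)

# Keep a single principal component (PC1)
pca = PCA(n_components=1)
pc1 = pca.fit_transform(data_scaled)

print(pc1.shape)                      # (100, 1): one value per sample
print(pca.explained_variance_ratio_)  # close to 1.0, since X and Y are highly correlated
```

The explained_variance_ratio_ attribute reports how much of the total variance each component captures; because X and Y are nearly redundant here, PC1 alone captures almost all of it.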
