In this article, we will look at what dimensionality reduction is and why it is important.
To understand dimensionality, we need to understand what a dataset in Machine Learning (ML) is.
A dataset is simply a collection of data. Many ML projects use tabular data, that is, data organized into rows and columns, like a spreadsheet. Dimensionality refers to the number of features, or columns, a dataset has. For example, a dataset with 10 columns has a dimensionality of 10.
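As a quick illustration, here is a minimal sketch (with made-up values, not the dataset used later) showing that the number of columns in a pandas DataFrame is its dimensionality:

import pandas as pd

# A tiny tabular dataset: each column is one feature
people = pd.DataFrame({
    'age': [25, 32, 47],
    'height_m': [1.60, 1.75, 1.68],
    'weight_kg': [60, 80, 72],
})

# The dimensionality is the number of columns
print(people.shape[1])  # 3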
Dimensionality reduction is the process of transforming data from a high-dimensional space to a low-dimensional space. It can also refer to the family of techniques used to reduce the number of input features in a dataset.
A dataset might have 4 features, or even 50, but what happens when the number of features grows to 1,000 or even 1 million?
Analyzing high-dimensional data can be computationally expensive and hard to manage, and the more features a dataset has, the easier it is to run into problems during analysis. Dimensionality reduction is therefore important: it reduces the number of features while retaining the important information needed for the data analysis.
Since this post is an introduction to dimensionality reduction, we won’t go into the details of the methods mentioned below. However, you can read more about them in your spare time.
Some dimensionality reduction methods include:
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- t-distributed Stochastic Neighbor Embedding (t-SNE)
- Feature selection techniques, such as SelectKBest
In this section, we will look at dimensionality reduction in Machine Learning in practice. One common route is feature selection (sometimes grouped with feature engineering): choosing a subset of relevant features for use in building a model. Applying feature selection reduces the dimensionality of the dataset.
For the following examples, the breast cancer dataset from scikit-learn's built-in datasets will be used.
import numpy as np
import pandas as pd
import sklearn.datasets as datasets

# Load the breast cancer dataset and place it in a DataFrame
breast_cancer = datasets.load_breast_cancer()
cancer = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
cancer['target'] = breast_cancer.target

print(cancer.shape)
From the code above, we see that the original dataset has 30 features. The remaining column is the target (that is, what we are trying to predict).
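If you want to see what those columns actually are, the object returned by load_breast_cancer also exposes the feature and class names (a small sketch that assumes the code above has already been run):

# The first few of the 30 feature names and the two target classes
print(breast_cancer.feature_names[:5])
print(breast_cancer.target_names)

# How many samples fall into each class (0 = malignant, 1 = benign)
print(cancer['target'].value_counts())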
We can use Principal Component Analysis (PCA) to reduce the dimensionality of our dataset. The advantage of PCA is that it focuses on the principal components that contribute most to the overall variance of the dataset.
Before we use PCA, we must scale the feature data. With scaling, the different variables are placed on a normalized scale. Scaling is important because it removes the dominating impact one variable might have over another because of its range (e.g., a weight of 60 kg seems much higher in magnitude than a height of 1.6 m).
In the following example, we will use the StandardScaler from sklearn. You can read more about it in the scikit-learn documentation.
The following snippet assumes the code from the previous step has already been run.
from sklearn.preprocessing import StandardScaler

cancer_features = cancer.drop('target', axis=1)

scaler = StandardScaler()
scaler.fit(cancer_features)
scaled_data = scaler.transform(cancer_features)

print(scaled_data)
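To confirm the scaling behaved as expected, you can check that each feature now has a mean of roughly 0 and a standard deviation of roughly 1 (a quick sketch building on the snippet above):

# After standardization, every column should have mean ~0 and std ~1
print(scaled_data.mean(axis=0).round(2))
print(scaled_data.std(axis=0).round(2))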
Now that the data has been scaled, we can perform PCA. With the PCA algorithm, you choose the number of components, and you can change this number as you see fit. For this example, we will arbitrarily choose 5 components.
from sklearn.decomposition import PCA

pca = PCA(n_components=5)
pca.fit(scaled_data)
scaled_pca = pca.transform(scaled_data)

print(scaled_data.shape)
print(scaled_pca.shape)
From the code above, we have successfully reduced the dimensionality of the features from 30 to 5 using PCA.
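If you are wondering whether 5 components is enough, the fitted PCA object exposes explained_variance_ratio_, which reports the share of the total variance captured by each component (a sketch building on the code above):

# Fraction of the overall variance captured by each of the 5 components
print(pca.explained_variance_ratio_)

# Total variance retained by the 5 components combined
print(pca.explained_variance_ratio_.sum())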
The SelectKBest function
In sklearn, there is a class called SelectKBest that allows us to select the features with the k highest scores. It computes a scoring metric of our choice, ranks the features by their scores, and keeps the best ones. You can read more about SelectKBest in the scikit-learn documentation.
For the purposes of this example, we will select the best 6 features. Since we have already scaled the data, we will apply SelectKBest to our scaled features. The scoring function we will use is f_classif, which computes the ANOVA F-value between each feature and the label for classification tasks. You can read more about f_classif in the scikit-learn documentation.
import sklearn.feature_selection as fs

# Select the 6 features with the highest ANOVA F-values
best_k = fs.SelectKBest(fs.f_classif, k=6)
best_k.fit(scaled_data, cancer['target'])
best_k_features = best_k.transform(scaled_data)

print(scaled_data.shape)
print(best_k_features.shape)
From the code above, we have successfully reduced the dimensionality of the features from 30 to 6 using SelectKBest.
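To see which of the 30 features were kept, you can map the get_support() mask of the fitted selector back to the column names and inspect the F-scores (a sketch that assumes the code above has been run):

# Boolean mask marking the 6 selected features
mask = best_k.get_support()

# Names of the selected features (the columns line up with scaled_data)
print(cancer_features.columns[mask])

# ANOVA F-scores for all 30 features (higher means more discriminative)
print(best_k.scores_)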
Dimensionality reduction is commonly used in fields that process and work with high volumes of data, such as bioinformatics and signal processing. It is also used for tasks such as noise reduction and visualization.
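As a small taste of the visualization use case, here is a rough sketch (assuming the PCA code above has been run and matplotlib is installed) that plots the data on its first two principal components, colored by class:

import matplotlib.pyplot as plt

# Scatter plot of the first two principal components, colored by target class
plt.scatter(scaled_pca[:, 0], scaled_pca[:, 1],
            c=cancer['target'], cmap='coolwarm', alpha=0.6)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('Breast cancer dataset projected onto two principal components')
plt.show()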
To conclude, you’ve read about what dimensionality is, as it relates to a dataset, and some issues that might arise when a dataset has many features. You’ve also seen the names of some dimensionality reduction methods and explored the implementation of two of them in Python. Finally, you’ve learned some fields that employ dimensionality reduction.
Hopefully, this article was helpful. Thanks for reading and have a great day.