What is Principal Component Analysis (PCA)?

Introduction

Principal Component Analysis (PCA) is a fundamental method for dimensionality reduction: it reduces the number of variables in a large dataset to a much smaller set without losing much information. Our goal is to simplify the dataset as much as possible while trading off the least amount of information to maintain high accuracy. Reduced datasets are easier to analyze, evaluate, and visualize, which is why PCA is considered a core step in working with big data.

Steps for conducting PCA

PCA uses an orthogonal transformation to convert a set of correlated variables into a set of linearly uncorrelated variables, known as principal components. PCA is a five-step procedure, as explained below.

Step 1: Standardizing the dataset

We will be standardizing the dataset by using the following formula:

x_{new} = \frac{x - \mu}{\sigma}

where,

x = data point value

μ = mean of the feature

σ = standard deviation of the feature
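
As a minimal sketch of this step in R (borrowing the mtcars columns used in the code example later in this article; the variable name X is ours), the built-in scale function does exactly this per feature:

#standardize each feature to zero mean and unit variance
X <- scale(mtcars[, c(1:7, 10, 11)], center = TRUE, scale = TRUE)
#sanity check: column means are ~0 and standard deviations are 1
round(colMeans(X), 10)
apply(X, 2, sd)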

Step 2: Computing the covariance matrix

Once we have obtained the standardized matrix, we can get the covariance matrix using the technique below:

Covariance Matrix

        F1             F2             F3
F1      var(F1)        cov(F1, F2)    cov(F1, F3)
F2      cov(F2, F1)    var(F2)        cov(F2, F3)
F3      cov(F3, F1)    cov(F3, F2)    var(F3)

Note that cov(Fi, Fi) = var(Fi), so the diagonal holds the variances of the features, and cov(Fi, Fj) = cov(Fj, Fi), so the matrix is symmetric.

Once we have obtained the covariance matrix, we calculate the eigenvectors.
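
Continuing the sketch, R's cov function computes this matrix directly from the standardized data X above; since the features are standardized, it coincides with the correlation matrix:

#covariance matrix of the standardized features
C <- cov(X)
#inspect a corner of the matrix, e.g., the first three features
round(C[1:3, 1:3], 2)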

Step 3: Identifying principal components by computing eigenvectors

Let A be a square matrix (in our case the covariance matrix), ν a vector, and λ a scalar that satisfies Aν = λν. Then λ, called the eigenvalue, is associated with the eigenvector ν of A.

Eigenvectors are non-zero vectors whose direction does not change when the linear transformation is applied; they are only scaled. Eigenvalues are the factors by which the corresponding eigenvectors are scaled.

To calculate the eigenvalues, we solve the characteristic equation:

\det(A - \lambda I) = 0

where,

A = Covariance matrix

λ = eigenvalue

I = Identity matrix

Solving this equation gives us the eigenvalues. Substituting each eigenvalue back into Aν = λν then yields the corresponding eigenvector.
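
In R, the built-in eigen function performs both steps at once; a minimal sketch using the covariance matrix C from the previous step:

#eigendecomposition of the covariance matrix
e <- eigen(C)
#eigenvalues, returned in decreasing order
e$values
#matching eigenvectors, one per column
e$vectors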

Step 4: Constructing the feature vector

This is done by sorting the eigenvectors by their eigenvalues in descending order. We then discard the vectors with the least significant eigenvalues and keep the top k eigenvectors, those with the highest eigenvalues.

k should be chosen as the smallest value for which the selected components still retain most of the total variance, for example, at least 95%.
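
A sketch of this selection in R, continuing from the decomposition above (the 95% threshold here is an illustrative choice, not a fixed rule):

#proportion of total variance explained by each component
var_explained <- e$values / sum(e$values)
#smallest k that retains at least 95% of the variance
k <- which(cumsum(var_explained) >= 0.95)[1]
#feature vector: the top k eigenvectors as columns
FV <- e$vectors[, 1:k]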

Step 5: Transforming the matrix

To transform the matrix, we will be using the following formula:

M = FM \times FV

where,

M = Transformed matrix

FM = Feature matrix (the standardized dataset from step 1)

FV = Feature vector
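
Completing the sketch, the projection is a single matrix multiplication in R; up to the sign of each component, the result should match the scores that prcomp stores in mtcars.pca$x in the example below:

#project the standardized data onto the selected components
M <- X %*% FV
head(M)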

Code example

We will be using the built-in dataset in R called mtcars. First, we load the devtools library so we can install the ggbiplot library for making our PCA plots. Once that is done, we run PCA on our data and plot the result to visualize the similarity of our data. Below is an example of running PCA in R.

#check if the required libraries are loaded
require(devtools)
require(ggbiplot)

#run PCA on the numeric columns of mtcars
mtcars.pca <- prcomp(mtcars[,c(1:7,10,11)], center = TRUE, scale. = TRUE)
str(mtcars.pca)

#plot the PCA result
ggbiplot(mtcars.pca, labels=rownames(mtcars))

Looking at the plot, we can see three cars clustered at the top. This shows their similarity, and we can verify it: all three are sports cars, so the analysis makes sense. In our output window, we can also see how our principal components have been selected and what our scales are.

Code explanation

Lines 2 and 3: We check that the required libraries are available for PCA. If a package is missing, require issues a warning and the later calls will fail, so you will need to install these packages first (ggbiplot is typically installed from GitHub using devtools).

Line 6: We use the built-in prcomp function for PCA, with center and scale. set to TRUE so that the data is standardized first (step 1 above).

Line 7: We then use the str function to inspect the structure of our analysis. This can be seen in the output window.

Line 10: Finally, we use the ggbiplot function to plot the PCA graph.

Improved PCA

For better visualization, we can group our data by country of origin. This will help us see which features cars from each country prioritize.

#creating clusters by country of origin
mtcars.country <- c(rep("Japan", 3), rep("US",4), rep("Europe", 7),rep("US",3), "Europe", rep("Japan", 3), rep("US",4), rep("Europe", 3), "US", rep("Europe", 3))
#plotting PCA with country groups
ggbiplot(mtcars.pca,ellipse=TRUE, labels=rownames(mtcars), groups=mtcars.country)

We can see that US cars are more focused on hp, cyl, disp, and wt, whereas Japanese cars are focused on gear, drat, mpg, and qsec. European cars tend to fall somewhere between the two.

Code explanation

Line 2: We use the c and rep functions to create a vector of countries, one entry per car, to pass into the groups argument of our PCA plot.

Line 4: We use the ggbiplot function with the groups parameter to label the cars by their origin, and ellipse=TRUE to draw an ellipse around each group, so we can differentiate between cars belonging to different countries.
