Principal Component Analysis (PCA) is a fundamental method for dimensionality reduction: it reduces the number of variables in a large dataset to a smaller set without losing much information. Our goal is to simplify the dataset as much as possible while trading off the least amount of information, so that accuracy stays high. Reduced datasets are easier to analyze, evaluate, and visualize, which is why PCA is considered a core step in working with big data.
PCA uses an orthogonal transformation to convert possibly correlated variables into linearly uncorrelated variables, also known as principal components. PCA is a five-step procedure, as explained below.
We will be standardizing the dataset by using the following formula:

$$z = \frac{x - \mu}{\sigma}$$

where,

$x$ = data point value
$\mu$ = mean of the feature
$\sigma$ = standard deviation of the feature
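As a rough sketch, this step can be done in R with the built-in `scale` function; here we use the mtcars dataset (introduced later in this article) purely as an example:

```r
# standardize each column: subtract its mean, divide by its standard deviation
X <- scale(mtcars[, c(1:7, 10, 11)], center = TRUE, scale = TRUE)
round(head(X), 2)  # every column now has mean 0 and standard deviation 1
```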
Once we have obtained the standardized matrix, we can get the covariance matrix using the technique below:
|        | F1                    | F2          | F3          |
| ------ | --------------------- | ----------- | ----------- |
| **F1** | cov(F1, F1) = var(F1) | cov(F1, F2) | cov(F1, F3) |
| **F2** | cov(F2, F1)           | var(F2)     | cov(F2, F3) |
| **F3** | cov(F3, F1)           | cov(F3, F2) | var(F3)     |
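Continuing the sketch above, R's `cov` function computes this matrix directly. Note that because the data is already standardized, the covariance matrix here equals the correlation matrix:

```r
# covariance matrix of the standardized data
C <- cov(X)
round(C[1:3, 1:3], 2)  # e.g., the block for the first three features
```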
Once we have obtained the covariance matrix, we calculate its eigenvalues and eigenvectors.
Let A be a square matrix (in our case the covariance matrix), ν a vector, and λ a scalar that satisfies Aν = λν. Then λ, called the eigenvalue, is associated with the eigenvector ν of A.
Eigenvectors are non-zero vectors that do not change direction when a linear transformation is applied; they are only scaled. Eigenvalues are the factors by which the eigenvectors are scaled.
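For example, for $A = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}$ and $\nu = (1, 0)^T$, we have $A\nu = 2\nu$, so $\lambda = 2$ is an eigenvalue of $A$ with eigenvector $\nu$.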
To calculate an eigenvector, we first solve the following characteristic equation:

$$\det(A - \lambda I) = 0$$

where,

$A$ = covariance matrix
$\lambda$ = eigenvalue
$I$ = identity matrix
Once we solve the equation, we will obtain multiple eigenvalues. Substituting each eigenvalue $\lambda$ back into $(A - \lambda I)\nu = 0$ and solving gives its corresponding eigenvector.
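In practice, we rarely solve this by hand. As a sketch, R's `eigen` function returns both the eigenvalues and the eigenvectors at once (continuing with the covariance matrix `C` from above):

```r
# eigendecomposition of the covariance matrix
e <- eigen(C)
e$values   # eigenvalues, returned in decreasing order
e$vectors  # corresponding eigenvectors, one per column
```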
Next, we sort the eigenvectors by their eigenvalues in decreasing order, discard the vectors with the least significant eigenvalues, and select the top k eigenvectors with the highest eigenvalues.
k should be chosen as the smallest value for which at least 99% of the variance is retained, i.e., at most 1% of the variance is lost.
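Since each eigenvalue measures the variance captured by its component, one illustrative way to pick k is from the cumulative proportion of variance explained. The 99% threshold below is just an example; other cutoffs are common:

```r
# proportion of total variance explained by each component
explained <- e$values / sum(e$values)
# smallest k that retains at least 99% of the variance
k <- which(cumsum(explained) >= 0.99)[1]
k
```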
To transform the matrix, we will be using the following formula:

$$M = FM \times FV$$

where,

$M$ = transformed matrix
$FM$ = feature matrix (the standardized original dataset)
$FV$ = feature vector (the matrix whose columns are the selected top k eigenvectors)
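Continuing the sketch, this projection is a single matrix multiplication in R. Up to the sign of each eigenvector, the result matches what the `prcomp` call in the example below computes:

```r
FV <- e$vectors[, 1:k, drop = FALSE]  # feature vector: top k eigenvectors as columns
M  <- X %*% FV                        # transformed matrix: one row per car, k columns
dim(M)
```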
We will be using the built-in dataset in R called `mtcars`. First, we import the `devtools` library so we can install the `ggbiplot` library for making our PCA plots. Once that is done, we run PCA on our data and plot the result to visualize the similarity of our data. Below, we have an example of running PCA in R.
```r
#check if required libraries are installed
require(devtools)
require(ggbiplot)

#pca analysis command for running PCA
mtcars.pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale. = TRUE)
str(mtcars.pca)

#plotting PCA
ggbiplot(mtcars.pca, labels = rownames(mtcars))
```
Looking at the plot, we can see how three cars are clustered at the top. This shows their similarity, and we can verify it: all three are sports cars, so the analysis makes sense. In our output window, we can also see how our principal components have been selected and what our scales are.
Lines 2 and 3: We check if the required libraries are installed for PCA. If not, the code will run into an error and you will need to install these packages.

Line 6: We use the built-in `prcomp` function for PCA.

Line 7: We then use the `str` function to inspect our analysis. This can be seen in the output window.

Line 10: Finally, we use the `ggbiplot` function to plot the PCA graph.
For better visualization, we can cluster our data by country of origin. This will help us see which features cars from each country prioritize.
```r
#creating a cluster for countries
mtcars.country <- c(rep("Japan", 3), rep("US", 4), rep("Europe", 7), rep("US", 3), "Europe", rep("Japan", 3), rep("US", 4), rep("Europe", 3), "US", rep("Europe", 3))
#plotting pca for visualization
ggbiplot(mtcars.pca, ellipse = TRUE, labels = rownames(mtcars), groups = mtcars.country)
```
We can see that US cars are more focused on `hp`, `cyl`, `disp`, and `wt`, whereas Japanese cars are focused on `gear`, `drat`, `mpg`, and `qsec`. European cars tend to find a middle point between the two.
Line 2: We use the `c` and `rep` functions to create a vector of countries for each car, which we pass into the `groups` part of our PCA plot.

Line 4: We use the `ggbiplot` function with the `groups` parameter to label the cars by their origin so we can differentiate between cars belonging to different countries.