PCA Implementation Steps: 4 to 6
We will continue to steps 4-6 of the principal component analysis.
4) Scale data
Next, you will import the Scikit-learn class StandardScaler, which standardizes features by removing the mean and scaling to unit variance, so that each variable ends up with a mean of zero and a standard deviation of one. The mean and standard deviation of each feature are stored when the scaler is fitted and are used later by the transform method, which returns the data with the standardized values.
After importing StandardScaler, you can assign it to a new variable, fit it to the features contained in the data frame, and store the transformed values under a new variable name.
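The import, fit, and transform sequence described above can be sketched as follows. This is a minimal illustration, not the lesson's own code: the data frame `df`, its columns, and the variable names `scaler` and `scaled_features` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the lesson's data frame (assumed values)
df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38],
    "Income": [40000, 52000, 88000, 61000, 75000],
})

scaler = StandardScaler()                # assign the scaler to a variable
scaler.fit(df)                           # learn each column's mean and standard deviation
scaled_features = scaler.transform(df)   # apply the stored statistics; returns a NumPy array

print(scaler.mean_)                      # per-column means stored by fit
print(scaler.scale_)                     # per-column standard deviations stored by fit
print(scaled_features.mean(axis=0))      # approximately zero for every column
print(scaled_features.std(axis=0))       # approximately one for every column
```

Note that transform returns a NumPy array by default, so the column names would need to be reattached if a data frame is wanted afterwards.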
StandardScaler is often used in conjunction with PCA and other algorithms, including k-nearest neighbors and support vector machines, to rescale and standardize data features. After standardization, each feature has a mean of zero and a standard deviation of one, as a standard normal distribution does. Note that standardization rescales the values but does not change the shape of their distribution.
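One common way to combine the two is to chain StandardScaler and PCA in a Scikit-learn pipeline, so that scaling is always applied before the components are computed. The sketch below assumes synthetic data and a hypothetical choice of two components; it is one possible arrangement rather than the approach this lesson prescribes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): three features on very different scales
rng = np.random.default_rng(42)
X = rng.normal(loc=[50, 5, 0.5], scale=[20, 2, 0.1], size=(100, 3))

# Scale first, then reduce to two principal components
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
components = pipeline.fit_transform(X)

print(components.shape)  # (100, 2)
```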
Without standardization, PCA is likely to lock onto whichever features have the largest variance, and the unit of measurement can exaggerate this effect. Notice, for example, that the variance of Age changes dramatically when it is measured in days rather than in years. If left unchecked, this difference in units might mislead the selection of components, which is based on maximizing variance. StandardScaler helps to avoid this problem by rescaling and standardizing variables.
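The effect of units can be checked with a small experiment. The sketch below uses synthetic data: the Age values and the second feature (`daily_hours`) are assumptions made for illustration only. It compares the variance of Age in years and in days, fits PCA to the unscaled data, and then confirms that standardization removes the difference between the two unit choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): Age in years plus a hypothetical second feature
rng = np.random.default_rng(0)
age_years = rng.uniform(18, 70, size=200)
daily_hours = rng.uniform(0, 3, size=200)

X_years = np.column_stack([age_years, daily_hours])
X_days = np.column_stack([age_years * 365, daily_hours])  # same ages, expressed in days

# Changing the unit multiplies the variance by 365 squared
var_years = X_years[:, 0].var()
var_days = X_days[:, 0].var()
print(var_days / var_years)

# Without scaling, the first component is dominated by Age,
# and even more so when Age is measured in days
ratio_years = PCA(n_components=2).fit(X_years).explained_variance_ratio_
ratio_days = PCA(n_components=2).fit(X_days).explained_variance_ratio_
print(ratio_years, ratio_days)

# After standardization, both unit choices yield the same values
scaled_years = StandardScaler().fit_transform(X_years)
scaled_days = StandardScaler().fit_transform(X_days)
print(np.allclose(scaled_years, scaled_days))
```

Because the standardized values are identical for both versions of the data, any PCA fitted afterwards no longer depends on whether Age was recorded in days or in years.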
Conversely, standardization might not be necessary for PCA if the scale of the variables is relevant to your analysis or is already consistent across variables. Further information regarding StandardScaler can be found in the Scikit-learn documentation.