PCA Implementation Steps: 4 to 6

We now continue with steps 4 to 6 of principal component analysis.

4) Scale data

Next, you will import the Scikit-learn class StandardScaler, which standardizes features by subtracting each variable's mean (so that every variable is centered at zero) and scaling to unit variance. The mean and standard deviation of each feature are computed and stored when the scaler is fitted, and are then used by the transform method, which returns the standardized values (as a NumPy array by default, not a data frame).

After importing StandardScaler, you can create an instance and assign it to a variable, fit it to the features contained in the data frame, and store the transformed values under a new variable name.
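The import, fit, and transform steps described above might look like the following minimal sketch. The data frame `df` and its columns `Age` and `Income` are hypothetical stand-ins for the lesson's dataset, and the variable names `scaler` and `scaled_data` are chosen only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data frame standing in for the lesson's dataset
df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 62],
    "Income": [30000, 42000, 58000, 61000, 75000],
})

# Create the scaler and fit it: this computes and stores the
# per-feature mean (mean_) and standard deviation (scale_)
scaler = StandardScaler()
scaler.fit(df)

# Apply the stored statistics to rescale the features
scaled_data = scaler.transform(df)

print(scaler.mean_)    # per-feature means learned during fit
print(scaler.scale_)   # per-feature standard deviations learned during fit

# transform returns a NumPy array by default; wrap it if a data frame is needed
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())
```

The two calls can also be combined into one with `scaler.fit_transform(df)`. Keeping them separate makes it clearer that the statistics learned in `fit` are what `transform` reuses, for example on new data.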

StandardScaler is often used in conjunction with PCA and other algorithms, including k-nearest neighbors and support vector machines, to rescale and standardize data features. After standardization, each feature has a mean of zero and a standard deviation of one, like a standard normal distribution. Note that standardization only shifts and rescales the data; it does not change the shape of a feature's distribution.
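One common way to use StandardScaler together with PCA is to chain them in a Scikit-learn pipeline. The sketch below assumes a hypothetical feature matrix `X` with three synthetic columns on very different scales; the choice of two components is arbitrary and only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales
X = np.column_stack([
    rng.normal(40, 12, size=200),        # e.g. age in years
    rng.normal(50000, 15000, size=200),  # e.g. income
    rng.normal(3, 1, size=200),          # e.g. household size
])

# Standardize first, then project onto two principal components
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
components = pipeline.fit_transform(X)

print(components.shape)  # (200, 2)
```

A pipeline keeps the scaling and the projection together, so the same fitted scaler is applied automatically whenever new data is passed through `pipeline.transform`.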

PCA selects components by maximizing variance, so without standardization it is likely to lock onto whichever features have the largest numeric variance, regardless of how informative they are. The unit of measurement alone can exaggerate this effect. Notice that the variance of Age changes dramatically when measured in days rather than in years: multiplying the values by 365 multiplies the variance by 365². If left unchecked, this type of formatting can mislead the selection of components. StandardScaler helps to avoid this problem by rescaling and standardizing the variables.
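The effect of units on PCA can be checked with a small experiment. The sketch below uses synthetic, hypothetical data (`age_years` and `height_cm` are invented for illustration) and compares the first principal component with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical, synthetic features
age_years = rng.uniform(20, 70, size=500)
height_cm = rng.normal(170, 10, size=500)

# The same ages expressed in days instead of years
age_days = age_years * 365

var_years = age_years.var()
var_days = age_days.var()
print(var_days / var_years)  # 365**2 = 133225, up to floating-point error

# PCA on unscaled data: the first component is dominated by age_days
X = np.column_stack([age_days, height_cm])
raw_pca = PCA(n_components=1).fit(X)
print(np.abs(raw_pca.components_[0]))  # almost all weight on the first feature

# PCA after standardization: both features contribute equally
X_scaled = StandardScaler().fit_transform(X)
scaled_pca = PCA(n_components=1).fit(X_scaled)
print(np.abs(scaled_pca.components_[0]))  # about 0.707 for each feature
```

With only two standardized features, the first component always weights them equally in absolute value, which is why the final line prints roughly 0.707 twice. With more features the weights depend on the correlations, but no feature dominates purely because of its units.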

Conversely, standardization might not be necessary for PCA if the scale of the variables is itself relevant to your analysis, or if all variables are already measured on a consistent scale. Further information regarding StandardScaler can be found in the Scikit-learn documentation.
