In the world of machine learning, the performance of models heavily relies on the quality and preparation of the input data. Before feeding data into algorithms, it is essential to preprocess it appropriately to ensure accurate and reliable results. Two common techniques used for data preprocessing are scaling and normalization. While these terms are often used interchangeably, they have distinct purposes and methodologies.
In both cases, we alter the values of numeric variables so that they possess certain useful properties. However, the two techniques differ in how they transform the data:
In scaling, we modify the range of the data.
In normalization, we modify the shape of the distribution of the data.
Let's delve further into scaling and normalization to better understand each technique.
Scaling refers to the process of transforming data so that it falls within a specific range. The goal is to ensure that all features have similar scales so they can be compared on an equal footing. This technique is particularly useful for distance-based algorithms and for optimization methods that rely on gradient descent.
Maximum absolute scaling, also known as Max Abs Scaling, is a technique that scales numerical features by dividing each value by the maximum absolute value of the respective feature.
The formula for maximum absolute scaling is as follows:
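$$x_{\text{scaled}} = \frac{x}{\max(|x|)}$$

where $x$ is the original value and $\max(|x|)$ is the maximum absolute value of the feature.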
In Max Abs Scaling, the maximum absolute value of the feature is identified, and all the values in that feature are divided by this maximum value. This scaling technique ensures that the scaled values fall within the range [-1, 1]. The sign of the values is preserved, meaning positive and negative values retain their original polarity.
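As a quick illustration, here is a minimal NumPy sketch of maximum absolute scaling applied to a single feature (the sample values are made up for demonstration):

```python
import numpy as np

# Hypothetical feature with positive and negative values
x = np.array([-4.0, -1.0, 0.0, 2.0, 8.0])

# Divide every value by the maximum absolute value of the feature
x_scaled = x / np.max(np.abs(x))

print(x_scaled)  # [-0.5   -0.125  0.     0.25   1.   ]
```

scikit-learn provides the same transformation through `sklearn.preprocessing.MaxAbsScaler`.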
Min-max scaling is a popular technique used to rescale numerical features within a specific range. It transforms the feature values into a range, typically between 0 and 1.
The formula for min-max scaling is as follows:
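$$x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

where $\min(x)$ and $\max(x)$ are the minimum and maximum values of the feature.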
Min-max scaling ensures that the feature values are proportionally rescaled while preserving the original distribution shape. It is particularly useful when the absolute numerical values or the range of the features are essential for the analysis or modeling process.
This technique allows different features to be on a similar scale, avoiding bias or dominance caused by varying magnitudes. It can improve the performance of machine learning algorithms by ensuring that each feature contributes equally to the learning process.
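Below is a minimal NumPy sketch of min-max scaling on a hypothetical feature, rescaling it to the [0, 1] range:

```python
import numpy as np

# Hypothetical feature values on an arbitrary scale
x = np.array([10.0, 20.0, 25.0, 40.0, 50.0])

# Rescale to [0, 1]: (x - min) / (max - min)
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled)  # [0.    0.25  0.375 0.75  1.   ]
```

The equivalent transformation is available in scikit-learn as `sklearn.preprocessing.MinMaxScaler`, which also lets you choose a different target range through its `feature_range` parameter.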
Normalization refers to transforming the data to adhere to specific assumptions or requirements. It aims to reshape the data distribution while preserving the relationships between data points, bringing the numeric data onto a uniform scale without confining the values to a fixed range.
Normalization techniques aim to bring the data into a standardized format, making it easier to compare and analyze. In simpler terms, normalization reshapes the distribution by expressing each observation's distance from the mean in terms of the standard deviation.
Z-score, also known as a standard score, is a statistical metric used to determine the number of standard deviations a specific data point deviates from the mean of a distribution. It is a normalization technique used to standardize data and transform it to have a mean of 0 and a standard deviation of 1.
The formula for calculating the Z-Score of a data point is as follows:
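$$z = \frac{x - \mu}{\sigma}$$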
Where: $x$ is the value of the data point, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation of the feature.
The z-score measures the relative position of the data point within the distribution. A z-score of 0 indicates that the data point is exactly at the mean, while positive and negative z-scores indicate how many standard deviations the data point is above or below the mean, respectively.
By standardizing the data to have a mean of 0 and a standard deviation of 1, z-scores make it easier to interpret and compare data across different distributions or variables.
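Here is a minimal NumPy sketch of z-score standardization on a hypothetical feature:

```python
import numpy as np

# Hypothetical feature values
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# z = (x - mean) / standard deviation
z = (x - x.mean()) / x.std()

print(z)                    # [-1.414 -0.707  0.     0.707  1.414]
print(z.mean(), z.std())    # ~0.0 and 1.0 after standardization
```

In scikit-learn, `sklearn.preprocessing.StandardScaler` applies the same transformation column by column.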
Mean normalization, also known as feature centering, is a technique used to normalize data by subtracting the mean value of a feature from each data point.
The formula for mean normalization is as follows:
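$$x_{\text{normalized}} = x - \bar{x}$$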
Where: $x$ is the original value of the data point and $\bar{x}$ is the mean of the feature.
Mean normalization centers the data around zero. The resulting normalized values will have a mean of 0. It is useful in situations where the relative distances or deviations from the mean are more important than the absolute values and can help eliminate bias caused by the mean value.
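A minimal NumPy sketch of mean normalization (feature centering) on a hypothetical feature might look like this:

```python
import numpy as np

# Hypothetical feature values
x = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Center the feature by subtracting its mean
x_centered = x - x.mean()

print(x_centered)         # [-4. -2.  0.  2.  4.]
print(x_centered.mean())  # 0.0
```

In scikit-learn, the same centering can be obtained with `StandardScaler(with_std=False)`, which subtracts the mean without dividing by the standard deviation.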