Scaling the Dataset
Learn how scaling the dataset can have a meaningful impact on gradient descent.
Overview of learning rate results
The conclusion we drew from the results of the different learning rates was that gradient descent works best when all the curves are equally steep, because then the learning rate is closer to optimal for all of them!
Achieving equally steep curves
How do we then achieve equally steep curves? The short answer: you have to “correctly” scale your dataset. Let us now go into depth about how scaling your dataset helps to achieve equally steep curves.
“Bad” feature
First, let us take a look at a slightly modified example, which we will call the “bad” dataset:
- Here, we multiplied our feature (x) by 10, so it is now in the range [0, 10], and renamed it bad_x.
- Since we do not want the labels (y) to change, we also divided the original true_w parameter by 10 and renamed it bad_w. This way, both bad_w * bad_x and true_w * x yield the same results.
```python
true_b = 1
true_w = 2
N = 100

# Data generation
np.random.seed(42)

# We divide w by 10
bad_w = true_w / 10

# And multiply x by 10
bad_x = np.random.rand(N, 1) * 10

# So, the net effect on y is zero - it is still
# the same as before
y = true_b + bad_w * bad_x + (.1 * np.random.randn(N, 1))

# Displaying the bad_w parameter along with the bad_x values (first five)
print("bad_w: {} \n\nbad_x: {}".format(bad_w, bad_x[:5]))
```
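As a quick sanity check, we can verify that claim numerically. This is a minimal sketch (assuming NumPy and the same seeded data generation as above) showing that the rescaled parameter and feature reproduce the original product:

```python
import numpy as np

np.random.seed(42)
x = np.random.rand(100, 1)   # the original feature, in [0, 1)
true_w = 2
bad_w = true_w / 10          # w divided by 10
bad_x = x * 10               # feature multiplied by 10

# Both products should match (up to floating-point error),
# so the labels y are unaffected by the rescaling
print(np.allclose(bad_w * bad_x, true_w * x))  # prints True
```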
Then, we performed the same split as before for both the original and bad datasets, and plotted the training sets side by side, as seen below:
```python
# Generates train and validation sets
# It uses the same train_idx and val_idx as before,
# but it applies to bad_x
bad_x_train, y_train = bad_x[train_idx], y[train_idx]
bad_x_val, y_val = bad_x[val_idx], y[val_idx]

# Displaying the training and validation data (first five)
print("bad_x_train: {} \n\nbad_x_val: {}".format(bad_x_train[:5], bad_x_val[:5]))
```
The following figure shows the difference between the original training dataset and the bad training dataset:
The only difference between the two plots is the scale of feature x. Its range was [0, 1] before, but now it is [0, 10]. The label y has not changed, and we also did not touch true_b.
Does this simple scaling have any meaningful impact on our gradient descent? Well, if it did not, we would not be asking the question, right?
Let us compute a new loss surface, and compare it to the one we had before:
Looking at the contour values in the figure above, the dark blue line was at 4.0 before, and now it is at 50.0. For the same range of parameter values, the loss values are much bigger.
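We can reproduce this effect numerically. The sketch below (assuming the same synthetic data as in the snippets above) evaluates the MSE loss at one and the same candidate parameter pair against both feature scales:

```python
import numpy as np

np.random.seed(42)
N = 100
true_b, true_w = 1, 2
x = np.random.rand(N, 1)
bad_x = x * 10
y = true_b + true_w * x + (.1 * np.random.randn(N, 1))

def mse(b, w, feature):
    # Mean squared error of the linear model b + w * feature
    return ((b + w * feature - y) ** 2).mean()

# Same candidate parameters (b=0, w=1), two feature scales
print(mse(0., 1., x))      # small loss
print(mse(0., 1., bad_x))  # much bigger loss
```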
Let us look at the cross-sections before and after we multiplied feature x by 10:
What happened here? The red curve became much steeper (bigger gradient), and thus we must use a smaller learning rate to safely descend along it.
More importantly, the difference in steepness between the red and the black curves increased.
This is exactly what we need to avoid!
Do you remember why?
Because the size of the learning rate is limited by the steepest curve!
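The steepness claim can be checked directly: for the MSE loss, the gradient with respect to w is proportional to the feature values, so scaling the feature up scales the gradient up too. A minimal sketch under the same assumptions as above, comparing the gradient at the same distance from each dataset's own minimum:

```python
import numpy as np

np.random.seed(42)
N = 100
true_b, true_w = 1, 2
x = np.random.rand(N, 1)
bad_w, bad_x = true_w / 10, x * 10
y = true_b + true_w * x + (.1 * np.random.randn(N, 1))

def grad_w(b, w, feature):
    # Gradient of the MSE loss with respect to w
    return (2 * feature * (b + w * feature - y)).mean()

# Evaluate the gradient the same distance (1.0) away from each minimum
g_original = grad_w(true_b, true_w - 1.0, x)
g_bad = grad_w(true_b, bad_w - 1.0, bad_x)

# Roughly 100x steeper for the 10x-scaled feature
print(abs(g_bad) / abs(g_original))
```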
How can we fix it? Well, we ruined it by scaling the feature 10x bigger; perhaps we can fix it by scaling it in a different way.
Scaling / standardizing / normalizing
Different how? There is this beautiful thing called the StandardScaler, which transforms a feature in such a way that it ends up with zero mean and unit standard deviation.
How does it achieve that? First, it computes the mean and the standard deviation of a given feature (x) using the training set (N points):
...
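In code, the standardization step looks like the sketch below. The 80/20 split index used here is only an illustration (it is not the train_idx/val_idx split from the earlier snippets); the key point is that the mean and standard deviation come from the training portion alone, which is what scikit-learn's StandardScaler does with fit (on the training set) followed by transform (on both sets):

```python
import numpy as np

np.random.seed(42)
bad_x = np.random.rand(100, 1) * 10

# Illustrative 80/20 split: statistics are computed on the
# training portion only, never on the validation portion
train, val = bad_x[:80], bad_x[80:]
mu, sigma = train.mean(), train.std()

# Standardize both sets using the *training* statistics
scaled_train = (train - mu) / sigma
scaled_val = (val - mu) / sigma

# Training portion: mean ~0, std ~1 by construction
print(scaled_train.mean(), scaled_train.std())
# Validation portion: close to, but not exactly, 0 and 1
print(scaled_val.mean(), scaled_val.std())
```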