What is gradient clipping?

Gradient clipping and the need for it

When we train a model, we iterate over the training samples, make predictions for them, and estimate the error between the predicted label and the true label. We then update the weights using the gradient of the error with respect to the weights. In deep models, calculating this gradient usually involves multiplying a large number of terms together, and two problems arise from this approach.
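For instance, a single weight update with plain gradient descent can be sketched as follows; the numbers are made up purely for illustration.

# One gradient-descent update for a single weight (illustrative values)
learning_rate = 0.1
weight = 0.5
gradient = 0.8      # dError/dWeight obtained from backpropagation
weight = weight - learning_rate * gradient
print(weight)       # roughly 0.42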

Exploding gradient

Suppose we have two vectors, both of which have all of their values greater than 1. When we multiply them element-wise, each element of the product will also be greater than 1. If we multiply the result with another vector whose values are all greater than 1, the new resultant vector will again have all of its values greater than 1. If we repeat this operation many times, we eventually reach a point where the resultant vector contains extremely large values. This problem is called the exploding gradient problem.

# Start from a copy of vector1 and repeatedly multiply it by vector2
vector1 = [1.3, 2.1, 1.73, 0.42, 1.25]
vector2 = [1.26, 1.35, 2.58, 2.81, 1.32]
resultant = vector1.copy()
for _ in range(5):
    for i in range(len(vector1)):
        resultant[i] = resultant[i] * vector2[i]
print("Final product of vector1*(vector2)^5")
print(resultant)

Vanishing gradient

Consider a case where we have two vectors having all values less than one. Once we start multiplying such vectors, the values of the resultant vector will start shrinking. This problem is called the vanishing gradient problem.

# Start from a copy of vector1 and repeatedly multiply it by vector2
vector1 = [0.3, 0.1, 0.73, 0.42, 0.25]
vector2 = [0.26, 0.35, 0.58, 0.81, 0.32]
resultant = vector1.copy()
for _ in range(5):
    for i in range(len(vector1)):
        resultant[i] = resultant[i] * vector2[i]
print("Final product of vector1*(vector2)^5")
print(resultant)

Gradient clipping

Every time we multiply two vectors, we check whether the norm of the resultant vector exceeds a threshold parameter. If it does, we rescale the resultant by dividing it by its norm and multiplying by the threshold. This prevents the resultant from exploding in the next multiplication and keeps the training process stable. This technique mostly solves the exploding gradient problem.

A simple visualization of gradient clipping

Logically:

if ||resultant|| > threshold:
    resultant = threshold * resultant / ||resultant||

Here, ||resultant|| represents the norm of the vector, which can be the L1 norm, the L2 norm, or any other norm.
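The same idea can be written as a small, self-contained sketch. The snippet below uses NumPy and the L2 norm; clip_by_norm is an illustrative helper written for this answer, not a library function.

import numpy as np

def clip_by_norm(gradient, threshold):
    # Rescale the gradient so that its L2 norm never exceeds the threshold
    norm = np.linalg.norm(gradient)
    if norm > threshold:
        gradient = gradient * (threshold / norm)
    return gradient

gradient = np.array([3.0, 4.0])       # L2 norm is 5.0
print(clip_by_norm(gradient, 1.0))    # [0.6 0.8], the norm is now 1.0

Popular deep learning frameworks provide built-in functions for this operation.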

### TensorFlow syntax ###
tf.clip_by_global_norm(
    t_list, clip_norm, use_norm=None, name=None
)

### PyTorch syntax ###
torch.nn.utils.clip_grad_norm_(
    parameters, max_norm, norm_type=2.0,
    error_if_nonfinite=False
)
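As a rough usage sketch, the PyTorch function is typically called between the backward pass and the optimizer step. The model, data, and hyperparameters below are placeholders chosen only for illustration.

import torch

model = torch.nn.Linear(10, 1)                             # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)             # placeholder batch

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()                                                     # compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)    # clip gradients in place
optimizer.step()                                                    # update with the clipped gradients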
