When we train models, we iterate over the training samples, make predictions for them, and estimate the error between the predicted label and the real label. Next, we update the weights using the gradient of the error with respect to the weights. In deep models, calculating this gradient usually involves multiplying a long chain of terms, and two problems arise from this.
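For intuition, here is a minimal sketch of such a training loop for a single-weight model fit by plain gradient descent. The data, learning rate, and squared-error loss are illustrative assumptions, not taken from the original.

# Minimal sketch of the training loop described above (illustrative values).
# Model: prediction = w * x, loss = (prediction - y)^2, updated by gradient descent.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, real label) pairs, assumed for this example
w = 0.0
learning_rate = 0.1

for epoch in range(20):
    for x, y in samples:
        prediction = w * x
        error = prediction - y              # difference between predicted and real label
        gradient = 2 * error * x            # d(loss)/dw for the squared error
        w = w - learning_rate * gradient    # update the weight using the gradient

print("Learned weight:", w)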
Suppose we have two vectors whose values are greater than one. Once we start multiplying such vectors repeatedly, the values of the resultant vector keep growing larger and larger. This problem is called the exploding gradient problem.
vector1 = [1.3, 2.1, 1.73, 0.42, 1.25]
vector2 = [1.26, 1.35, 2.58, 2.81, 1.32]

resultant = list(vector1)  # copy so vector1 is not modified in place
for _ in range(5):
    for i in range(len(resultant)):
        resultant[i] = resultant[i] * vector2[i]

print("Final product of vector1*(vector2)^5")
print(resultant)
Now consider the opposite case, where both vectors have all of their values less than one. Once we start multiplying such vectors, the values of the resultant vector keep shrinking toward zero. This problem is called the vanishing gradient problem.
vector1 = [0.3, 0.1, 0.73, 0.42, 0.25]
vector2 = [0.26, 0.35, 0.58, 0.81, 0.32]

resultant = list(vector1)  # copy so vector1 is not modified in place
for _ in range(5):
    for i in range(len(resultant)):
        resultant[i] = resultant[i] * vector2[i]

print("Final product of vector1*(vector2)^5")
print(resultant)
Every time we compute the gradient, we check whether its norm exceeds a threshold parameter; if it does, we rescale the gradient so that its norm equals the threshold. This prevents the gradient from exploding in the next update and keeps training stable. This technique, gradient clipping by norm, mostly solves the problem of exploding gradients.
Logically:
if ||gradient|| > threshold:
    gradient = gradient * threshold / ||gradient||

where ||gradient|| denotes the norm of the gradient vector, which can be the L1 norm, the L2 norm, or any other norm.
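To make the rule concrete, here is the same clipping logic written out in plain Python for a gradient stored as a list, using the L2 norm. The function name, threshold, and example values are assumptions made for this sketch.

import math

def clip_by_norm(gradient, threshold):
    # L2 norm of the gradient vector
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm > threshold:
        # Rescale so the clipped gradient has norm equal to the threshold
        gradient = [g * threshold / norm for g in gradient]
    return gradient

gradient = [4.1, 33.7, 18.2, 0.4, 27.5]   # an exploding gradient (illustrative values)
clipped = clip_by_norm(gradient, threshold=5.0)
print("Clipped gradient:", clipped)
print("Clipped norm:", math.sqrt(sum(g * g for g in clipped)))  # equals the threshold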
### Tensorflow syntax ###
tf.clip_by_global_norm(t_list, clip_norm, use_norm=None, name=None)

### Pytorch syntax ###
torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=2.0, error_if_nonfinite=False)
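For context, a clipping call of this kind typically sits between the backward pass and the optimizer step. Below is a sketch of that placement in PyTorch; the model, data, and max_norm value are assumptions chosen for illustration.

import torch
import torch.nn as nn

# Illustrative model, data, and hyperparameters (assumed for this sketch)
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()                                                    # compute gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before the update
    optimizer.step()                                                   # apply the (clipped) gradients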