Transformer Architecture: Residuals and Normalization
Learn about residual connections and normalization in the transformer architecture.
Another important characteristic of transformer models is the presence of residual connections and normalization layers between the individual layers of the model.
Residual connections
Residual connections are formed by adding a given layer's input to the output of one or more layers ahead. This forms shortcut connections through the model and provides a stronger gradient flow by reducing the chance of the phenomenon known as vanishing gradients. The vanishing gradients problem causes the gradients in the layers closest to the inputs to become very small, so training in those layers is hindered. Residual connections for deep learning models were popularized by the paper Deep Residual Learning for Image Recognition.
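To make this concrete, here is a minimal sketch of a transformer sublayer wrapped in a residual connection followed by layer normalization (the post-layer-normalization arrangement used in the original transformer). It assumes PyTorch; the `ResidualBlock` name and the dimensions in the usage example are illustrative, not part of any particular library.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Wraps a sublayer with a shortcut (residual) connection and
    layer normalization, as in each transformer sublayer."""

    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The input x is added back to the sublayer's output, forming the
        # shortcut path; the sum is then normalized (post-LN variant).
        return self.norm(x + self.sublayer(x))


# Illustrative usage: wrap a feed-forward sublayer for 512-dim embeddings.
d_model = 512
ffn = nn.Sequential(
    nn.Linear(d_model, 2048),
    nn.ReLU(),
    nn.Linear(2048, d_model),
)
block = ResidualBlock(ffn, d_model)
out = block(torch.randn(8, 16, d_model))  # shape: (batch, sequence, d_model)
```

Because the shortcut path carries the input through unchanged, gradients can flow directly back to earlier layers even when the sublayer's own gradients are small, which is what mitigates vanishing gradients in deep stacks.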