The Transformer's Encoder

Formulate the encoder of a transformer by combining all the building blocks.

Even though the multi-head self-attention block could be a stand-alone building block, the creators of the transformer follow it with a stack of two linear layers with an activation in between. The output of that stack is added to its input through another skip connection and then normalized again.

Add linear layers to form the encoder

Suppose x is the output of the multi-head self-attention. What we depict as "linear" in the diagram looks something like this:

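import torch
import torch.nn as nn

dim = 512
dim_linear_block = 1024  # usually a multiple of dim
dropout = 0.1

norm = nn.LayerNorm(dim)
linear = nn.Sequential(
    nn.Linear(dim, dim_linear_block),  # project up to the wider hidden size
    nn.ReLU(),
    nn.Dropout(dropout),
    nn.Linear(dim_linear_block, dim),  # project back to the model dimension
    nn.Dropout(dropout)
)

# Placeholder for the multi-head self-attention output: (batch, tokens, dim)
x = torch.rand(2, 10, dim)

# Skip connection followed by layer normalization
out = norm(linear(x) + x)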

Dropout helps avoid overfitting. Despite the name, this block is not exactly a linear model, because it contains a nonlinear activation between the two linear layers. As we saw in the second chapter, it can be called a feedforward neural network or an MLP (multi-layer perceptron). The code illustrates that it is nothing new.

The idea of the linear layers after multi-head self-attention is to project the representation into a higher-dimensional space and then back into the original space. This helps solve some stability issues and counter bad initializations.
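To make the "up and back" projection concrete, here is a minimal sketch that traces the tensor shapes through the two linear layers. The input tensor and the names `expand` and `contract` are illustrative assumptions, not part of the lesson's code.

```python
import torch
import torch.nn as nn

dim, dim_linear_block = 512, 1024

# Hypothetical input: 2 sequences of 10 tokens, each of size dim
x = torch.rand(2, 10, dim)

expand = nn.Linear(dim, dim_linear_block)    # project to the higher-dimensional space
contract = nn.Linear(dim_linear_block, dim)  # project back to the original space

hidden = torch.relu(expand(x))
out = contract(hidden)

print(hidden.shape)  # torch.Size([2, 10, 1024])
print(out.shape)     # torch.Size([2, 10, 512])
```

Because the output has the same shape as the input, it can be added to x through the skip connection without any extra projection.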

Finally, this is the transformer’s encoder:
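The building blocks can be combined roughly as in the sketch below. It is one possible arrangement under stated assumptions, not necessarily the course's own implementation: it uses PyTorch's built-in `nn.MultiheadAttention` for the attention step, and the class names `TransformerEncoderBlock` and `TransformerEncoder`, along with the defaults of 8 heads and 6 blocks, are chosen for illustration.

```python
import torch
import torch.nn as nn


class TransformerEncoderBlock(nn.Module):
    """One encoder block: self-attention and an MLP, each with skip + LayerNorm."""

    def __init__(self, dim=512, heads=8, dim_linear_block=1024, dropout=0.1):
        super().__init__()
        # Assumption: PyTorch's built-in attention stands in for a custom module
        self.mhsa = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.norm_1 = nn.LayerNorm(dim)
        self.norm_2 = nn.LayerNorm(dim)
        self.linear = nn.Sequential(
            nn.Linear(dim, dim_linear_block),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim_linear_block, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Self-attention: queries, keys, and values all come from x
        attn_out, _ = self.mhsa(x, x, x, need_weights=False)
        y = self.norm_1(self.drop(attn_out) + x)  # first skip connection + norm
        return self.norm_2(self.linear(y) + y)    # second skip connection + norm


class TransformerEncoder(nn.Module):
    """A stack of identical encoder blocks applied one after another."""

    def __init__(self, dim=512, blocks=6, heads=8, dim_linear_block=1024, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList(
            [TransformerEncoderBlock(dim, heads, dim_linear_block, dropout) for _ in range(blocks)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


encoder = TransformerEncoder(dim=512, blocks=2)
tokens = torch.rand(2, 10, 512)  # hypothetical input: (batch, tokens, dim)
encoded = encoder(tokens)
print(encoded.shape)  # torch.Size([2, 10, 512])
```

Each block preserves the input shape, which is what allows the blocks to be stacked freely.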
