The Transformer's Encoder

We formulate the encoder of a transformer by combining all the building blocks developed so far.

Even though multi-head self-attention could serve as a stand-alone building block, the creators of the transformer added another stack of two linear layers with an activation in between, followed by a second skip connection and renormalization.

Add linear layers to form the encoder

Suppose x is the output of the multi-head self-attention. The component we depict as linear in the diagram looks something like this:

import torch
import torch.nn as nn

dim = 512
dim_linear_block = 1024  # usually a multiple of dim
dropout = 0.1

norm = nn.LayerNorm(dim)
linear = nn.Sequential(
    nn.Linear(dim, dim_linear_block),
    nn.ReLU(),
    nn.Dropout(dropout),
    nn.Linear(dim_linear_block, dim),
    nn.Dropout(dropout),
)

# x is the output of multi-head self-attention: (batch, tokens, dim)
x = torch.rand(2, 10, dim)
# second skip connection, followed by layer normalization
out = norm(linear(x) + x)
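
Putting it all together, here is a minimal sketch of a complete encoder layer. It uses PyTorch's built-in nn.MultiheadAttention as a stand-in for the multi-head self-attention block; the class name EncoderLayer and the hyperparameter defaults are illustrative choices, not part of the original code.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, dim=512, heads=8, dim_linear_block=1024, dropout=0.1):
        super().__init__()
        # stand-in for the multi-head self-attention block built earlier
        self.attention = nn.MultiheadAttention(dim, heads,
                                               dropout=dropout,
                                               batch_first=True)
        self.norm_1 = nn.LayerNorm(dim)
        self.norm_2 = nn.LayerNorm(dim)
        self.linear = nn.Sequential(
            nn.Linear(dim, dim_linear_block),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim_linear_block, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # self-attention: queries, keys, and values all come from x
        attn_out, _ = self.attention(x, x, x)
        # first skip connection and normalization
        x = self.norm_1(attn_out + x)
        # linear block, second skip connection, and normalization
        return self.norm_2(self.linear(x) + x)

tokens = torch.rand(2, 10, 512)  # (batch, tokens, dim)
print(EncoderLayer()(tokens).shape)  # torch.Size([2, 10, 512])

In the original transformer, the full encoder is formed by stacking six identical layers like this one.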