The Transformer's Encoder
Formulate the encoder of a transformer by combining all the building blocks.
Even though this could be a stand-alone building block, the creators of the transformer added another stack of two linear layers with an activation in between, wrapped in another skip connection and followed by layer normalization.
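In equation form, this matches the position-wise feed-forward sub-block of the original paper, where $W_1, b_1, W_2, b_2$ are the parameters of the two linear layers and $x$ is the sub-block's input (dropout appears in the implementation below but is omitted here for clarity):

$$\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2, \qquad \text{out} = \text{LayerNorm}\big(x + \text{FFN}(x)\big)$$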
Add linear layers to form the encoder
Suppose $x$ is the output of the multi-head self-attention. What we depict as linear in the diagram will look something like this:
import torch
import torch.nn as nn
dim = 512
dim_linear_block = 1024 ## usually a multiple of dim
dropout = 0.1
norm = nn.LayerNorm(dim)
linear = nn.Sequential(
    nn.Linear(dim, dim_linear_block),
    nn.ReLU(),
    nn.Dropout(dropout),
    nn.Linear(dim_linear_block, dim),  # second linear layer: project back to dim
    nn.Dropout(dropout),
)

x = torch.rand(2, 10, dim)  # stand-in for the multi-head self-attention output
out = norm(linear(x) + x)   # skip connection, then layer normalization
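Putting the pieces together, here is a minimal sketch of one full encoder layer. It is only a sketch under stated assumptions: PyTorch's built-in nn.MultiheadAttention stands in for the multi-head self-attention block we built earlier, and the hyperparameters mirror the snippet above.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention, then the linear
    sub-block above, each wrapped in a skip connection + LayerNorm."""

    def __init__(self, dim=512, heads=8, dim_linear_block=1024, dropout=0.1):
        super().__init__()
        # built-in attention as a stand-in for our own multi-head block
        self.attention = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                               batch_first=True)
        self.norm_1 = nn.LayerNorm(dim)
        self.norm_2 = nn.LayerNorm(dim)
        self.linear = nn.Sequential(
            nn.Linear(dim, dim_linear_block),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim_linear_block, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # sub-block 1: self-attention, skip connection, normalization
        y, _ = self.attention(x, x, x)
        x = self.norm_1(y + x)
        # sub-block 2: linear layers, skip connection, normalization
        return self.norm_2(self.linear(x) + x)

tokens = torch.rand(2, 10, 512)        # [batch, tokens, dim]
print(EncoderLayer()(tokens).shape)    # torch.Size([2, 10, 512])

The full encoder of the original transformer simply stacks several of these identical layers (six in the original paper), feeding the output of one layer as the input of the next.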