Transformer Architecture: Embedding Layers
Learn about the embedding layers in the transformer.
Word embeddings provide a semantic-preserving representation of words based on the context in which words are used. In other words, if two words are used in the same context, they will have similar word vectors. For example, the words “cat” and “dog” will have similar representations, whereas “cat” and “volcano” will have vastly different representations.
Word vectors were initially popularized by the word2vec paper, "Efficient Estimation of Word Representations in Vector Space" (Mikolov et al., 2013).
General approach for word embeddings
Motivated by the original word vector algorithms, modern deep learning models use embedding layers to represent words and tokens. The following general approach (along with pretraining later to fine-tune these embeddings) is taken to incorporate word embeddings into a machine learning model:
Define a randomly initialized word embedding matrix (or pretrained embeddings, available to download for free).
Define the model (randomly initialized) that uses word embeddings as the inputs and produces an output (for example, sentiment or a language translation).
Train the whole model (embeddings and the model) end to end on the task.
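The three steps above can be sketched end to end with a toy sentiment task. Everything here (the vocabulary, data, and hyperparameters) is illustrative, not part of the lesson:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"good": 0, "great": 1, "bad": 2, "awful": 3, "movie": 4}
V, d = len(vocab), 8

# Step 1: randomly initialized word embedding matrix (one row per word).
E = rng.normal(0, 0.1, size=(V, d))

# Step 2: a randomly initialized model that takes the embeddings as input:
# mean-pool the word vectors, then apply a logistic-regression layer.
w = rng.normal(0, 0.1, size=d)
b = 0.0

def forward(token_ids):
    x = E[token_ids].mean(axis=0)        # average word vector
    p = 1 / (1 + np.exp(-(x @ w + b)))   # P(positive sentiment)
    return x, p

# Step 3: train embeddings and model end to end with gradient descent.
data = [([0, 4], 1), ([1, 4], 1), ([2, 4], 0), ([3, 4], 0)]  # (ids, label)
lr = 0.5
for _ in range(200):
    for ids, y in data:
        x, p = forward(ids)
        g = p - y                        # d(loss)/d(logit) for cross-entropy
        E[ids] -= lr * g * w / len(ids)  # the gradient flows into the embeddings
        w -= lr * g * x
        b -= lr * g
```

Because the loss is backpropagated into `E` as well as `w`, the embedding rows for "good"/"great" and "bad"/"awful" drift apart during training, which is exactly the end-to-end fine-tuning the steps describe.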
Embeddings in transformer models
The same technique is used in transformer models. However, in transformer models, there are two different embeddings:
Token embeddings provide a unique representation for each token seen by the model in an input sequence.
Positional embeddings provide a unique representation for each position in the input sequence.
The token embeddings have a unique embedding vector for each token (such as a character, word, or subword), depending on the model's tokenizing mechanism.
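In practice, token embeddings are a lookup table: each token ID indexes a row of a matrix. A minimal sketch, assuming a hypothetical subword vocabulary (the tokens and IDs below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical subword vocabulary produced by the model's tokenizer.
vocab = {"trans": 0, "##former": 1, "model": 2, "[PAD]": 3}
d_model = 4

# One unique d_model-sized vector per token in the vocabulary.
token_embeddings = rng.normal(size=(len(vocab), d_model))

# Embedding an input sequence is just row lookup by token ID.
ids = [vocab["trans"], vocab["##former"], vocab["model"]]
sequence = token_embeddings[ids]   # shape: (sequence_length, d_model)
```

The same token always maps to the same row, which is why the token embedding alone cannot tell the model *where* in the sequence a token occurred.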
The positional embeddings signal to the model where in the input sequence each token appears. This is necessary because, unlike LSTMs/GRUs, transformer models have no inherent notion of sequence order: they process the whole text in one go. Furthermore, changing the position of a word can alter the meaning of a sentence or of a word. For example:
Ralph loves his tennis ball. It likes to chase the ball.
Ralph loves his tennis ball. Ralph likes to chase it.
In the sentences above, the word “it” refers to different things, and the position of the word “it” can be used as a cue to identify this difference. The original transformer paper uses the following equations to generate positional embeddings:
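For reference, the equations from "Attention Is All You Need" are PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A NumPy sketch of these formulas, assuming an even d_model:

```python
import numpy as np

def positional_embeddings(seq_len, d_model):
    """Sinusoidal positional embeddings from the original transformer paper.

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even.
    """
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]   # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)             # even dimensions get sine
    pe[:, 1::2] = np.cos(angle)             # odd dimensions get cosine
    return pe

pe = positional_embeddings(seq_len=50, d_model=8)
```

Each position gets a unique, fixed pattern of sines and cosines at different frequencies, so the model can distinguish (and relate) positions without any learned parameters.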