Transformer Networks
Learn about sequence-to-sequence (seq2seq) modeling and Transformer networks.
Sequence-to-sequence (seq2seq) modeling
Recurrent architectures such as RNNs have long dominated seq2seq modeling. These architectures process sequences, such as text, iteratively (i.e., one element at a time and in order). This sequential handling makes it hard for the model to learn long-range dependencies because of issues such as vanishing gradients. As the gap between relevant tokens grows, these models tend to lose information from early time steps, resulting in an incomplete understanding of context, which is essential for language understanding.
Let’s take a look at an example: “The cat that the dog chased ran up a tree.” This sentence contains a long-range dependency between an earlier word (“cat”) and a later word (“ran”). An RNN processes this sentence iteratively (i.e., token by token) and must carry what it learned about “cat” across all the intervening words. In this case, the RNN may fail to connect “cat” with “ran” because several words separate them.
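To make the vanishing-gradient issue concrete, here is a minimal sketch using a toy scalar RNN, h_t = tanh(w * h_{t-1} + u * x_t). The function name, the weights w and u, and the input values are all hypothetical choices for illustration; a real RNN uses weight matrices and learned parameters. The sketch multiplies the per-step derivatives together to show how the influence of an early state shrinks as the gap grows.

```python
import math

def gradient_through_time(inputs, w=0.5, u=1.0):
    """Return d h_T / d h_0 for a toy scalar RNN (illustrative weights, not learned)."""
    h = 0.0
    grad = 1.0
    for x in inputs:
        # One sequential step: the new state depends on the previous state.
        h = math.tanh(w * h + u * x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

# Compare a gap of 2 time steps with a gap of 10 time steps.
short_gap = gradient_through_time([1.0] * 2)
long_gap = gradient_through_time([1.0] * 10)
print(f"gap of 2 steps:  {short_gap:.6f}")
print(f"gap of 10 steps: {long_gap:.12f}")
```

Because every factor in the product is smaller than 1 in this setup, the gradient that links a late step to an early one decays quickly with distance, which is one way to picture why “ran” struggles to receive a training signal from “cat.”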
To solve this problem, what if we designed a model that processes the entire sequence “The cat that the dog chased ran up a tree” in parallel and captures the relationships between all pairs of tokens simultaneously? This is precisely what the Transformer model does. It models long-range dependencies across the entire sequence using the self-attention mechanism, which computes the relationship between every pair of tokens via dot-product attention.
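The idea of relating every pair of tokens through dot products can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the full Transformer layer: the two-dimensional embeddings are made up, the same vectors serve as queries, keys, and values (the learned projection matrices are omitted), and the scores are divided by the square root of the vector dimension, as is commonly done in Transformers. The names softmax and self_attention are hypothetical helpers.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Dot-product self-attention over token vectors X (no learned projections)."""
    d = len(X[0])
    outputs, weights = [], []
    for q in X:  # each token acts as a query against all tokens
        # Scaled dot product between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all value vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, X)) for i in range(d)])
    return outputs, weights

# Made-up 2-dimensional embeddings for three tokens of the example sentence.
tokens = ["cat", "chased", "ran"]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X)

for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

Each row of weights shows how strongly one token attends to every other token, regardless of how far apart they are in the sentence. Note that the loop over queries is written sequentially only for readability; the rows do not depend on one another, so in practice they are computed in parallel as matrix operations.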
Transformers
The transformer architecture was introduced in the paper “