Transformers Pros and Cons
Explore the advantages and drawbacks of full attention and transformer architectures.
Now, let's explore some of the pros and cons of full-attention (transformer) architectures.
Advantages and drawbacks of transformer architectures
Considering the design choices in natural language processing models, it’s essential to weigh the advantages and drawbacks of full-attention mechanisms, i.e., transformer architectures. The table below compares self-attention, recurrent, and convolutional layers along three dimensions: computational complexity per layer, the number of sequential operations required, and the maximum path length between any two positions.
| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
| --- | --- | --- | --- |
| Self-Attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(logₖ(n)) |

Here, n is the sequence length, d is the representation dimension, and k is the kernel size of the convolution.
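To make the table concrete, here is a small Python sketch that turns each complexity formula into a rough operation count. The example values of n, d, and k are assumptions for illustration, not figures from the lesson:

```python
# Rough per-layer operation counts for the three layer types in the table.
# n = sequence length, d = representation dimension, k = kernel size.

def self_attention_cost(n: int, d: int) -> int:
    return n * n * d          # O(n^2 * d): every pair of positions interacts

def recurrent_cost(n: int, d: int) -> int:
    return n * d * d          # O(n * d^2): one d x d state update per time step

def convolutional_cost(n: int, d: int, k: int) -> int:
    return k * n * d * d      # O(k * n * d^2): a width-k kernel at each position

d, k = 512, 3                 # assumed example values
for n in (40, 4000):          # a short sentence vs. a long document
    print(f"n={n}: attention={self_attention_cost(n, d):,} "
          f"recurrent={recurrent_cost(n, d):,} "
          f"convolutional={convolutional_cost(n, d, k):,}")
```

Running this shows the trade-off in the table's first column: self-attention is cheapest for short sequences but its quadratic term dominates as n grows.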
The "Maximum Path Length" in the table refers to the longest path through the network architecture, specifically for the self-attention mechanism. It’s a measure of how far information needs to travel between different input positions to influence the output at a given position. In the context of self-attention, it reflects the maximum number of sequential operations needed to establish relationships between distant tokens. For self-attention, the maximum path length is
Scalability comparison
Let's discuss scalability. Imagine we have a sequence of n tokens, each represented by a d-dimensional vector.
Self-attention: In self-attention, encoding all n tokens requires just one layer, and this happens in parallel. This results in O(1) sequential operations with a maximum path length of O(1). The complexity per layer is O(n² · d). Keep in mind that n is usually in the tens (e.g., 40 words in a sequence), while d is typically in the hundreds (e.g., 512), so the quadratic term stays small: for n = 40 and d = 512, n² · d ≈ 8.2 × 10⁵, compared to n · d² ≈ 1.05 × 10⁷ for a recurrent layer.
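To see the parallelism claim in code, the following sketch contrasts the n sequential steps of a recurrent update with the single batched product of attention-style mixing. The shapes are assumed example values, and the attention weights are simplified to a uniform stand-in since only the number of sequential steps matters here:

```python
import numpy as np

n, d = 40, 64                     # assumed example sizes
x = np.random.randn(n, d)

# Recurrent: O(n) sequential steps, since each state depends on the previous one.
W = 0.01 * np.random.randn(d, d)
h = np.zeros(d)
for t in range(n):                # this loop cannot run in parallel across positions
    h = np.tanh(x[t] + W @ h)

# Attention-style mixing: O(1) sequential steps, one shot over all positions
# (uniform weights stand in for the real (n, n) attention matrix).
A = np.full((n, n), 1.0 / n)
out = A @ x                       # all n outputs computed together
print(h.shape, out.shape)         # (64,) vs. (40, 64)
```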