
Transformers Pros and Cons


Explore the advantages and drawbacks of full attention and transformer architectures.

Now, let's explore some of the pros and cons of full-attention or transformer architectures.

Advantages and drawbacks of transformer architectures

Considering the design choices in natural language processing models, it’s essential to weigh the advantages and drawbacks associated with full-attention mechanisms or transformer architectures.

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
|---|---|---|---|
| Self-Attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(log_k(n)) |

The "Maximum Path Length" column in the table refers to the longest path information must travel through the network to connect any two input positions. It’s a measure of how far information needs to travel between different input positions to influence the output at a given position. For each layer type, it reflects the maximum number of sequential operations needed to establish relationships between distant tokens. For self-attention, the maximum path length is O(1), indicating that the model can capture dependencies between tokens in constant time, regardless of the distance between them.

Scalability comparison

Let's discuss scalability. Imagine we have n tokens as our input, which can represent words or image patches. Each token is associated with a dimension d, which can be the word embedding, image dimensions, or features encoded by a convolutional neural network (CNN). For each image patch, this dimension d serves as a representation. To compare the self-attention architecture with CNNs using a kernel of size k and a recurrent model over a sequence of length n (the number of tokens in self-attention), we need to understand their differences.
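The per-layer complexities in the table above can be made concrete with a quick back-of-the-envelope calculation. The values of n, d, and k below are illustrative choices, not figures from the lesson:

```python
# Rough per-layer operation counts from the complexity table,
# for illustrative values of n (tokens), d (dimension), k (kernel size).
n, d, k = 40, 512, 3

costs = {
    "self-attention": n**2 * d,      # O(n^2 * d)
    "recurrent":      n * d**2,      # O(n * d^2)
    "convolutional":  k * n * d**2,  # O(k * n * d^2)
}
for layer, ops in costs.items():
    print(f"{layer:>14}: ~{ops:,} operations")
```

With n in the tens and d in the hundreds, n ≪ d, so the self-attention layer's quadratic term in n is cheaper here than the recurrent and convolutional layers' quadratic terms in d.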

  • Self-attention: In self-attention, encoding n tokens requires just one layer, and this happens in parallel. This results in an O(1) sequential operation with a path length of one. With a complexity of O(n² × d), keeping in mind that the order of n is usually in the tens (e.g., 40 words in a sequence), while d ...
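The quadratic cost and the constant path length both come from the same place: the n × n attention matrix, in which every token scores every other token in a single parallel step. A minimal single-head scaled dot-product sketch (with no learned projections, a simplification for illustration only) makes this visible:

```python
import numpy as np

def self_attention(x):
    """Minimal scaled dot-product self-attention over x of shape (n, d).

    Omits the learned query/key/value projections of a real transformer;
    this sketch only illustrates the shape of the computation.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)  # (n, n) matrix: the O(n^2 * d) step
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

n, d = 40, 64  # e.g., a 40-token sequence with 64-dimensional embeddings
x = np.random.default_rng(0).normal(size=(n, d))
out, attn = self_attention(x)
print(out.shape, attn.shape)  # (40, 64) (40, 40)
```

Because the (n, n) weight matrix directly connects every pair of positions, token 0 can influence token 39 in one layer, which is exactly why the maximum path length in the table is O(1) while the cost grows as n².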
