
Transformers Pros and Cons


Explore the advantages and drawbacks of full attention and transformer architectures.

Now, let's explore some of the pros and cons of full-attention or transformer architectures.

Advantages and drawbacks of transformer architectures

Considering the design choices in natural language processing models, it’s essential to weigh the advantages and drawbacks associated with full-attention mechanisms or transformer architectures.

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
|---|---|---|---|
| Self-Attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(log_k(n)) |

The "Maximum Path Length" column in the table refers to the longest path information must travel through the network to connect any two input positions. It’s a measure of how far information needs to travel between different input positions to influence the output at a given position. For each layer type, it reflects the maximum number of sequential operations needed to establish relationships between distant tokens. For self-attention, the maximum path length is O(1), indicating that the model can capture dependencies between tokens in constant time, regardless of the distance between them.

Scalability comparison

Let's discuss scalability. Imagine we have n tokens as our input, which can represent words or image patches. Each token is associated with a dimension d, which can be the word embedding, image dimensions, or features encoded by a convolutional neural network (CNN). For each image patch, this dimension d serves as a representation. To compare the self-attention architecture with CNNs using a kernel of size k and a recurrent model over a sequence of length n (the number of tokens in self-attention), we need to understand their differences.
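The per-layer complexities in the table above can be made concrete with a quick back-of-the-envelope calculation. The values of n, d, and k below are illustrative choices, not figures from the lesson:

```python
# Rough per-layer operation counts from the complexity table,
# for illustrative values of n (tokens), d (dimension), k (kernel size).
n, d, k = 40, 512, 3

costs = {
    "self-attention": n**2 * d,      # O(n^2 * d)
    "recurrent":      n * d**2,      # O(n * d^2)
    "convolutional":  k * n * d**2,  # O(k * n * d^2)
}
for layer, ops in costs.items():
    print(f"{layer:>14}: ~{ops:,} operations")
```

With n in the tens and d in the hundreds, n ≪ d, so the self-attention layer's quadratic term in n is cheaper here than the recurrent and convolutional layers' quadratic terms in d.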

  • Self-attention: In self-attention, encoding n tokens requires just one layer, and this happens in parallel. This results in an O(1) sequential operation with a path length of one. With a complexity of O(n² × d), keeping in mind that the order of n is usually in the tens (e.g., 40 words in a sequence), while d ...
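The quadratic cost and the constant path length both come from the same place: the n × n attention matrix, in which every token scores every other token in a single parallel step. A minimal single-head scaled dot-product sketch (with no learned projections, a simplification for illustration only) makes this visible:

```python
import numpy as np

def self_attention(x):
    """Minimal scaled dot-product self-attention over x of shape (n, d).

    Omits the learned query/key/value projections of a real transformer;
    this sketch only illustrates the shape of the computation.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)  # (n, n) matrix: the O(n^2 * d) step
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

n, d = 40, 64  # e.g., a 40-token sequence with 64-dimensional embeddings
x = np.random.default_rng(0).normal(size=(n, d))
out, attn = self_attention(x)
print(out.shape, attn.shape)  # (40, 64) (40, 40)
```

Because the (n, n) weight matrix directly connects every pair of positions, token 0 can influence token 39 in one layer, which is exactly why the maximum path length in the table is O(1) while the cost grows as n².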
