The Rise of Transformers

Get introduced to the evolution of transformers and their role in the competition with convolutional models in computer vision.

Let's start an exciting journey through the world of transformer networks, focusing on their emergence, key concepts, and impact on various domains of machine learning.

The rise of transformers

Transformers have shown a remarkable ascent in recent years within the deep learning community. We'll examine three crucial factors that fueled their success:

  • Inductive bias: We'll discuss how transformers come with their unique built-in assumptions about data.

  • Self-attention mechanism: We'll explore how this innovative mechanism allows models to weigh the importance of different data parts when making predictions.

  • Unsupervised pretraining: We'll explain the concept of training models on large, unlabeled datasets before fine-tuning them for specific tasks.

In the past few years, we've witnessed a sudden burst of major AI advancements, such as GPT, DALL·E, and Tesla's Full Self-Driving (FSD) system. Despite decades of AI research, these breakthroughs are relatively recent.

So, what has suddenly changed?

The answer lies in a combination of factors, including the rapid growth of available GPU and cloud computing resources. But more importantly, it's about a revolutionary new software model for neural networks. The transformer model, introduced by Vaswani et al. in the groundbreaking 2017 paper "Attention Is All You Need" (Advances in Neural Information Processing Systems, 30, 2017), has propelled AI capabilities to new heights. It outperformed previous deep learning techniques and enabled groundbreaking progress in AI research.

While increased computing power played a vital role, it's essential to highlight that hardware advancements alone wouldn't have caused such a leap in capability. It required a software leap, and that's where transformers come in. This innovative architecture effectively harnesses the growing computational resources, enabling AI models to tackle more complex problems than ever before.

Attention

One of the key innovations of transformers is the concept of attention. Attention mechanisms allow the model to focus on relevant information while ignoring irrelevant parts of the input data. This ability is especially valuable in tasks like natural language processing, where context is crucial. Moreover, attention mechanisms can be computed in parallel, making the architecture highly efficient and scalable.
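To make this concrete, here is a minimal NumPy sketch of the scaled dot-product attention at the heart of transformers. It's an illustrative toy, not a full multi-head implementation: the random matrices q, k, and v stand in for the learned query, key, and value projections of three tokens.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    # Similarity of every query to every key, scaled for numerical stability.
    scores = q @ k.T / np.sqrt(d_k)
    # Softmax over keys turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted average of the value vectors.
    return weights @ v

# Toy example: 3 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(3, 4))
v = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(q, k, v).shape)  # (3, 4)
```

Notice that every token attends to every other token in a single matrix multiplication, which is exactly why the computation parallelizes so well compared to step-by-step recurrent models.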

Another advantage of attention mechanisms is that they provide the network with a form of memory. This "memory" allows the model to capture long-range dependencies and relationships within the data. This capability is essential for tasks like language modeling, image generation, and autonomous driving.

In the past, recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) attempted to handle sequence-based problems in NLP, speech recognition, and time series analysis. However, these models faced challenges like vanishing and exploding gradients, limiting their ability to capture long-range dependencies. The transformer architecture overcame these limitations by using the attention mechanism effectively. Additionally, transformers process input sequences in parallel, enabling faster training and inference. Their success in modeling complex patterns and efficient processing led to their rapid rise, surpassing earlier approaches like RNNs and LSTMs.

In summary, the recent surge in AI capabilities can be attributed to both hardware and software innovations, with the transformer architecture playing a central role. By effectively leveraging growing computational resources and introducing the attention mechanism in the Attention is All You Need paper, transformers have unlocked new possibilities in AI research and applications, leading to groundbreaking advances like GPT, DALL·E, and Tesla FSD. As we continue to explore the potential of transformers and other AI techniques, it's exciting to think about what other revolutionary developments might be just around the corner.

Transformers in NLP and computer vision

Next, we'll examine the adoption of transformers in NLP and in computer vision separately. We'll show you how transformers quickly became dominant in NLP, replacing traditional models like recurrent neural networks, a transition marked by the rise of unsupervised pretraining and models like BERT and GPT.

Universal architecture

Now, let's examine how the transformer network performed in computer vision and NLP individually. As mentioned earlier, transformers took the lead in 2017 in NLP but only started making an impact in computer vision in 2020.

BLEU scores (higher is better) of single models on the English to German translation benchmark

In the world of NLP, transformers have emerged triumphant in pivotal tasks such as machine translation, consistently achieving state-of-the-art performance as measured by the BLEU metric. BLEU (BiLingual Evaluation Understudy) is a metric crafted specifically for the automated evaluation of machine-translated text. It produces a score between zero and one that gauges how closely the machine-translated text resembles a predefined set of high-quality reference translations.
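As a quick illustration, the snippet below scores a toy translation with the BLEU implementation in NLTK (assuming the nltk package is installed); the sentences are made up for the example.

```python
from nltk.translate.bleu_score import sentence_bleu

# One or more reference translations, tokenized into words.
reference = [["the", "cat", "sat", "on", "the", "mat"]]
# A hypothetical machine-translated candidate.
candidate = ["the", "cat", "sat", "on", "a", "mat"]

# Default BLEU averages 1-gram through 4-gram precision.
score = sentence_bleu(reference, candidate)
print(f"BLEU: {score:.3f}")  # roughly 0.54: close, but not a perfect match
```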

Initially, attention mechanisms were introduced alongside traditional NLP models like recurrent neural networks and long short-term memory networks (LSTMs), a type of RNN designed to address the vanishing gradient problem and retain sequential patterns more effectively. However, in 2017, with the introduction of the Attention is All You Need paper, full attention models, or transformer models, took over the NLP domain.

From that point on, transformers dominated all NLP tasks, from text classification to sequence-to-sequence tasks like chatbots and question answering, fully replacing RNNs. Interestingly, the success of transformers in NLP hinged on unsupervised pretraining, which led to the rise of large-scale language models like BERT and GPT. This paved the way for significant advancements in NLP, a period often described as NLP's "ImageNet moment."

This highlights the friendly competition between computer vision and NLP, and how versatile deep learning models like transformers are across both fields and beyond.

The race continues

Let’s explore how transformers gradually made their way into computer vision, showcasing their integration with attention mechanisms alongside convolutional models from 2014 to 2020. While transformers have made significant progress, a mix of attention and convolution is still prevalent in computer vision, mainly due to the computational cost of attention.

Brief summary of key developments in attention in computer vision

The timeline above briefly summarizes key developments in attention mechanisms in computer vision, which loosely fall into four phases:

  • Phase 1: Adopted RNNs to construct attention.

  • Phase 2: Explicitly predicted important regions.

  • Phase 3: Implicitly completed the attention process.

  • Phase 4: Used self-attention methods.

Transformers everywhere

Just like in NLP, attention mechanisms began to be adopted in computer vision between 2014 and 2020. During this time, attention mechanisms were primarily coupled with convolutional models, which were the mainstay of computer vision. Mechanisms like the Recurrent Attention Model (RAM) helped networks focus on informative image regions, while modules like the Channel Attention Module (CAM) reweighted feature channels. However, pure transformer structures, in the spirit of Attention is All You Need in NLP, did not gain prominence until 2020.

Two significant networks emerged that were not purely attention-based but combined convolutional neural networks (ConvNets or CNNs) as feature extractors with transformer networks. These were the DETR (DEtection TRansformer) model from Facebook Research (now Meta Research), which uses a transformer encoder-decoder for end-to-end object detection without handcrafted anchor boxes or non-maximum suppression, and the vision transformer (ViT) from Google, which applies a transformer encoder directly to sequences of image patches. These models apply attention over image representations to tackle core computer vision tasks, such as image classification and object detection.
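To give a flavor of how ViT turns an image into a token sequence, here is a minimal PyTorch-style sketch of patch embedding. The sizes follow the common ViT-Base configuration (224x224 images, 16x16 patches, 768-dimensional embeddings), but the class itself is a simplified illustration rather than the official implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and project each one to a token embedding."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution performs patchify + linear projection in one step.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.proj(x)                     # (batch, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (batch, 196, 768) token sequence

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

Once the image is a sequence of 196 tokens, the rest of the model can be a standard transformer encoder, just as in NLP.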

As we can see, the same encoder-decoder architecture traditionally used with ConvNets/CNNs found its place in transformers. The architecture is universal in nature, applying to both NLP and computer vision.

Unlike in NLP, transformers have not entirely replaced existing components in computer vision. The best-performing architectures still blend attention and ConvNets. Even in models like ViT and DETR, ConvNets are often used as feature extractors because the attention operation is computationally expensive. ConvNets help reduce dimensionality, and with it the cost of attention operations, proving valuable across various computer vision tasks.

Compared with attention, convolution operations are also more hardware-optimized and are backed by libraries with highly tuned parallel implementations. The attention function, by contrast, has historically been less optimized. Therefore, computer vision models still depend on convolution operations in conjunction with transformers.

Moreover, transformers are not ideal for hardware and embedded deployment because of their large size. They need large pre-trained models, built through unsupervised pretraining, to perform well, and given that model capacity, compressing them is a substantial challenge.

Multitasking with transformers

Transformers have infiltrated every computer vision task, from image classification to detection, segmentation, and even 3D point cloud classification and segmentation. We've also seen the emergence of temporal transformers that handle the time factor in video data. These temporal transformers naturally extend from sequential transformers used in NLP because words in the text are sequential, similar to frames in a video.

Revolutionizing computer vision and the rise of temporal transformers

In this course, we’ll explore spatial-temporal transformers, which can perform activities like video forecasting, recognition, and moving object detection. Additionally, we'll learn about multitask learning, where multiple tasks can be addressed using the same transformer encoder with multiple decoders. This concept can be coupled with spatial and temporal attention mechanisms using transformers.
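As a rough sketch of this idea, the toy PyTorch model below runs one shared transformer encoder and feeds its output to two task-specific heads. The head names, pooling choice, and sizes are invented for illustration, not taken from any specific paper.

```python
import torch
import torch.nn as nn

class MultiTaskTransformer(nn.Module):
    """One shared transformer encoder feeding several task-specific heads."""
    def __init__(self, embed_dim=256, num_classes=10, box_dims=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Each task gets its own lightweight head on top of shared features.
        self.classify_head = nn.Linear(embed_dim, num_classes)
        self.detect_head = nn.Linear(embed_dim, box_dims)

    def forward(self, tokens):              # tokens: (batch, seq_len, embed_dim)
        features = self.encoder(tokens)     # shared representation for all tasks
        pooled = features.mean(dim=1)       # simple pooling for classification
        return self.classify_head(pooled), self.detect_head(features)

model = MultiTaskTransformer()
cls_logits, box_preds = model(torch.randn(2, 196, 256))
print(cls_logits.shape, box_preds.shape)  # (2, 10) and (2, 196, 4)
```

The appeal of this design is that the expensive encoder is computed once and shared, while each decoder head stays small and task-specific.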

Transformers have also expanded their reach into low-level vision tasks like super-resolution, image enhancement, and colorization. More recently, they've been applied in generative models, enabling tasks such as visual question answering, text-to-image generation, image-to-text generation, and other multimodal tasks. As this breadth shows, transformers now cover a wide array of computer vision areas.