Overview of Transformers

Learn how transformers revolutionized the field of deep learning.

Transformer models in conversational AI

Over the last few decades, multiple developments in the field of natural language processing (NLP) have culminated in large language models (LLMs), most notably through the introduction of transformers. Transformers were introduced by Ashish Vaswani et al. in the 2017 paper “Attention Is All You Need.”

Transformers revolutionized the field of deep learning, offering a modern architecture that outperforms the recurrent neural networks (RNNs) and long short-term memory (LSTM) networks that were previously the standard for sequence modeling. This architecture not only simplifies the structure of neural networks but also significantly reduces training time.

The evolution of NLP through time

Deep neural networks had already been in development for decades. RNNs (recurrent neural networks) rose to prominence in the 1990s, and LSTMs (long short-term memory networks) followed in 1997. The basic attention mechanism became popular in neural network architectures around 2014, where it improved the performance of various sequential models, including RNNs, LSTMs, and GRUs (gated recurrent units). The transformer model was introduced in the paper “Attention Is All You Need” in 2017. BERT (Bidirectional Encoder Representations from Transformers), released by researchers at Google in 2018, became one of the first models to apply the transformer architecture to NLP tasks. Since 2018, transformer models have been widely adopted, with many adaptations and improvements; models such as GPT, T5, and others demonstrate the flexibility and effectiveness of the architecture. As of 2020, transformers are used extensively in generative AI, with models such as GPT-3 showing impressive capabilities for generating human-like text.

Transformers process text by first tokenizing it. Tokenization is the process of converting text into smaller units, or tokens, such as words or sub-words. This step is crucial for transforming natural language into a format that the model can process. These tokens are then mapped to vector representations using embedding tables, allowing the model to understand and generate text. Transformers power many applications that we use on a daily basis, such as text completion features in smartphone messaging apps (next-word prediction and auto-correction).
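To make this concrete, here is a minimal sketch of tokenization followed by an embedding lookup. The vocabulary, sentence, and embedding dimension are toy values invented for illustration; real models use learned sub-word tokenizers (such as BPE) and far larger embedding tables.

```python
import numpy as np

# Toy vocabulary and embedding table, invented for illustration.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def tokenize(text):
    # Naive whitespace tokenization; unknown words map to <unk>.
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

tokens = tokenize("The cat sat")      # -> [0, 1, 2]
vectors = embedding_table[tokens]     # one vector per token: (3, embedding_dim)
print(tokens)
print(vectors.shape)
```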

Device keyboard

Once the text is embedded, the attention mechanism within the transformer model processes and interprets the input data, enabling a more nuanced understanding of the text and stronger generation capability. Essentially, the attention mechanism allows the model to focus on different parts of the input data when generating each word in the output, attending to the most relevant words at each step of the sequence. This is achieved by calculating how much importance each word in the input sequence should receive relative to the other words when predicting a specific word in the output. The self-attention mechanism uses sets of queries, keys, and values derived from the input data to perform this calculation. As a result, transformers can understand context and the relationships between words. This ability to allocate attention across the input sequence allows transformers to generate responses that enhance the quality of interaction in applications such as chatbots.
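The following is a minimal NumPy sketch of this calculation for a single attention head. The dimensions and random projection matrices are toy values chosen for illustration; production implementations add multiple heads, masking, and learned parameters.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of embeddings X.

    X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_k) projections.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # How relevant each word is to every other word, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of the value vectors.
    return weights @ V

# Toy dimensions, invented for illustration.
rng = np.random.default_rng(0)
d_model, d_k, seq_len = 8, 4, 3
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (3, 4)
```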

The output of the self-attention mechanism is then passed through a feed-forward neural network, which processes that data before it contributes to the final output. In practical applications, such as composing messages in a messaging app, a couple of words are suggested to the user. Under the hood, the sentence is sent to a neural network that predicts the next possible words as a probability vector, as shown below.

Neural network
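As a rough illustration of that probability vector, the sketch below applies a softmax to hypothetical next-word scores; the candidate words and score values are invented for this example.

```python
import numpy as np

# Hypothetical scores (logits) a network might assign to candidate
# next words for some prefix -- the values here are invented.
candidates = ["home", "to", "out", "away"]
logits = np.array([2.1, 1.3, 0.4, -0.5])

# Softmax turns the raw scores into a probability vector.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for word, p in sorted(zip(candidates, probs), key=lambda x: -x[1]):
    print(f"{word}: {p:.2f}")
```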

This predictive capability, stemming from the transformer’s ability to weigh the context and relevance of each word in the sequence, allows for the generation of contextually relevant suggestions, enhancing the user experience.

Understanding transformer architecture

Although the transformer architecture is easier to understand than recurrent neural networks, it still consists of many blocks and layers, each comprising several sub-layers. Below is the famous transformer architecture:

The transformer architecture

To understand transformers, we need to separate their architecture into two major blocks: the encoder (on the left side of the preceding picture) and the decoder (on the right side).
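Before walking through each block, it can help to see this two-block structure in code. The sketch below instantiates PyTorch’s built-in nn.Transformer module with the hyperparameters of the base model from the original paper; the random tensors stand in for already-embedded source and target sequences, since this module deliberately leaves tokenization, embedding, and positional encoding to the user.

```python
import torch
import torch.nn as nn

# Encoder-decoder transformer with the base hyperparameters from
# "Attention Is All You Need": d_model=512, 8 heads, 6+6 layers.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    batch_first=True,
)

# Random stand-ins for embedded sequences: (batch, seq_len, d_model).
src = torch.rand(1, 10, 512)  # source sequence fed to the encoder
tgt = torch.rand(1, 7, 512)   # target sequence fed to the decoder

out = model(src, tgt)
print(out.shape)  # torch.Size([1, 7, 512])
```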

The encoder

  1. The text is sent to the transformer model.

  2. The text is encoded using tokenization and embedding methods.

  3. Positional encoding is applied to the embedding vectors from the previous step to preserve the order of the words in the sentence or paragraph (see the sketch after this list).

  4. Self-attention using query, key, and value vectors is performed on the positionally encoded vectors. The dot product is taken between ...
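To make step 3 concrete, here is a minimal NumPy sketch of the sinusoidal positional encoding described in “Attention Is All You Need”; the sequence length and model dimension are toy values chosen for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need":
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, np.newaxis]                # (seq_len, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)  # even dimensions
    pe[:, 1::2] = np.cos(positions / div)  # odd dimensions
    return pe

# Adding the encoding to the token embeddings injects word order.
embeddings = np.zeros((3, 8))  # placeholder embeddings for 3 tokens
encoded = embeddings + positional_encoding(3, 8)
print(encoded.shape)  # (3, 8)
```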
