A transformer is a type of neural network architecture that can process a whole sequence at once and learn the relationships between its elements. Transformers build on the sequence-to-sequence (Seq2Seq) architecture, a fundamental pattern in natural language processing (NLP) and machine learning (ML) that maps input sequences to output sequences.
A sequence is an ordered collection of correlated data elements, such as the pixels in an image or the words in a sentence. Each element in a sequence is a feature vector correlated with the other vectors in the sequence, which together form the context of that vector.
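For instance, a sentence can be represented as a sequence of feature vectors via an embedding lookup. A minimal sketch (the vocabulary, dimensions, and random embeddings are illustrative assumptions, not a trained model):

```python
import numpy as np

# Toy vocabulary and randomly initialized embeddings (illustrative only;
# real models learn these vectors during training).
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 4))   # 4-dimensional feature vectors

sentence = ["the", "cat", "sat", "on", "the", "mat"]
sequence = embeddings[[vocab[w] for w in sentence]]
print(sequence.shape)   # (6, 4): six elements, each a 4-dimensional feature vector
```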
In this Answer, we’ll discuss how the sequence-to-sequence architecture is deployed in transformers.
Transformer attention mechanisms refine the model’s ability to understand and process sequential data efficiently. Unlike traditional recurrent neural networks (RNNs), which process sequences one element at a time, transformers use mechanisms such as self-attention to represent and process all elements of the input sequence in parallel. This allows the model to deduce the dependencies between each element and the other contextual elements simultaneously.
The transformer model has two types of attention mechanisms:
Self-attention: Self-attention determines how relevant the elements within a sequence are to one another. The model focuses on the most relevant information based on the weights calculated for each element. In the example below, the data element “it” points to “cat,” not “mat,” based on the weight depicted by the shade of blue; the darker the shade, the higher the weight.
The self-attention mechanism in transformers computes a weight for each data element in the sequence. This helps the model focus on the important elements and extract relevant information while processing the sequence. Using these weights and scores, transformers generate focused contextual vectors. Transformers also use multi-head attention, where self-attention is performed multiple times in parallel.
Encoder-decoder attention: This type of attention involves two sequences, one as a source input sequence and the other as an output sequence. Encoder-decoder attention extracts the information about the connection between the two sequences to efficiently transform from input to output.
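Both attention types boil down to the same scaled dot-product computation. Below is a minimal numpy sketch of self-attention; the dimensions and random projection matrices are illustrative assumptions, not a trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.
    X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projections.
    Returns the contextual vectors and the attention weight matrix."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # toy sequence: 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
context, weights = self_attention(X, Wq, Wk, Wv)
print(context.shape, weights.shape)           # (4, 8) (4, 4)
```

Each row of `weights` tells us how strongly one token attends to every other token; in encoder-decoder attention, the only difference is that the queries come from one sequence and the keys and values from the other.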
Attention plays a crucial role in enabling transformers to process sequential data efficiently and effectively. Let’s look at how the attention mechanism fits into the transformer architecture.
The core of sequence-to-sequence models is the use of two neural networks—an encoder and a decoder—to transform input sequences into fixed-size context vectors and decode output sequences from the vectors.
The encoder maps the input sequence into a fixed-size vector called the context vector. The context vector is then used to generate the output sequence.
In transformers, however, the encoder contains multiple layers of self-attention mechanisms followed by feedforward networks. The self-attention mechanism allows the encoder to calculate how important each data element of the sequence is relative to the rest of the data elements. This weighting of data elements helps capture both the local and global context of the input sequence.
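A single encoder layer can be sketched as self-attention followed by a position-wise feedforward network, each wrapped in a residual connection and layer normalization. This is a minimal sketch: biases and multi-head splitting are omitted for brevity, and all weights are random stand-ins rather than trained parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def encoder_layer(X, Wq, Wk, Wv, W1, W2):
    # 1) Self-attention sublayer with a residual connection.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V
    X = layer_norm(X + A)
    # 2) Position-wise feedforward sublayer (ReLU), also with a residual.
    F = np.maximum(0.0, X @ W1) @ W2
    return layer_norm(X + F)

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))                   # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

# Stack multiple layers, as the transformer encoder does.
H = X
for _ in range(2):
    H = encoder_layer(H, Wq, Wk, Wv, W1, W2)
print(H.shape)   # (5, 8): same shape in and out, so layers stack cleanly
```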
The decoder takes the encoder’s output as input: the fixed-size context representation generated from the input sequence is fed to the decoder, which then generates the data elements of the output sequence.
Similar to the encoder, the decoder in transformers consists of multiple layers, each with a self-attention mechanism followed by a cross-attention mechanism over the encoder’s output. Cross-attention enables the decoder to focus on the relevant parts of the input sequence while generating each element of the output sequence, which facilitates alignment between the input and output sequences.
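A decoder layer can be sketched as masked self-attention over the target sequence (each position may only attend to itself and earlier positions) followed by cross-attention over the encoder’s output. The feedforward sublayer and normalization are omitted for brevity, and all weights are random stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    return softmax(scores, axis=-1) @ V

def decoder_layer(Y, enc_out, p):
    # 1) Masked self-attention: a lower-triangular (causal) mask stops each
    #    target position from peeking at future positions.
    t = Y.shape[0]
    causal = np.tril(np.ones((t, t), dtype=bool))
    Y = Y + attention(Y @ p["q1"], Y @ p["k1"], Y @ p["v1"], mask=causal)
    # 2) Cross-attention: queries come from the decoder, keys/values from the
    #    encoder output, aligning target positions with source positions.
    Y = Y + attention(Y @ p["q2"], enc_out @ p["k2"], enc_out @ p["v2"])
    return Y

rng = np.random.default_rng(3)
d = 8
p = {k: rng.normal(size=(d, d)) for k in ["q1", "k1", "v1", "q2", "k2", "v2"]}
enc_out = rng.normal(size=(6, d))   # encoder output for 6 source tokens
Y = rng.normal(size=(4, d))         # 4 target tokens generated so far
out = decoder_layer(Y, enc_out, p)
print(out.shape)                    # (4, 8)
```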
A text translation model uses an encoder-decoder mechanism to translate text from French to English. Look at the image below to understand how it works:
The encoder in a translator can be thought of as a person who speaks French and Korean, and the decoder as a person who speaks English and Korean. To translate, the encoder converts French into Korean (the context vector), and the decoder takes that Korean and translates it into English. Because Korean is the language both parties understand, it serves as the intermediate state: the context vector.
Deploying the sequence-to-sequence architecture in transformers has optimized the training process and made it efficient to handle long data sequences. Sequence-to-sequence transformers are used in multiple roles; let’s discuss some of them here:
Machine translation: Transformers use the Seq2Seq model to translate text from one language to another. The encoder takes a sentence in one language as input and produces a fixed-length context vector. The decoder is then given the context vector as input, and the text it generates is the translation of the input sentence.
Text summarization: Transformers employ the Seq2Seq architecture to perform both extractive (extracting and combining the important parts of the content) and abstractive (generating new summaries) text summarization. The document to be summarized is fed as input to the encoder, creating a contextual vector, which the decoder then transforms into a concise summary. Transformers use self-attention to extract the important information and generate accurate summaries that reflect the most important parts of the document.
Conversational AI: Seq2Seq models power chatbots and virtual assistants by understanding the context of user queries and generating relevant responses. Transformers’ bidirectional nature allows them to retain the context of a conversation for longer, which leads to more human-like and engaging conversations.
Speech recognition: Seq2Seq architecture is used in speech recognition systems to transcribe audio to text, which is most commonly used in voice assistants, speech-to-text services, etc. Transformers are well-suited for speech recognition tasks that involve lengthy audio sequences because of their ability to extract long-range dependencies.
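To make the machine-translation flow above concrete, here is a deliberately tiny, untrained sketch of the encode-then-decode loop: the “encoder” compresses the source sentence into one fixed-size context vector, and the “decoder” greedily emits target tokens from it. Every vocabulary entry, weight matrix, and update rule below is an illustrative assumption, not a real translation model:

```python
import numpy as np

# Toy, untrained stand-in for a translation model (illustrative only).
src_vocab = {"le": 0, "chat": 1, "dort": 2}
tgt_vocab = ["<eos>", "the", "cat", "sleeps"]

rng = np.random.default_rng(42)
src_emb = rng.normal(size=(len(src_vocab), 8))
tgt_emb = rng.normal(size=(len(tgt_vocab), 8))
W = rng.normal(size=(8, 8))          # stand-in for learned decoder weights

def encode(tokens):
    # Fixed-size context vector: here, simply the mean of source embeddings.
    return src_emb[[src_vocab[t] for t in tokens]].mean(axis=0)

def decode(context, max_len=4):
    out, state = [], context
    for _ in range(max_len):
        logits = tgt_emb @ (W @ state)           # score each target token
        idx = int(np.argmax(logits))             # greedy choice
        if tgt_vocab[idx] == "<eos>":
            break
        out.append(tgt_vocab[idx])
        state = 0.5 * state + 0.5 * tgt_emb[idx] # fold the emitted token back in
    return out

print(decode(encode(["le", "chat", "dort"])))
```

With random weights the output tokens are arbitrary; the point is the shape of the loop: one fixed-size vector in, a token-by-token sequence out.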
Due to its interpretability and improved attention mechanism, the Seq2Seq architecture has proven beneficial compared to traditional models like RNNs. It has revolutionized many NLP tasks by capturing long-range dependencies and generating efficient contextual information for a given sequence. Research is underway to improve the attention mechanism and further optimize Seq2Seq models for real-world challenges.
Quiz!
What is the output of the encoder in a transformer architecture?
Same length as the input sequence
Single vector
Fixed-size vector representing the input sequence
None of them