In this answer, we'll take a close look at the transformer: what makes it so special and how it works.
Transformers were introduced in 2017 and were originally used for neural machine translation, for example, translating English to French and vice versa. They have since been found to perform very well on natural language processing tasks in general.
Machine translation is part of the broader field of Natural Language Processing (NLP). Transformers were found to beat many NLP benchmarks for translation. They are also the first machine learning architecture capable of generating long, coherent texts that make sense from start to finish, which makes their output comparable to human writing. In their most advanced production applications, transformers are used to automatically write computer code, e.g., GitHub Copilot.
Another example is DALL-E 2, which generates realistic images from text prompts.
Another application is in computational biology, where transformers have been used to tackle the long-standing problem of protein structure prediction. Transformers are also a promising technology in genomics; an example is DNABERT, which has achieved state-of-the-art results on genome analysis tasks.
Transformers have also allowed us to apply transfer learning to NLP. Transfer learning is a method that takes the weights of a model trained on one task and reuses them for a new task to get a better result. Previously, transfer learning was mainly applied in computer vision, where models were pre-trained on supervised data such as ImageNet. With the advent of transformers, we can apply transfer learning to NLP without even needing supervised data for pre-training.
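To make this concrete, here is a minimal sketch of what transfer learning in NLP can look like using the Hugging Face `transformers` library (one common tool for this; the library, model name, and example data below are illustrative choices, not something prescribed above):

```python
# A minimal transfer-learning sketch with the Hugging Face `transformers` library.
# The model name and tiny dataset are purely illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Start from weights pre-trained on large amounts of unlabeled text.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # pre-trained transformer encoder
    num_labels=2,         # a fresh classification head for the new task
)

# Fine-tune on a small labeled dataset for the downstream task.
batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
loss = model(**batch, labels=labels).loss
loss.backward()  # gradients also flow into the pre-trained weights
```

The key point is that the expensive pre-training is done once on unlabeled text, and only a small labeled dataset is needed to adapt the model to the new task.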
A transformer is a deep learning model that relies on the self-attention mechanism.
Transformers are neural networks built from transformer blocks, in the same way that Convolutional Neural Networks (CNNs) are built from convolutional blocks and recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks are built from LSTM blocks. To understand the transformer block, we need to understand the attention mechanism, which the transformer block is based on.
Like RNNs, transformers are designed to handle sequential data. However, unlike RNNs, which sequentially compute a hidden state for every time step of the input sequence, transformers do not possess a recurrent structure and can process all time steps at once. This means that, when provided with enough training data, the attention mechanism can perform better than an RNN.
From the figure above, we can see that a hidden state is computed for each time step of the input sequence.
A typical example of the importance of the attention mechanism is in language translation, where context is essential in assigning the meaning of a given word in a sentence. For example, in a Spanish-to-English translation system, the first word of the English output is highly dependent on the first few words of the Spanish input. This is not the case in a classical LSTM model: to produce the first word of the English output, the model receives only the state vector obtained after it has processed the last Spanish word. In practice, information from early in the input is poorly preserved in this single vector, and this is where the attention mechanism comes in. With attention, the decoder is allowed access to the state vectors of all the Spanish input words, not just the last one, and it learns attention weights that determine how much attention to pay to each Spanish state vector.
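As a rough sketch of the idea (not the exact implementation of any particular translation system), the snippet below computes softmax-normalized attention weights for one decoder query over a set of encoder state vectors, using the scaled dot-product scoring that transformers use. All names and sizes are illustrative:

```python
import numpy as np

def attention_weights(query, keys):
    """Softmax-normalized attention weights of one query over all keys."""
    d_k = keys.shape[-1]
    scores = keys @ query / np.sqrt(d_k)   # one score per encoder state
    scores -= scores.max()                 # subtract max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

# Illustrative example: 5 encoder state vectors (one per Spanish word),
# each of dimension 8, and one decoder query vector.
encoder_states = np.random.randn(5, 8)
decoder_query = np.random.randn(8)

weights = attention_weights(decoder_query, encoder_states)  # shape (5,), sums to 1
context = weights @ encoder_states                          # weighted sum of states, shape (8,)
print(weights.round(3), weights.sum())
```

The decoder then uses this weighted combination of all the input states, instead of relying on the last state vector alone.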
The transformer has an encoder part and a decoder part. More precisely, the original architecture has a stack of six encoders and a stack of six decoders. The encoders are on the left side, and the decoders are on the right-hand side. This structure is illustrated in the figure below:
The encoder of a transformer consists of a self-attention layer (multi-head attention), which pays attention to the sentence passed to it, and a feed-forward neural network layer. The decoder, on the other hand, consists of two attention layers, a masked self-attention layer and an encoder-decoder attention layer, followed by a feed-forward neural network layer. This is illustrated in the figure below:
The parallelization comes from how the data is fed into the network. All the words of a sentence are fed into the network (specifically the encoder) at the same time. In the first step, inside the self-attention layer, every word of the sentence is compared to every other word, so that there is communication between the words. In the second step, the feed-forward neural network, the words are passed through a feed-forward network separately, so they do not exchange any information. However, the feed-forward network that each word passes through is the same within a given layer.
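A simplified PyTorch sketch of these two steps is shown below (this is an illustration of the idea, not the original implementation; the dimensions are the defaults from the original paper, but the residual and normalization details are kept minimal):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """A simplified transformer encoder block: self-attention + shared FFN."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # The same feed-forward network is applied to every position separately.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        # Step 1: every word attends to every other word (information exchange).
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Step 2: each position goes through the same FFN independently.
        x = self.norm2(x + self.ffn(x))
        return x

# All the words of a sentence are fed in at the same time.
words = torch.randn(1, 6, 512)                 # 1 sentence, 6 words, 512-dim vectors
out = EncoderBlock()(words)                    # same shape: (1, 6, 512)
```

Because no position has to wait for the previous one, the whole sentence can be processed in parallel.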
All the inputs that go into either the encoder or the decoder are embedded. Embedding here means that each input token is converted into a vector of numbers that the model learns during training. The figure below shows the input and output of a transformer model.
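As a small illustrative sketch of this embedding step (the toy vocabulary and sizes below are made up for the example), each word is mapped to an integer ID and then looked up in a learned embedding table:

```python
import torch
import torch.nn as nn

# A toy vocabulary; a real model would use a much larger, learned tokenizer vocabulary.
vocab = {"<pad>": 0, "the": 1, "transformer": 2, "is": 3, "powerful": 4}

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=512)

sentence = ["the", "transformer", "is", "powerful"]
token_ids = torch.tensor([[vocab[w] for w in sentence]])  # shape (1, 4)

vectors = embedding(token_ids)  # shape (1, 4, 512): one learned vector per word
```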
As seen in the figure above, positional encodings are added because a transformer does not have any recurrence, meaning the model has no way of knowing which word comes first or last, or where in the sentence a word appears. Therefore, positional encoding is added so that, for every word fed to the model, there is information telling the model where that word occurs in the given sentence. For the output, a linear layer and a softmax are added at the end of the decoders so the output can be transformed into something meaningful to a human. This is expressed in the equation given below:
Where the matrices