In this answer, we'll take a close look at the transformer: what makes it so special and how it works.
Transformers were introduced in 2017 and were originally used for neural machine translation, for example, translating English to French and vice versa. They have since been found to perform very well on natural language processing tasks in general.
Machine translation is part of the broader field of Natural Language Processing (NLP). Transformers were found to beat many NLP benchmarks for translation. They are also the first machine learning architecture capable of generating long, coherent texts that make sense from start to finish, which makes their output comparable to human writing. In their most advanced production applications, transformers are used to automatically write computer code, e.g., GitHub Copilot.
Another example is DALL-E 2, which generates realistic images from text prompts.
Another application is in computational biology, where transformers have been used to tackle the long-standing problem of protein structure prediction. Transformers are also a promising technology in genomics; an example is DNABERT, which has achieved state-of-the-art results on genome analysis tasks.
Transformers have also allowed us to apply transfer learning to NLP. Transfer learning is a method that takes the weights of a model trained on one task and reuses them for a new task to get a better result. Previously, transfer learning was mainly applied in computer vision, where models were pre-trained on supervised data such as ImageNet. With the advent of transformers, we can apply transfer learning to NLP without even needing supervised data for pre-training.
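To make this concrete, here is a minimal sketch of what transfer learning in NLP can look like using the Hugging Face `transformers` library (one common tool for this; the library, model name, and example data below are illustrative choices, not something prescribed above):

```python
# A minimal transfer-learning sketch with the Hugging Face `transformers` library.
# The model name and tiny dataset are purely illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Start from weights pre-trained on large amounts of unlabeled text.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # pre-trained transformer encoder
    num_labels=2,         # a fresh classification head for the new task
)

# Fine-tune on a small labeled dataset for the downstream task.
batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
loss = model(**batch, labels=labels).loss
loss.backward()  # gradients also flow into the pre-trained weights
```

The key point is that the expensive pre-training is done once on unlabeled text, and only a small labeled dataset is needed to adapt the model to the new task.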
A transformer is a deep learning model that relies on the self-attention mechanism.
Transformers are neural networks built from transformer blocks, in the same way that Convolutional Neural Networks (CNNs) are built from convolutional blocks and recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks are built from LSTM blocks. To understand the transformer block, we need to understand the attention mechanism, which the transformer block is based on.
Like RNNs, transformers are designed to handle sequential data. However, unlike RNNs, which sequentially compute a hidden state for every time step of the input sequence, transformers do not possess a recurrent structure and can process all time steps at once. This means that, when provided with enough training data, the attention mechanism can perform better than an RNN.
From the figure above, we can see that a hidden state is computed for each time step of the input sequence.
A typical example of the importance of the attention mechanism is in language translation, where context is essential in assigning the meaning of a given word in a sentence. For example, in a Spanish-to-English translation system, the first word of the English output is highly dependent on the first few words of the Spanish input. This is not the case in a classical LSTM model: to produce the first word of the English output, the model receives only the state vector obtained after it has processed the last Spanish word. In practice, information from early in the input is poorly preserved in this single vector, and this is where the attention mechanism comes in. With attention, the decoder is allowed access to the state vectors of all the Spanish input words, not just the last one, and it learns attention weights that determine how much attention to pay to each Spanish state vector.
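As a rough sketch of the idea (not the exact implementation of any particular translation system), the snippet below computes softmax-normalized attention weights for one decoder query over a set of encoder state vectors, using the scaled dot-product scoring that transformers use. All names and sizes are illustrative:

```python
import numpy as np

def attention_weights(query, keys):
    """Softmax-normalized attention weights of one query over all keys."""
    d_k = keys.shape[-1]
    scores = keys @ query / np.sqrt(d_k)   # one score per encoder state
    scores -= scores.max()                 # subtract max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

# Illustrative example: 5 encoder state vectors (one per Spanish word),
# each of dimension 8, and one decoder query vector.
encoder_states = np.random.randn(5, 8)
decoder_query = np.random.randn(8)

weights = attention_weights(decoder_query, encoder_states)  # shape (5,), sums to 1
context = weights @ encoder_states                          # weighted sum of states, shape (8,)
print(weights.round(3), weights.sum())
```

The decoder then uses this weighted combination of all the input states, instead of relying on the last state vector alone.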
The transformer has an encoder part and a decoder part. More precisely, the original architecture has a stack of six encoders and a stack of six decoders. The encoders are on the left side, and the decoders are on the right-hand side. This structure is illustrated in the figure below:
The encoder of a transformer consists of a self-attention layer (multi-head attention), which pays attention to the sentence passed to it, and a feed-forward neural network layer. The decoder, on the other hand, consists of two attention layers, a masked self-attention layer and an encoder-decoder attention layer, followed by a feed-forward neural network layer. This is illustrated in the figure below:
The parallelization comes from how the data is fed into the network. All the words of a sentence are fed into the network (specifically the encoder) at the same time. In the first step, inside the self-attention layer, every word of the sentence is compared to every other word, so that there is communication between the words. In the second step, the feed-forward neural network, the words are passed through a feed-forward network separately, so they do not exchange any information. However, the feed-forward network that each word passes through is the same within a given layer.
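A simplified PyTorch sketch of these two steps is shown below (this is an illustration of the idea, not the original implementation; the dimensions are the defaults from the original paper, but the residual and normalization details are kept minimal):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """A simplified transformer encoder block: self-attention + shared FFN."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # The same feed-forward network is applied to every position separately.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        # Step 1: every word attends to every other word (information exchange).
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Step 2: each position goes through the same FFN independently.
        x = self.norm2(x + self.ffn(x))
        return x

# All the words of a sentence are fed in at the same time.
words = torch.randn(1, 6, 512)                 # 1 sentence, 6 words, 512-dim vectors
out = EncoderBlock()(words)                    # same shape: (1, 6, 512)
```

Because no position has to wait for the previous one, the whole sentence can be processed in parallel.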
All the inputs that go into either the encoder or the decoder are embedded. Embedding here means that each input token is converted into a vector of numbers that the model learns during training. The figure below shows the input and output of a transformer model.
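As a small illustrative sketch of this embedding step (the toy vocabulary and sizes below are made up for the example), each word is mapped to an integer ID and then looked up in a learned embedding table:

```python
import torch
import torch.nn as nn

# A toy vocabulary; a real model would use a much larger, learned tokenizer vocabulary.
vocab = {"<pad>": 0, "the": 1, "transformer": 2, "is": 3, "powerful": 4}

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=512)

sentence = ["the", "transformer", "is", "powerful"]
token_ids = torch.tensor([[vocab[w] for w in sentence]])  # shape (1, 4)

vectors = embedding(token_ids)  # shape (1, 4, 512): one learned vector per word
```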
As seen in the figure above, positional encodings are added because a transformer does not have any recurrence, meaning the model has no way of knowing which word comes first or last, or where in the sentence a word appears. Therefore, positional encoding is added so that, for every word fed to the model, there is information telling the model where that word occurs in the given sentence. For the output, a linear layer and a softmax are added at the end of the decoders so the output can be transformed into something meaningful to a human. This is expressed in the equation given below:
Where the matrices