...

Key Concepts of Transformers

Represent text with positional encodings and embeddings so it can be passed into a transformer.

Most of the difficulty in fully understanding transformers arises from confusion around secondary concepts. To avoid this, we will gradually discuss all the fundamental concepts and then construct a holistic view of transformers.

With Recurrent Neural Networks (RNNs), we used to process sequences sequentially to keep the order of the sentence intact. To satisfy that design, each RNN component (layer) needs the previous (hidden) output. As a result, stacked LSTM computations were performed sequentially.
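To make that sequential bottleneck concrete, here is a minimal NumPy sketch of a vanilla recurrent cell (the shapes and random weights are made up for illustration): each hidden state depends on the previous one, so the loop over time steps cannot be parallelized.

```python
import numpy as np

d_in, d_hidden = 4, 8
W_x = np.random.randn(d_hidden, d_in) * 0.1      # input-to-hidden weights
W_h = np.random.randn(d_hidden, d_hidden) * 0.1  # hidden-to-hidden weights

sequence = [np.random.randn(d_in) for _ in range(5)]  # 5 input vectors
h = np.zeros(d_hidden)                                # initial hidden state

for x_t in sequence:
    # h_t depends on the *previous* hidden state: this is the sequential dependency
    h = np.tanh(W_x @ x_t + W_h @ h)

print(h.shape)  # (8,)
```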

Then, transformers came out.

The fundamental building block of a transformer is self-attention. To begin, we need to get rid of sequential processing, recurrence, and LSTMs. We can do that by simply changing the input representation.

Representing the input sentence

Sets and tokenization

The transformer revolution started with a simple question:

Why don’t we feed the entire input sequence so there are no dependencies between hidden states? That might be cool!

Take, as an example, the sentence “Hello I love you”:

This processing step is usually called tokenization, and it is the first of three steps we need to perform before we feed the input into the model.
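As a minimal sketch, tokenization could be as simple as whitespace splitting plus a lookup into a vocabulary (the toy vocabulary built on the fly below is an assumption for illustration):

```python
sentence = "Hello I love you"
tokens = sentence.split()              # ['Hello', 'I', 'love', 'you']

# Toy vocabulary: map each distinct token to an integer ID
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]

print(tokens)     # ['Hello', 'I', 'love', 'you']
print(token_ids)  # integer IDs, e.g. [0, 1, 2, 3] depending on vocabulary order
```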

So instead of a sequence of elements, we now have a set.

Sets are a collection of distinct elements where the arrangement of the elements in the set does not matter.

In other words, the order is irrelevant. We denote the input set as $X = [x_1, x_2, x_3, \dots, x_N]$, where $X \in R^{N \times d_{in}}$ ...
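As a rough sketch of how such a set can be built (the vocabulary size, embedding dimension, and token IDs below are hypothetical), each token ID is mapped to a d_in-dimensional vector via an embedding lookup, and stacking the N vectors gives the matrix X of shape N × d_in:

```python
import numpy as np

vocab_size, d_in = 10, 6
embedding_table = np.random.randn(vocab_size, d_in)  # one row per vocabulary entry

token_ids = [0, 1, 2, 3]        # "Hello I love you" as IDs (assumed)
X = embedding_table[token_ids]  # stack N embedding vectors into a matrix

print(X.shape)  # (4, 6), i.e. (N, d_in)
```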