Key Concepts of Transformers

Much of the difficulty in understanding transformers comes from confusion around secondary concepts. To avoid this, we will go through the fundamental concepts one by one and then build a holistic view of transformers.

With Recurrent Neural Networks (RNNs), we processed sequences one element at a time to preserve the order of the sentence. To satisfy that design, each RNN component (layer) needs the previous (hidden) output. As a result, stacked LSTM computations were performed sequentially.
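To make the sequential dependency concrete, here is a minimal sketch of a toy recurrent step. It is not an LSTM and not any library's implementation: the scalar weights `w_x` and `w_h`, the function name `rnn_forward`, and the numeric inputs are all hypothetical, chosen only to show that each hidden state needs the previous one.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, h0=0.0):
    """Run a toy scalar RNN over a sequence and return every hidden state.

    w_x, w_h and h0 are illustrative values, not trained parameters.
    """
    h = h0
    states = []
    for x in inputs:  # one step per element, strictly in order
        # The new state depends on the previous state h, so step t
        # cannot be computed before step t-1 has finished.
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

sequence = [1.0, -2.0, 0.5, 3.0]  # hypothetical inputs, one number per token
states = rnn_forward(sequence)
reversed_states = rnn_forward(sequence[::-1])

# Same elements, different order: the final hidden states differ.
print(states[-1], reversed_states[-1])
```

Because the loop carries `h` from one iteration to the next, the steps cannot run in parallel. The same carried state is also what makes the result depend on the order of the inputs.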

Then, transformers came out.

The fundamental building block of a transformer is self-attention. To begin, we need to get rid of sequential processing, recurrence, and LSTMs. We can do that by simply changing the input representation.

Representing the input sentence

Sets and tokenization

The transformer revolution started with a simple question:

Why don’t we feed the entire input sequence so there are no dependencies between hidden states? That might be cool!

As an example, take the sentence “Hello I love you”.
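A minimal sketch of how this sentence could be turned into tokens and handed over all at once is shown here. It assumes plain whitespace splitting and a tiny vocabulary built from the sentence itself. Both are simplifications for illustration, since practical tokenizers usually work on subword units and use a fixed, pre-built vocabulary.

```python
sentence = "Hello I love you"

# Assumption: split on whitespace, one token per word.
tokens = sentence.split()

# Hypothetical toy vocabulary: map each distinct token to an integer id.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# The whole sentence becomes one list of ids that can be passed in together,
# instead of being consumed one element at a time.
token_ids = [vocab[tok] for tok in tokens]

print(tokens)
print(token_ids)
```

Nothing in this list is computed from a previous hidden state, so every token is available at the same time.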
