Self-Attention Mechanism

Let's explore a mechanism that's especially important when working with sequences: the self-attention mechanism.

Attention is a fundamental concept in deep learning, which we have so far described using query and value vectors. Now, we'll introduce a third vector known as the key vector.

Understanding the self-attention mechanism

As we learn about the self-attention mechanism, the following terminology is fundamental for grasping how it works (a short sketch follows the list):

  • Query (Q): This is what we're looking for or trying to match with.

  • Key (K): This is what we use to identify or locate the specific thing we're interested in.

  • Value (V): This is the actual content or information we obtain when we successfully match the query with the key.

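To make these roles concrete, here is a minimal sketch of attention as a "soft" dictionary lookup, written in plain NumPy with invented vectors and sizes: the query is scored against every key, and those scores decide how much of each value flows into the result.

```python
import numpy as np

# Toy "soft dictionary lookup": the query is matched against every key,
# and the resulting weights blend the values. All numbers are made up.
np.random.seed(0)
d = 4                                          # embedding size (arbitrary)
keys = np.random.randn(3, d)                   # what we match against
values = np.random.randn(3, d)                 # the content we retrieve
query = keys[1] + 0.1 * np.random.randn(d)     # a query close to the second key

scores = keys @ query                                # one matching score per key
weights = np.exp(scores) / np.exp(scores).sum()      # softmax: weights sum to 1
result = weights @ values                            # a blend dominated by values[1]

print(np.round(weights, 2))                    # the second weight is the largest
```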
The query represents the vector we want to associate with all input values. In our earlier example, we used the decoder token as the query and the encoder tokens as the values. In our current example, we're focusing on how a specific word, such as "it," relates to the other words in the sentence.

Ideally, the most significant connection should be with a word like "robot" because "it" refers to "robot." Queries and values are two different ways of looking at the same thing: the values are vector representations of the related tokens, and we measure the similarity between the query and each of those values.
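As a small illustration, here is a sketch with hand-picked, made-up 2-D embeddings for the fragment "the robot picked it up": the query for "it" is scored against every value with a dot product, and a softmax turns the scores into weights.

```python
import numpy as np

# Made-up 2-D embeddings, chosen by hand so that "it" and "robot" point in
# similar directions; a trained model would learn such vectors on its own.
tokens = ["the", "robot", "picked", "it", "up"]
values = np.array([
    [0.1,  0.2],   # "the"
    [1.0,  0.9],   # "robot"
    [0.3, -0.4],   # "picked"
    [0.9,  1.0],   # "it"
    [-0.2, 0.1],   # "up"
])
query = values[tokens.index("it")]               # the query for the word "it"

scores = values @ query                          # similarity to every token
weights = np.exp(scores) / np.exp(scores).sum()  # attention weights (sum to 1)

for tok, w in zip(tokens, weights):
    print(f"{tok:>6s}: {w:.2f}")                 # apart from "it", "robot" dominates
```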

Now, let's delve deeper into the mechanics of self-attention and explore how these concepts interact to create a dynamic and flexible attention mechanism.

The role of keys in the self-attention mechanism

In our previous example, with word embeddings or a recurrent model, we could simply calculate the dot product between the decoder query and all encoder embeddings; that was all we needed. Now we set up the keys, which are another way of representing, or projecting, the words. Each projection captures specific features or aspects of a word.

For instance, we might project words according to whether they behave as "nouns," "verbs," or "adjectives." Think of these projections as channels in a convolution, where each channel represents a different feature of the word.
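The sketch below shows what such a projection looks like in code, with an invented projection matrix `W_K` (in a real model it is learned, not drawn at random): multiplying the word embeddings by `W_K` produces one key vector per word, and each key dimension plays the role of one "channel" or feature.

```python
import numpy as np

# Keys as a learned projection of the word embeddings. The sizes and the
# matrix W_K are made up here; a real model learns W_K during training.
np.random.seed(0)
d_model, d_key = 8, 3                     # 3 "feature channels" in this toy example
embeddings = np.random.randn(5, d_model)  # one embedding per word in the sentence

W_K = np.random.randn(d_model, d_key)     # key projection matrix
keys = embeddings @ W_K                   # each key dimension is one projected feature,
                                          # loosely like a channel in a convolution

print(keys.shape)                         # (5, 3): 5 words, 3 key features per word
```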

Keys play a crucial role, especially in multihead attention, which we'll discuss later. In the self-attention mechanism, our goal is to represent each token in relation to all other tokens in a sentence. In self-attention, the queries, keys, and values are all derived from the same input; this is why it's called self-attention. Each is a different projection of that shared source, obtained through its own weight matrix, rather than an exact copy.
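Putting the pieces together, here is a minimal self-attention sketch in plain NumPy with made-up sizes: the same input `X` is projected through three different weight matrices to produce the queries, keys, and values, and each token is then re-expressed as a weighted mix of all the values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over a single sentence X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # three projections of the same input
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # every token scored against every token
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # each token becomes a mix of the values

# Toy usage: 5 tokens, model width 8, projection width 4 (all sizes invented).
np.random.seed(0)
X = np.random.randn(5, 8)
W_Q, W_K, W_V = (np.random.randn(8, 4) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)  # (5, 4)
```

The division by the square root of the key dimension is the usual scaling in scaled dot-product attention; it keeps the scores from growing with the key size before the softmax is applied.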
