The Skip-Gram Algorithm
Learn about the skip-gram Word2vec algorithm.
The first algorithm we’ll talk about is the skip-gram algorithm, a type of Word2vec algorithm. As we have discussed in numerous places, the meaning of a word can be elicited from the contextual words surrounding it. However, it isn’t entirely straightforward to develop a model that exploits this way of learning word meanings. The skip-gram algorithm, introduced by Mikolov et al. in 2013, exploits the context of words in written text to learn good word embeddings.
Let’s go through the skip-gram algorithm step by step. First, we’ll discuss the data preparation process. Understanding the format of the data puts us in a great position to understand the algorithm. We’ll then discuss the algorithm itself. Finally, we’ll implement the algorithm using TensorFlow.
From raw text to semistructured text
First, we need to design a mechanism to extract a dataset that can be fed to our learning model. Such a dataset should be a set of tuples of the format (target, context). Moreover, this needs to be created in an unsupervised manner. That is, a human should not have to manually engineer the labels for the data. In summary, the data preparation process should do the following:
- Capture the surrounding words of a given word (that is, the context).
- Run in an unsupervised manner.
The skip-gram model uses the following approach to design a dataset:
- For a given word $w_i$, a context window size of $m$ is assumed. By “context window size,” we mean the number of words considered as context on a single side. Therefore, for $w_i$, the context window (including the target word $w_i$) will be of size $2m+1$ and will look like this: $[w_{i-m}, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_{i+m}]$.
- Next, (target, context) tuples are formed by pairing the target word with each word in its context window: $(w_i, w_{i-m}), \ldots, (w_i, w_{i-1}), (w_i, w_{i+1}), \ldots, (w_i, w_{i+m})$. A code sketch of this pairing process follows the list.
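To make this concrete, here is a minimal pure-Python sketch of the data preparation step. The function name `generate_skipgram_pairs`, the `window_size` parameter, and the toy sentence are illustrative assumptions rather than part of the algorithm’s specification; TensorFlow also provides a `tf.keras.preprocessing.sequence.skipgrams` utility that produces similar pairs from sequences of word IDs.

```python
# Illustrative sketch of skip-gram data preparation.
# `generate_skipgram_pairs` and `window_size` are hypothetical names,
# chosen here for clarity rather than taken from the text.

def generate_skipgram_pairs(tokens, window_size=1):
    """Return (target, context) tuples for every word in `tokens`.

    For each position i, every word within `window_size` positions on
    either side of tokens[i] is paired with tokens[i] as its context.
    """
    pairs = []
    for i, target in enumerate(tokens):
        # Clip the context window at the sentence boundaries.
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # skip the target word itself
                pairs.append((target, tokens[j]))
    return pairs


sentence = "the dog barked at the mailman".split()
for target, context in generate_skipgram_pairs(sentence, window_size=1):
    print(target, "->", context)
```

For the sentence above with `window_size=1`, the word “barked” yields the pairs (barked, dog) and (barked, at), matching the windowing scheme described in the list. Note that no human-provided labels are involved: the pairs come entirely from the positions of the words in the raw text, which is what makes the process unsupervised.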