The Skip-Gram Algorithm

Learn about the skip-gram Word2vec algorithm.

The first algorithm we’ll talk about is known as the skip-gram algorithm, a type of Word2vec algorithm. As we have discussed in numerous places, the meaning of a word can be elicited from the contextual words surrounding it. However, it isn’t entirely straightforward to develop a model that exploits this way of learning word meanings. The skip-gram algorithm, introduced by Mikolov et al. in 2013, exploits the context in which words appear in written text to learn good word embeddings.

Let’s go through the skip-gram algorithm step by step. First, we’ll discuss the data preparation process. Understanding the format of the data puts us in a great position to understand the algorithm. We’ll then discuss the algorithm itself. Finally, we’ll implement the algorithm using TensorFlow.

From raw text to semistructured text

First, we need to design a mechanism to extract a dataset that can be fed to our learning model. Such a dataset should be a set of tuples of the format (target, context). Moreover, this needs to be created in an unsupervised manner. That is, a human should not have to manually engineer the labels for the data. In summary, the data preparation process should do the following:

  • Capture the surrounding words of a given word (that is, the context).
  • Run in an unsupervised manner.
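
To make this target format concrete, consider a toy sentence and a context window of size $m = 1$. The sentence and variable names below are illustrative assumptions, not part of the algorithm itself:

```python
# Toy sentence, already tokenized.
tokens = ["the", "dog", "barked", "at", "the", "mailman"]

# With a context window size of m = 1, the target word "barked" has the
# context words "dog" and "at", so it contributes these (target, context) tuples:
expected_pairs = [("barked", "dog"), ("barked", "at")]
```

Every word in the text takes its turn as the target word, which is exactly what the following approach formalizes.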

The skip-gram model uses the following approach to design a dataset:

  • For a given word $w_i$, a context window size of $m$ is assumed. By “context window size,” we mean the number of words considered as context on a single side. Therefore, for $w_i$, the context window (including the target word $w_i$) will be of size $2m+1$ and will look like this: $[w_{i-m}, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_{i+m}]$.

  • Next, (target, context) tuples are formed for every word position $i$ in the text: $[\ldots, (w_i, w_{i-m}), \ldots, (w_i, w_{i-1}), (w_i, w_{i+1}), \ldots, (w_i, w_{i+m}), \ldots]$.
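
Putting these steps together, here is a minimal sketch of the pair-generation step in Python. The function name `generate_skip_gram_pairs` and the toy sentence are assumptions made for illustration, not part of the original algorithm description:

```python
def generate_skip_gram_pairs(tokens, window_size):
    """Generate (target, context) tuples from a list of tokens.

    For each position i, every word within `window_size` positions to the
    left or right of tokens[i] becomes a context word for tokens[i].
    """
    pairs = []
    for i, target in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # skip the target word itself
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["the", "dog", "barked", "at", "the", "mailman"]
print(generate_skip_gram_pairs(tokens, window_size=1))
# [('the', 'dog'), ('dog', 'the'), ('dog', 'barked'),
#  ('barked', 'dog'), ('barked', 'at'), ('at', 'barked'),
#  ('at', 'the'), ('the', 'at'), ('the', 'mailman'),
#  ('mailman', 'the')]
```

In practice, words are first mapped to integer IDs so that the resulting tuples can be fed directly to an embedding layer. Depending on your TensorFlow version, the Keras utility tf.keras.preprocessing.sequence.skipgrams provides a similar routine that additionally draws negative samples.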