CBOW vs skip-gram

CBOW and skip-gram are the two models of the Word2Vec framework used in natural language processing. Word2Vec is a neural network model for learning word embeddings. Before diving into what embeddings are, here is a question for you: how do we make machines understand text?

The core idea behind word embeddings is to convert text into numerical data (a vector space) that captures the semantic and syntactic properties of words and their relationships with other words in a corpus.

In neural network models like CBOW and skip-gram, the input words are represented as one-hot encoded vectors, and the word embeddings are learned as the weights between the input layer and the hidden layer during training.
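For example, here is a minimal sketch (using a toy five-word vocabulary of my own choosing) of what one-hot input vectors look like:

```python
import numpy as np

# Toy vocabulary; in practice it is built from the whole training corpus.
vocab = ["I", "eat", "pizza", "on", "Friday"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot vector: all zeros except a 1 at the word's index."""
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1.0
    return vector

print(one_hot("pizza"))  # [0. 0. 1. 0. 0.]
```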

CBOW

The Continuous Bag of Words (CBOW) is a Word2Vec model that predicts a target word based on the surrounding context words. It takes a fixed-size window of context words and tries to predict the target word in the middle of the window. The model learns by maximizing the probability of predicting the target word correctly given the context words.

Let’s take a look at a simple example.

Example

Let's say we have the sentence, "I eat pizza on Friday". First, we will tokenize the sentence: ["I", "eat", "pizza", "on", "Friday"]. Now, let's create the training examples for this sentence for the CBOW model, considering a window size of 2 (two context words in total, one on each side of the target); a short code sketch for generating these pairs follows the list.

  • Training example 1: Input: ["I", "pizza"], Target: "eat".

  • Training example 2: Input: ["eat", "on"], Target: "pizza".

  • Training example 3: Input: ["pizza", "Friday"], Target: "on".
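Here is a small sketch of how these (context, target) pairs could be generated from the token list; the helper name is my own, and `window` counts context words on each side of the target, so `window=1` reproduces the pairs listed above:

```python
def cbow_pairs(tokens, window=1):
    """Build (context_words, target_word) pairs with `window` words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # Keep only targets with a full context, matching the examples above.
        if len(context) == 2 * window:
            pairs.append((context, target))
    return pairs

tokens = ["I", "eat", "pizza", "on", "Friday"]
print(cbow_pairs(tokens))
# [(['I', 'pizza'], 'eat'), (['eat', 'on'], 'pizza'), (['pizza', 'Friday'], 'on')]
```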

In CBOW, there are typically three main layers involved: the input layer, the hidden layer, and the output layer.

CBOW Architecture
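As a rough, NumPy-only illustration of these three layers (the matrix names, random initialization, and tiny sizes are arbitrary choices for this sketch, not a reference implementation):

```python
import numpy as np

vocab = ["I", "eat", "pizza", "on", "Friday"]
word_to_index = {w: i for i, w in enumerate(vocab)}
vocab_size, embedding_dim = len(vocab), 3

rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, embedding_dim))   # input -> hidden weights (the word embeddings)
W_out = rng.normal(size=(embedding_dim, vocab_size))  # hidden -> output weights

def cbow_forward(context_words):
    """Average the context embeddings (hidden layer) and score every vocabulary word."""
    hidden = np.mean([W_in[word_to_index[w]] for w in context_words], axis=0)
    scores = hidden @ W_out
    return np.exp(scores) / np.sum(np.exp(scores))    # softmax over the vocabulary

probs = cbow_forward(["I", "pizza"])
print(vocab[int(np.argmax(probs))])  # the (untrained) model's guess for the target word
```

Training would adjust W_in and W_out so that the true target word ("eat" in this case) receives the highest probability.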

Skip-gram

Skip-gram is another neural network architecture used in Word2Vec; it predicts the context words surrounding a given target word. The input to the skip-gram model is a target word, while the output is a set of context words. The goal of the skip-gram model is to learn the probability distribution of the context words, given the target word.

During training, the skip-gram model is fed with a set of target words and their corresponding context words. The model learns to adjust the weights of the hidden layer to maximize the probability of predicting the correct context words, given the target word.
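A minimal NumPy sketch of this idea is shown below; a single target word's embedding is used to score every vocabulary word as a possible context word (the weights and sizes are illustrative only):

```python
import numpy as np

vocab = ["I", "eat", "pizza", "on", "Friday"]
word_to_index = {w: i for i, w in enumerate(vocab)}
vocab_size, embedding_dim = len(vocab), 3

rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, embedding_dim))   # target-word embeddings
W_out = rng.normal(size=(embedding_dim, vocab_size))  # hidden -> output weights

def skipgram_forward(target_word):
    """Look up the target embedding and return a probability for each candidate context word."""
    hidden = W_in[word_to_index[target_word]]         # the hidden layer is just the embedding
    scores = hidden @ W_out
    return np.exp(scores) / np.sum(np.exp(scores))    # softmax over the vocabulary

probs = skipgram_forward("pizza")
# Training adjusts W_in and W_out so that the true context words ("eat", "on") get high probabilities.
```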

Let’s take a look at the same example discussed above.

Example

The sentence was, "I eat pizza on Friday". After tokenizing it as ["I", "eat", "pizza", "on", "Friday"], let's create the training examples for the skip-gram model, again considering a window size of 2 (one context word on each side of the target); a short sketch for generating these pairs follows the list.

  • Training example 1: Input: "eat", Target: ["I", "pizza"].

  • Training example 2: Input: "pizza", Target: ["eat", "on"].

  • Training example 3: Input: "on", Target: ["pizza", "Friday"].

  • Training example 4: Input: "Friday", Target: ["on"].
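Here is a small sketch for generating these (target, context) pairs; the helper name is my own, and note that this simple version also emits a pair for the first word, "I", which the list above does not show:

```python
def skipgram_pairs(tokens, window=1):
    """Build (target_word, context_words) pairs with `window` words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((target, context))
    return pairs

tokens = ["I", "eat", "pizza", "on", "Friday"]
print(skipgram_pairs(tokens))
# [('I', ['eat']), ('eat', ['I', 'pizza']), ('pizza', ['eat', 'on']),
#  ('on', ['pizza', 'Friday']), ('Friday', ['on'])]
```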

Skip-gram Architecture

CBOW vs Skip-gram

Skip-gram and CBOW both aim to learn word embeddings that capture semantic relationships between words. Both are shallow neural network models with an input layer, a hidden layer (whose weights hold the word embeddings), and an output layer, but skip-gram training tends to be more expensive because each target word is used to predict multiple context words.
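To make that cost difference concrete, here is a rough count of the predictions each model makes for the five-token example sentence, assuming one context word on each side (the counting itself is just an illustration):

```python
tokens = ["I", "eat", "pizza", "on", "Friday"]
window = 1  # context words on each side of the target

# CBOW: one prediction per target word that has a full context window.
cbow_predictions = sum(1 for i in range(len(tokens))
                       if i - window >= 0 and i + window < len(tokens))

# Skip-gram: one prediction per (target word, context word) pair.
skipgram_predictions = sum(len(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
                           for i in range(len(tokens)))

print(cbow_predictions, skipgram_predictions)  # 3 8
```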

Here's a table summarizing the differences between Skip-gram and CBOW.

|              | Skip-gram                                                   | CBOW                                                            |
|--------------|-------------------------------------------------------------|-----------------------------------------------------------------|
| Architecture | Predicts context words given a target word                  | Predicts a target word given context words                      |
| Context Size | Handles large context windows (e.g., 5-20 words)            | Handles smaller context windows (e.g., 2-5 words)               |
| Training     | Slower training time due to multiple target predictions     | Faster training time due to single target prediction            |
| Performance  | Performs well with rare words and captures word diversity   | Performs well with frequent words and captures word similarity  |
| Word Vector  | Dense word vectors with high dimensionality (100-300)       | Dense word vectors with high dimensionality (100-300)           |
| Model Size   | Larger model size due to more parameters                    | Smaller model size due to fewer parameters                      |
Note: The choice between skip-gram and CBOW depends on the specific task and dataset. Skip-gram is often preferred when rare words matter because it captures word diversity well, while CBOW's faster training makes it a convenient choice for large corpora dominated by frequent words.
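In practice, libraries such as Gensim expose this choice as a single flag. A hedged sketch, assuming Gensim 4.x and a tiny toy corpus (real training needs far more text):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [["I", "eat", "pizza", "on", "Friday"],
             ["I", "eat", "pasta", "on", "Monday"]]

# sg=0 selects CBOW (the default); sg=1 selects skip-gram.
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv["pizza"].shape)                     # (50,)
print(skipgram_model.wv.most_similar("pizza", topn=2))  # nearest neighbours in the toy space
```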

Conclusion

In this answer, we discussed the CBOW and skip-gram models of the Word2Vec framework. Both CBOW and skip-gram provide different approaches to word embeddings, offering a trade-off between training efficiency, semantic capture, and the ability to handle different dataset characteristics. Therefore, choosing the appropriate algorithm among them requires careful consideration of the specific requirements of your task.

Pop question

Q

What is the main difference between CBOW and Skip-gram?

A)

CBOW predicts a target word given context words, while skip-gram predicts context words given a target word.

B)

Skip-gram predicts a target word given context words.

C)

CBOW has a larger context window compared to skip-gram

D)

Skip-gram is faster in training compared to CBOW.
