CBOW and skip-gram are the two models of the Word2Vec framework used in natural language processing. Word2Vec is a neural network approach for learning word embeddings. Before diving into what embeddings are, I have a question for you: how do we make machines understand text?
The core idea behind word embeddings is to convert text into numerical data (a vector space) and capture the semantic as well as syntactic meaning of words and their relationships with other words in a corpus.
In neural network models like CBOW and skip-gram, the input layer is fed with one-hot encoded representations of the words; the word embeddings themselves are the weights the network learns between the input and hidden layers.
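As a quick illustration, here is a minimal sketch of one-hot encoding in plain Python. The toy sentence and the helper name `one_hot` are just illustrative choices, not part of any library:

```python
# Toy sentence for illustration; a real corpus would have a much larger vocabulary.
sentence = ["i", "eat", "pizza", "on", "friday"]
vocab = sorted(set(sentence))                       # ['eat', 'friday', 'i', 'on', 'pizza']
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of vocabulary length with a 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("pizza"))                             # [0, 0, 0, 0, 1]
```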
The Continuous Bag of Words (CBOW) is a Word2Vec model that predicts a target word based on the surrounding context words. It takes a fixed-size context window of words and tries to predict the target word in the middle of the window. The model learns by maximizing the probability of predicting the target word correctly given the context words.
Let’s take a look at a simple example.
Let's say we have the sentence, "I eat pizza on Friday". First, we will tokenize the sentence: ["I", "eat", "pizza", "on", "Friday"]. Now, let's create the training examples for this sentence for the CBOW model, considering a context window of one word on each side of the target (two context words in total); a short code sketch after the examples shows how these pairs can be generated.
Training example 1: Input: ["I", "pizza"], Target: "eat".
Training example 2: Input: ["eat", "on"], Target: "pizza".
Training example 3: Input: ["pizza", "Friday"], Target: "on".
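Here is a minimal sketch of how such (context, target) pairs could be generated in plain Python. The helper name `make_cbow_pairs` and the `half_window` parameter are my own choices for illustration:

```python
def make_cbow_pairs(tokens, half_window=1):
    """Return (context_words, target_word) pairs for CBOW training.

    half_window is the number of words taken on each side of the target.
    Positions without a full context on both sides are skipped for simplicity.
    """
    pairs = []
    for i in range(half_window, len(tokens) - half_window):
        context = tokens[i - half_window:i] + tokens[i + 1:i + 1 + half_window]
        pairs.append((context, tokens[i]))
    return pairs

tokens = ["I", "eat", "pizza", "on", "Friday"]
for context, target in make_cbow_pairs(tokens):
    print(context, "->", target)
# ['I', 'pizza'] -> eat
# ['eat', 'on'] -> pizza
# ['pizza', 'Friday'] -> on
```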
In CBOW, there are typically three main layers involved: the input layer, the hidden layer, and the output layer.
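To make these layers concrete, here is a minimal CBOW model sketch using PyTorch. The library choice, the class name `CBOW`, and the layer sizes are my own assumptions for illustration, not the original Word2Vec implementation:

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Toy CBOW: average the context embeddings, then score every vocabulary word."""

    def __init__(self, vocab_size, embedding_dim=100):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # input -> hidden weights
        self.output = nn.Linear(embedding_dim, vocab_size)         # hidden -> output weights

    def forward(self, context_indices):
        # context_indices: (batch, context_size) tensor of word indices
        embedded = self.embeddings(context_indices)   # (batch, context_size, embedding_dim)
        hidden = embedded.mean(dim=1)                 # average the context vectors
        return self.output(hidden)                    # unnormalized scores over the vocabulary

# Example: one context ["I", "pizza"] encoded as toy word indices 2 and 4.
model = CBOW(vocab_size=5)
scores = model(torch.tensor([[2, 4]]))
print(scores.shape)  # torch.Size([1, 5])
```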
Skip-gram is another neural network architecture used in Word2Vec that predicts the context of a word, given a target word. The input to the skip-gram model is a target word, while the output is a set of context words. The goal of the skip-gram model is to learn the probability distribution of the context words, given the target word.
During training, the skip-gram model is fed with a set of target words and their corresponding context words. The model learns to adjust the weights of the hidden layer to maximize the probability of predicting the correct context words, given the target word.
Let’s take a look at the same example discussed above.
The sentence was, "I eat pizza on Friday". First, we will tokenize the sentence: ["I", "eat", "pizza", "on", "Friday"]. Now, let's create the training examples for this sentence for the skip-gram model, considering the same window of one word on each side of the target (a code sketch for generating these pairs follows the examples).
Training example 1: Input: "eat", Target: ["I", "pizza"].
Training example 2: Input: "pizza", Target: ["eat", "on"].
Training example 3: Input: "on", Target: ["pizza", "Friday"].
Training example 4: Input: "Friday", Target: ["on"].
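Here is a matching sketch for generating skip-gram (target, context) pairs; the helper name `make_skipgram_pairs` is again illustrative. Note that it also treats the edge words "I" and "Friday" as targets, each with fewer context words:

```python
def make_skipgram_pairs(tokens, half_window=1):
    """Return (target_word, context_words) pairs for skip-gram training.

    Every position is used as a target; edge words simply get fewer context words.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        pairs.append((target, left + right))
    return pairs

tokens = ["I", "eat", "pizza", "on", "Friday"]
for target, context in make_skipgram_pairs(tokens):
    print(target, "->", context)
# I -> ['eat']
# eat -> ['I', 'pizza']
# pizza -> ['eat', 'on']
# on -> ['pizza', 'Friday']
# Friday -> ['on']
```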
Skip-gram and CBOW both aim to learn word embeddings that capture semantic relationships between words. Both are shallow neural network models with an input layer, a hidden layer (whose weights become the word embeddings), and an output layer; they have the same number of parameters, but skip-gram training tends to be more expensive because each target word generates multiple (target, context) training pairs.
Here's a table summarizing the differences between Skip-gram and CBOW.
| | Skip-gram | CBOW |
|---|---|---|
| Architecture | Predicts context words given a target word | Predicts a target word given context words |
| Context Size | Handles large context windows well (e.g., 5-20 words) | Typically used with smaller context windows (e.g., 2-5 words) |
| Training | Slower training, since each target word yields multiple context-word predictions | Faster training, since each context window yields a single target prediction |
| Performance | Performs well with rare words and captures word diversity | Performs well with frequent words and captures word similarity |
| Word Vectors | Dense word vectors, typically 100-300 dimensions | Dense word vectors, typically 100-300 dimensions |
| Model Size | Same parameter count as CBOW; the extra cost comes from more training pairs | Same parameter count as skip-gram; fewer training pairs per sentence |
Note: The choice between skip-gram and CBOW depends on the specific task and dataset. Skip-gram is generally preferred for large datasets and when rare words matter, since it captures word diversity well, while CBOW trains faster and works well with frequent words.
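In practice, this choice often comes down to a single flag. Here is a minimal sketch using the gensim library (assuming gensim 4.x is installed; the toy corpus is purely illustrative), where `sg=0` selects CBOW and `sg=1` selects skip-gram:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (illustrative only).
sentences = [["I", "eat", "pizza", "on", "Friday"],
             ["I", "eat", "pasta", "on", "Monday"]]

# sg=0 -> CBOW, sg=1 -> skip-gram; window counts words on each side of the target.
cbow_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

print(cbow_model.wv["pizza"].shape)                  # (100,)
print(skipgram_model.wv.most_similar("pizza", topn=2))
```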
In this answer, we discussed the CBOW and skip-gram models of the Word2Vec framework. The two models offer different approaches to learning word embeddings, trading off training efficiency, semantic capture, and the ability to handle different dataset characteristics. Therefore, choosing between them requires careful consideration of the specific requirements of your task.
Pop question
What is the main difference between CBOW and Skip-gram?
CBOW predicts a target word given context words.
Skip-gram predicts a target word given context words.
CBOW has a larger context window compared to skip-gram.
Skip-gram is faster in training compared to CBOW.