In natural language processing, word embeddings are a type of word representation that maps words into continuous vector spaces where semantically similar words are located closer together. This transformation into numerical vectors facilitates the processing of natural language by machine learning models and enhances the performance of various NLP tasks.
Word2Vec is a popular word embedding technique developed by Google. It consists of two main models: continuous bag of words (CBOW) and skip-gram. Both use shallow neural networks to learn word representations, either by predicting a target word from its context (CBOW) or by predicting context words from a target word (skip-gram).
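Before training our own embeddings, it can help to see Word2Vec in action. The short sketch below uses gensim's downloader API to load the pretrained word2vec-google-news-300 vectors (a large one-time download; the model and the words queried here are illustrative choices, not part of the lesson's later code):

import gensim.downloader as api

# Illustrative only: load Google's pretrained 300-dimensional Word2Vec vectors
# trained on the Google News corpus (a large one-time download).
vectors = api.load("word2vec-google-news-300")

# Semantically related words end up closer together in the vector space
print(vectors.similarity("ship", "boat"))      # relatively high cosine similarity
print(vectors.similarity("ship", "banana"))    # relatively low cosine similarity
print(vectors.most_similar("whale", topn=3))   # nearest neighbors of "whale"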
Let’s understand the theoretical foundations of CBOW and skip-gram:
CBOW seeks to predict the target word (the center word) from the nearby words (the surrounding context) within a preselected window size. Because the model infers a word from the words around it, it suits tasks where a word's meaning is best understood in relation to its context.
The following diagram shows a CBOW (continuous bag of words) architecture, which predicts a central word based on the surrounding words in a sentence.
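To make this concrete, the small illustrative snippet below (the sentence and window size are arbitrary examples, not part of the lesson's code) builds the (context, target) pairs a CBOW model learns from:

# Illustrative only: (context -> target) training pairs as seen by CBOW
sentence = ["the", "whale", "swims", "in", "the", "sea"]
window = 2

for i, target in enumerate(sentence):
    # Collect the words within the window on both sides of the target
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    print(f"context {context} -> target '{target}'")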
On the other hand, skip-gram operates in the opposite direction: it predicts the context words given the central word. This means that, given a central word, skip-gram anticipates which words are likely to surround it. Because of this approach, skip-gram is particularly adept at capturing the semantics of rare and infrequent words, though it is typically slower to train than CBOW.
The following diagram shows a skip-gram architecture, which predicts surrounding words based on a central target word in a sentence.
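Reusing the toy sentence from above, the sketch below shows how skip-gram reverses the direction of prediction, emitting one (target, context) pair for every neighboring word:

# Illustrative only: (target -> context) training pairs as seen by skip-gram
sentence = ["the", "whale", "swims", "in", "the", "sea"]
window = 2

for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            print(f"target '{target}' -> context '{sentence[j]}'")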
The following code demonstrates the process of creating word embedding models using gensim to analyze the semantic relationships between words in the novel (moby.txt). It reads the text from the file, cleans it, tokenizes it into sentences and words, and converts the words to lowercase. Two word embedding models are created: continuous bag of words (CBOW) and skip-gram, both using a vector size of 200 and a context window of 7 words. The code then prints the cosine similarities between the words “whale” and “ship” and between “whale” and “sea” for each model, highlighting the different ways these models capture word relationships.
# Import all necessary modules
import gensim
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings

warnings.filterwarnings(action='ignore')

txt_file_path = "moby.txt"
# Reads the text file
with open(txt_file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Replaces escape characters with space
cleaned_text = text.replace("\n", " ")

data = []

# Iterate through each sentence in the text
for sentence in sent_tokenize(cleaned_text):
    temp = []

    # Tokenize the sentence into words
    for word in word_tokenize(sentence):
        temp.append(word.lower())

    data.append(temp)

# Create the CBOW model
cbow_model = gensim.models.Word2Vec(data, min_count=1, vector_size=200, window=7)

print("Continuous Bag of Words (CBOW)")

# Print results
print("Cosine similarity between 'whale' and 'ship' : ", cbow_model.wv.similarity('whale', 'ship'))
print("Cosine similarity between 'whale' and 'sea' : ", cbow_model.wv.similarity('whale', 'sea'))

# Create the skip-gram model
skipGram_model = gensim.models.Word2Vec(data, min_count=1, vector_size=200, window=7, sg=1)

print("\nSkip Gram")

# Print results
print("Cosine similarity between 'whale' and 'ship' : ", skipGram_model.wv.similarity('whale', 'ship'))
print("Cosine similarity between 'whale' and 'sea' : ", skipGram_model.wv.similarity('whale', 'sea'))
Here’s the explanation of the above code implementation:
Lines 2–5: These lines import the required modules for the Word2Vec implementation:
gensim: This is a library for topic modeling, document indexing, and similarity retrieval with large corpora.
Word2Vec from gensim.models: This is the Word2Vec model for training and working with word embeddings.
sent_tokenize and word_tokenize from nltk.tokenize: These are functions for tokenizing text into sentences and words, respectively.
warnings: This is a Python standard library module to handle warnings.
Line 7: This line suppresses warnings that might occur during the execution of the code. It’s often used to ignore unnecessary warning messages.
Lines 9–12: Here, the code specifies the path to the text file (moby.txt) containing the corpus for training the Word2Vec model. It then reads the contents of the file into the variable text.
Line 15: This line removes escape characters (like newline \n) from the text and replaces them with spaces. It ensures that the text is clean and ready for tokenization.
Lines 17–27: These lines tokenize the cleaned text into sentences using sent_tokenize, and then tokenize each sentence into words using word_tokenize. It also converts each word to lowercase to ensure consistency.
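Note: sent_tokenize and word_tokenize rely on NLTK's Punkt tokenizer models, which are downloaded separately from the library itself. In a fresh environment, a one-time download along these lines is usually needed (a minimal sketch, assuming internet access):

import nltk

# One-time download of the Punkt models used by sent_tokenize and word_tokenize
nltk.download('punkt')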
Line 30: This line creates a continuous bag of words (CBOW) model using the Word2Vec class from gensim. It specifies parameters such as min_count (minimum frequency of a word), vector_size (dimensionality of the word vectors), and window (size of the context window).
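After training, the learned vectors live in the model's wv attribute (a KeyedVectors object). As an optional sanity check, assuming the cbow_model created above, you can inspect the vocabulary size and a word's vector:

# Optional inspection of the trained CBOW model (uses cbow_model from the code above)
print(len(cbow_model.wv))             # number of words in the learned vocabulary
print(cbow_model.wv['whale'].shape)   # each word maps to a 200-dimensional vector
print(cbow_model.wv['whale'][:5])     # first five components of the 'whale' vector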
Lines 32–36: These lines print the results of the CBOW model, specifically the cosine similarity between selected word pairs (whale and ship, and whale and sea) using the similarity method.
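The similarity method returns the cosine of the angle between the two word vectors. As a small sketch (assuming the cbow_model above and that NumPy is installed), the same number can be computed by hand:

import numpy as np

# Recompute the similarity between 'whale' and 'ship' by hand
# (uses cbow_model from the code above)
v1 = cbow_model.wv['whale']
v2 = cbow_model.wv['ship']
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(cosine)                                       # manual cosine similarity
print(cbow_model.wv.similarity('whale', 'ship'))    # gensim's built-in method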
Line 39: Similar to CBOW, this line creates a skip-gram model by setting sg=1 in the parameters of the Word2Vec class; sg=1 tells gensim to train a skip-gram model instead of the default CBOW.
Lines 41–45: These lines print the results of the skip-gram model, showing the cosine similarity between the same selected word pairs as in the CBOW model.
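To see these differences more directly, an optional follow-up (assuming the cbow_model and skipGram_model trained above) is to compare each model's nearest neighbors for the same word using most_similar:

# Compare each model's nearest neighbors for the same word
# (uses cbow_model and skipGram_model from the code above)
print("CBOW neighbors of 'whale':", cbow_model.wv.most_similar('whale', topn=5))
print("Skip-gram neighbors of 'whale':", skipGram_model.wv.most_similar('whale', topn=5))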
We took a close look at Word2Vec and its usage with the gensim library, exploring how it works. By training Word2Vec models on textual data, we can identify semantic similarities between words and support NLP tasks like sentiment analysis and text classification. As shown, Word2Vec provides a pathway to processing natural language at a deeper level, allowing us to uncover the hidden patterns embedded in unstructured text data.
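For instance, one simple way to feed these embeddings into tasks such as sentiment analysis or text classification is to average a document's word vectors into a single feature vector. The following is a rough sketch, assuming the skipGram_model trained earlier, not a prescribed pipeline:

import numpy as np

def document_vector(model, tokens):
    # Average the vectors of the tokens present in the model's vocabulary;
    # a simple illustrative baseline for building document-level features.
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

# A 200-dimensional feature vector that could feed a downstream classifier
features = document_vector(skipGram_model, ["the", "whale", "swims", "in", "the", "sea"])
print(features.shape)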