How to find similarity between two words using NLP

In this shot, we are going to build an NLP engine that will show the similarity between two given words.

For this, we are going to use Gensim’s word2vec model. Gensim provides an optimized implementation of both of word2vec’s architectures: CBOW (continuous bag-of-words) and Skip-Gram.
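To give a feel for the two architectures, here is a minimal training sketch using the Gensim 4.x API. The toy corpus and parameter values below are invented purely for illustration; the sg flag is what switches between CBOW and Skip-Gram.

from gensim.models import Word2Vec

# A tiny, made-up corpus purely for illustration; real training needs far more text.
sentences = [
    ['banana', 'is', 'a', 'sweet', 'fruit'],
    ['mango', 'is', 'a', 'tropical', 'fruit'],
]

# sg=0 selects the CBOW architecture; sg=1 selects Skip-Gram.
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv['banana'].shape)  # (50,)

In this shot, though, we won’t train our own model; we will load vectors that were already trained on a massive corpus.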

Similarity between two words

Before moving on, you need to download the word2vec vectors.

The wget command below will fetch the vectors for you. Remember, the file size is about 1.5 GB.

We suggest you work on Google Colab for this, as the file size is very large.

Open your Google Colab and run the command below to get your word vectors.

!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

This command downloads the file directly onto Google’s servers, which saves a lot of time compared to downloading it to your machine and uploading it to Colab.

Now, let’s install the packages we require.

pip install gensim
pip install scikit-learn

You can run the above commands both in Google Colab and on your local machine (if you’re using one).

Let’s move on to the coding part by first importing the packages, as shown below.

from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity
print('Imported Successfully!')

We imported two packages. These packages will be used in the following way:

  • The gensim package will be used to load the word vectors that we downloaded.
  • KeyedVectors holds the mapping between words and their embeddings. Once a model is trained (or loaded), it can be used to directly query those embeddings in various ways.
  • We will use scikit-learn's cosine_similarity to measure how close two word vectors are. This metric is widely used for word embeddings and works well for many types of problems; a small sketch of what it computes follows this list.
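For intuition, cosine similarity measures the angle between two vectors rather than their magnitudes. Here is a minimal NumPy sketch of the same computation (cosine_sim is a hypothetical helper written for this shot, not part of either library):

import numpy as np

def cosine_sim(a, b):
    # cos(theta) = (a · b) / (||a|| * ||b||); 1.0 means identical direction,
    # 0.0 means the vectors are orthogonal (unrelated).
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.7071

Now, let’s load the vectors and compare two words.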
word_vectors = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
v_banana = word_vectors['banana']
v_mango = word_vectors['mango']
cosine_similarity([v_banana], [v_mango])

Explanation:

  • In line 1, we loaded the word2vec model. This model was trained on the Google News dataset and produces vectors of 300 dimensions.
  • In lines 2 and 3, we fetched the word vectors for banana and mango.
  • In line 4, we used the cosine_similarity() function and computed the similarity by passing the two vectors.

You will see an output similar to the one below.

array([[0.63652116]], dtype=float32)

The result above means that the two words are roughly 64% similar (a cosine similarity of about 0.64).
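As a side note, gensim’s KeyedVectors can compute the same score directly, without scikit-learn. A short sketch, assuming the same word_vectors object loaded above:

# Equivalent one-liner using gensim's own API (no scikit-learn needed).
print(word_vectors.similarity('banana', 'mango'))

# KeyedVectors can also rank a word's nearest neighbors in the vector space.
print(word_vectors.most_similar('banana', topn=3))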

Note: if you try to get the vector for a word that is not in the model’s vocabulary, gensim will raise a KeyError. You can guard against this by checking membership first (as sketched below), or by training a model on your own dataset so the word is covered.
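A minimal sketch of that guard, assuming the word_vectors object from above (the token is made up for illustration):

word = 'qwertyuiop'  # a made-up token that is unlikely to be in the vocabulary

# KeyedVectors supports membership tests, so we can check before indexing.
if word in word_vectors:
    print(word_vectors[word])
else:
    print(f'"{word}" is out of vocabulary; train on your own corpus to cover it.')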

This is transfer learning in action in NLP: we reused vectors pre-trained on a huge corpus instead of training our own from scratch. If you want to learn more about transfer learning, check out the shots below:

  1. What is transfer learning and why is it needed?
  2. What are the strategies for using transfer learning?
  3. How to use a pre-trained deep learning model