In this shot, we are going to build an NLP engine that shows the similarity between two given words.
For this, we will use Gensim’s word2vec
model. Gensim provides an optimized implementation of word2vec’s CBOW and Skip-Gram models.
Before moving on, you need to download the pretrained word2vec vectors.
Click here to download the vectors. Keep in mind that the file size is ~1.5GB.
We suggest you work on Google Colab for this, as the file size is very large.
Open your Google Colab and run the command below to get your word vectors.
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
This command downloads the file directly onto Google’s servers, which saves a lot of time compared to downloading it to your own machine.
Now, let’s install the packages we require.
pip install gensim
pip install scikit-learn
You can run the above command in both Google Colab and on your local machine (if you’re using that).
Let’s move on to the coding part by first importing the packages, as shown below.
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity
print('Imported Successfully!')
We imported two packages. These packages will be used in the following way:
The gensim
package will be used to load the word vectors that we downloaded. KeyedVectors
essentially contains the mapping between words and embeddings. After training, it can be used to directly query those embeddings in various ways.
scikit-learn’s
cosine_similarity will be used to calculate the distance between two words. This distance metric is commonly used and provides good results for many types of problems.
word_vectors = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
v_banana = word_vectors['banana']
v_mango = word_vectors['mango']
cosine_similarity([v_banana], [v_mango])
Explanation:
First, we loaded the downloaded word vectors using KeyedVectors.load_word2vec_format().
Then, we retrieved the vectors for the words banana
and mango
.
Finally, we called the cosine_similarity()
function and computed the similarity by passing the two vectors. You will see an output similar to the one below.
array([[0.63652116]], dtype=float32)
The above means that the two words are roughly 64% similar (a cosine similarity of about 0.64).
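To see what cosine_similarity() computes under the hood, here is a minimal sketch of the formula using NumPy. The 3-dimensional vectors are hypothetical stand-ins; the real GoogleNews embeddings have 300 dimensions.

import numpy as np

def cosine_sim(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical toy vectors for illustration only.
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])
print(cosine_sim(v1, v2))  # ≈ 1.0, since parallel vectors are maximally similar

A score of 1.0 means the vectors point in the same direction, 0 means they are unrelated, and negative values indicate opposite directions.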
Note: if you try to get the vector for a word that is not in the vocabulary, you will get a KeyError. You can avoid this by checking whether the word is in the vocabulary first, or solve it more thoroughly by training the model on your own dataset.
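One way to guard against out-of-vocabulary words is a simple membership check, which also works with Gensim’s KeyedVectors via the in operator. The sketch below uses a toy dictionary of made-up vectors in place of the full GoogleNews model, purely for illustration.

import numpy as np

# Toy vocabulary standing in for the full word2vec model (hypothetical values).
toy_vectors = {
    "banana": np.array([0.1, 0.9, 0.3]),
    "mango":  np.array([0.2, 0.8, 0.4]),
}

def get_vector(word, vectors):
    # Return the embedding if the word is in the vocabulary, else None
    # instead of raising a KeyError.
    if word in vectors:
        return vectors[word]
    return None

print(get_vector("banana", toy_vectors) is not None)  # True
print(get_vector("durian", toy_vectors))              # None

With the real model, the same pattern is simply: if 'banana' in word_vectors: ....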
Using pretrained vectors like this is a form of transfer learning in NLP. If you want to learn more about transfer learning, check out the shots below: