...

GloVe: Global Vectors Representation

Learn about the GloVe algorithm for word embeddings.


One of the main limitations of the skip-gram and CBOW algorithms is that they can only capture local contextual information because they only look at a fixed-length window around a word. There’s an important part of the puzzle missing here because these algorithms don’t look at global statistics (by global statistics, we mean a way for us to see all the occurrences of words in the context of another word in a text corpus).

Co-occurrence matrix

We have already studied a structure that could contain this information in the previous chapter: the co-occurrence matrix. Let’s refresh our memory on the co-occurrence matrix because GloVe uses the statistics captured in the co-occurrence matrix to compute vectors.

Co-occurrence matrices encode the context information of words, but they require maintaining a $V \times V$ matrix, where $V$ is the size of the vocabulary. To understand the co-occurrence matrix, let’s take two example sentences:

  • Jerry and Mary are friends.
  • Jerry buys flowers for Mary.

If we assume a context window of size 1 on each side of a chosen word, the co-occurrence matrix will look like the following (we only show the upper triangle of the matrix because the matrix is symmetric):

Symmetric Upper Triangle Matrix

|         | Jerry | and | Mary | are | friends | buys | flowers | for |
|---------|-------|-----|------|-----|---------|------|---------|-----|
| Jerry   | 0     | 1   | 0    | 0   | 0       | 1    | 0       | 0   |
| and     |       | 0   | 1    | 0   | 0       | 0    | 0       | 0   |
| Mary    |       |     | 0    | 1   | 0       | 0    | 0       | 1   |
| are     |       |     |      | 0   | 1       | 0    | 0       | 0   |
| friends |       |     |      |     | 0       | 0    | 0       | 0   |
| buys    |       |     |      |     |         | 0    | 1       | 0   |
| flowers |       |     |      |     |         |      | 0       | 1   |
| for     |       |     |      |     |         |      |         | 0   |
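To make the construction concrete, the following is a minimal Python sketch that counts these co-occurrences with a symmetric window of size 1; the resulting pair counts match the upper-triangle entries above:

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=1):
    """Count how often each unordered word pair appears within
    `window` positions of each other."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only at the left context; since co-occurrence is
            # symmetric, each unordered pair is counted exactly once.
            for j in range(max(0, i - window), i):
                pair = tuple(sorted((tokens[j], word)))
                counts[pair] += 1
    return dict(counts)

sentences = [
    ["Jerry", "and", "Mary", "are", "friends"],
    ["Jerry", "buys", "flowers", "for", "Mary"],
]
print(cooccurrence_counts(sentences))
# {('Jerry', 'and'): 1, ('Mary', 'and'): 1, ('Mary', 'are'): 1, ...}
```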

This matrix shows how each word in a corpus is related to every other word, meaning it contains global statistics about the corpus. That said, what are some of the advantages of having a co-occurrence matrix as opposed to seeing just the local context?

  • It provides us with additional information about the characteristics of the words. For example, if we consider the sentence “The cat sat on the mat,” it’s difficult to say if “the” is a special word that appears in the context of words such as “cat” or “mat.” However, if we have a large enough corpus and a co-occurrence matrix, it’s very easy to see that “the” is a frequently occurring stop word.

  • The co-occurrence matrix captures the repeated usage of contexts or phrases, whereas, in the local context, this information is ignored. For example, in a large enough corpus, the phrase “New York” will clearly stand out, showing that the two words appear in the same context many times.

It’s important to keep in mind that Word2vec algorithms use various techniques to approximately inject some word co-occurrence patterns while learning word vectors. For example, the subsampling technique we already used (i.e., keeping lower-frequency words with higher probability while discarding very frequent ones) helps to detect and avoid stop words. However, these techniques introduce additional hyperparameters and are not as informative as the co-occurrence matrix.
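To make this concrete, here is a minimal Python sketch of this kind of subsampling, assuming the keep-probability heuristic and the conventional threshold $t = 10^{-5}$ from the Word2vec paper:

```python
import math
import random

def keep_probability(count, total_count, t=1e-5):
    """Probability of keeping one occurrence of a word under
    Word2vec-style subsampling: frequent words (e.g., stop words)
    are mostly dropped, while rare words are almost always kept."""
    freq = count / total_count            # relative frequency of the word
    return min(1.0, math.sqrt(t / freq))  # rare word -> probability near 1

def subsample(tokens, counts, t=1e-5):
    """Randomly drop tokens, biased against high-frequency words."""
    total = sum(counts.values())
    return [w for w in tokens
            if random.random() < keep_probability(counts[w], total, t)]
```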

Using global statistics to come up with word representations is not a new concept. An algorithm known as latent semantic analysis (LSA) has long used global statistics in its approach.

LSA is a document analysis technique that maps the words in documents to something known as a concept: a common pattern of words that appears in a document. Global matrix factorization-based methods efficiently exploit the global statistics of a corpus (for example, the co-occurrence of words in a global scope) but have been shown to perform poorly at word analogy tasks.

On the other hand, context window-based methods have been shown to perform well at word analogy tasks but do not utilize global statistics of the corpus, leaving space for improvement. GloVe attempts to get the best of both worlds—an approach that efficiently leverages global corpus statistics while optimizing the learning model in a context window-based manner similar to skip-gram or CBOW.

GloVe, a new technique for learning word embeddings, was introduced in the paper “GloVe: Global Vectors for Word Representation” by Pennington et al. (https://nlp.stanford.edu/pubs/glove.pdf). GloVe attempts to bridge the gap of missing global co-occurrence information in Word2vec algorithms. Its main contribution is a new cost function (or objective function) that uses the valuable statistics available in the co-occurrence matrix.
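For reference, the objective proposed in the paper takes the following form (a preview; each term is unpacked in what follows):

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

Here, $X_{ij}$ is the co-occurrence count of words $i$ and $j$; $w_i$ and $\tilde{w}_j$ are the word and context vectors; $b_i$ and $\tilde{b}_j$ are bias terms; and $f$ is a weighting function that caps the influence of very frequent co-occurrences. Let’s first look at the motivation behind the GloVe method.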

Understanding GloVe

Before looking at the implementation details of GloVe, let’s take time to understand the concepts governing the computations in GloVe. To do so, let’s consider an example:

  • Consider word $i$ = “Ice” and $j$ = “Steam.”
  • Define an arbitrary probe word $k$.
  • Define $P_{ik}$ to be the probability of words $i$ and $k$ occurring close to each other, and
...