...

GloVe: Global Vectors Representation

Learn about the GloVe algorithm for word embeddings.


One of the main limitations of the skip-gram and CBOW algorithms is that they can only capture local contextual information because they only look at a fixed-length window around a word. There’s an important part of the puzzle missing here because these algorithms don’t look at global statistics (by global statistics, we mean a way for us to see all the occurrences of words in the context of another word in a text corpus).

Co-occurrence matrix

We have already studied a structure that could contain this information in the previous chapter: the co-occurrence matrix. Let’s refresh our memory on the co-occurrence matrix because GloVe uses the statistics captured in the co-occurrence matrix to compute vectors.

Co-occurrence matrices encode the context information of words, but they require maintaining a $V \times V$ matrix, where $V$ is the size of the vocabulary. To understand the co-occurrence matrix, let’s take two example sentences:

  • Jerry and Mary are friends.
  • Jerry buys flowers for Mary.

If we assume a context window of size 1 on each side of a chosen word, the co-occurrence matrix will look like the following (we only show the upper triangle of the matrix because the matrix is symmetric):

Symmetric Upper Triangle Matrix

|         | Jerry | and | Mary | are | friends | buys | flowers | for |
|---------|-------|-----|------|-----|---------|------|---------|-----|
| Jerry   | 0     | 1   | 0    | 0   | 0       | 1    | 0       | 0   |
| and     |       | 0   | 1    | 0   | 0       | 0    | 0       | 0   |
| Mary    |       |     | 0    | 1   | 0       | 0    | 0       | 1   |
| are     |       |     |      | 0   | 1       | 0    | 0       | 0   |
| friends |       |     |      |     | 0       | 0    | 0       | 0   |
| buys    |       |     |      |     |         | 0    | 1       | 0   |
| flowers |       |     |      |     |         |      | 0       | 1   |
| for     |       |     |      |     |         |      |         | 0   |
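To make the construction concrete, the following is a minimal Python sketch that counts these co-occurrences with a symmetric window of size 1; the resulting pair counts match the upper-triangle entries above:

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=1):
    """Count how often each unordered word pair appears within
    `window` positions of each other."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only at the left context; since co-occurrence is
            # symmetric, each unordered pair is counted exactly once.
            for j in range(max(0, i - window), i):
                pair = tuple(sorted((tokens[j], word)))
                counts[pair] += 1
    return dict(counts)

sentences = [
    ["Jerry", "and", "Mary", "are", "friends"],
    ["Jerry", "buys", "flowers", "for", "Mary"],
]
print(cooccurrence_counts(sentences))
# {('Jerry', 'and'): 1, ('Mary', 'and'): 1, ('Mary', 'are'): 1, ...}
```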

This matrix shows how each word in a corpus is related to every other word, meaning it contains global statistics about the corpus. That said, what are some of the advantages of having a co-occurrence matrix as opposed to seeing just the local context?

  • It provides us with additional information about the characteristics of the words. For example, if we consider the sentence “The cat sat on the mat,” it’s difficult to say if “the” is a special word that appears in the context of words such as “cat” or “mat.” However, if we have a large enough corpus and a co-occurrence matrix, it’s very easy to see that “the” is a frequently occurring stop word.

  • The co-occurrence matrix captures the repeated usage of contexts or phrases, whereas, in the local context, this information is ignored. For example, in a large enough corpus, the phrase “New York” will clearly stand out, showing that the two words appear in the same context many times.

It’s important to keep in mind that Word2vec algorithms use various techniques to approximately inject some word co-occurrence patterns while learning word vectors. For example, the subsampling technique we already used (i.e., keeping lower-frequency words with higher probability while discarding very frequent ones) helps to detect and avoid stop words. However, these techniques introduce additional hyperparameters and are not as informative as the co-occurrence matrix.
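To make this concrete, here is a minimal Python sketch of this kind of subsampling, assuming the keep-probability heuristic and the conventional threshold $t = 10^{-5}$ from the Word2vec paper:

```python
import math
import random

def keep_probability(count, total_count, t=1e-5):
    """Probability of keeping one occurrence of a word under
    Word2vec-style subsampling: frequent words (e.g., stop words)
    are mostly dropped, while rare words are almost always kept."""
    freq = count / total_count            # relative frequency of the word
    return min(1.0, math.sqrt(t / freq))  # rare word -> probability near 1

def subsample(tokens, counts, t=1e-5):
    """Randomly drop tokens, biased against high-frequency words."""
    total = sum(counts.values())
    return [w for w in tokens
            if random.random() < keep_probability(counts[w], total, t)]
```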

Using global statistics to come up with word representations is not a new concept. An algorithm known as latent semantic analysis (LSA) has long used global statistics in its approach.

LSA is a document analysis technique that maps the words in documents to something known as a concept: a common pattern of words that appears in a document. Global matrix factorization-based methods efficiently exploit the global statistics of a corpus (for example, the co-occurrence of words in a global scope) but have been shown to perform poorly at word analogy tasks.

On the other hand, context window-based methods have been shown to perform well at word analogy tasks but do not utilize global statistics of the corpus, leaving space for improvement. GloVe attempts to get the best of both worlds—an approach that efficiently leverages global corpus statistics while optimizing the learning model in a context window-based manner similar to skip-gram or CBOW.

GloVe, a new technique for learning word embeddings, was introduced in the paper “GloVe: Global Vectors for Word Representation” by Pennington et al. (https://nlp.stanford.edu/pubs/glove.pdf). GloVe attempts to bridge the gap of missing global co-occurrence information in Word2vec algorithms. Its main contribution is a new cost function (or objective function) that uses the valuable statistics available in the co-occurrence matrix.
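For reference, the objective proposed in the paper takes the following form (a preview; each term is unpacked in what follows):

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

Here, $X_{ij}$ is the co-occurrence count of words $i$ and $j$; $w_i$ and $\tilde{w}_j$ are the word and context vectors; $b_i$ and $\tilde{b}_j$ are bias terms; and $f$ is a weighting function that caps the influence of very frequent co-occurrences. Let’s first look at the motivation behind the GloVe method.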

Understanding GloVe

Before looking at the implementation details of GloVe, let’s take time to understand the concepts governing the computations in GloVe. To do so, let’s consider an example:

  • Consider word $i$ = “Ice” and $j$ = “Steam.”
  • Define an arbitrary probe word $k$.
  • Define $P_{ik}$ to be the probability of words $i$ and $k$ occurring close to each other, and
...