Representing Text

Learn and compare the different methods for representing text.

Language is one of the most complex aspects of our existence. We use language to communicate our thoughts and choices. Every language is defined by a list of characters called the alphabet, a vocabulary, and a set of rules called grammar. Yet, it is not a trivial task to understand and learn a language. Languages are complex and have fuzzy grammatical rules and structures.

Text is a representation of language that helps us communicate and share. This makes it a perfect area of research to expand the horizons of what artificial intelligence can achieve. Text is a type of unstructured data that cannot directly be used by any of the known algorithms. Machine learning and deep learning algorithms, in general, work with numbers, matrices, vectors, and so on. This, in turn, raises the question: how can we represent text for different language-related tasks?

Bag of Words

As we mentioned earlier, every language consists of a defined list of characters (alphabet), which are combined to form words (vocabulary). Traditionally, a Bag of Words (BoW) has been one of the most popular methods for representing textual information.

BoW is a simple and flexible approach to transforming text into vector form. This transformation helps not only in extracting features from raw text, but also in making it fit for consumption by different algorithms and architectures. As the name suggests, the BoW model of representation utilizes each word as a basic unit of measurement. A BoW model describes the occurrence of words within a given corpus of text. To build a BoW model for representation, we require two major things:

  • Vocabulary: A collection of known words from the corpus of text to be analyzed.

  • Measure of occurrence: Something that we choose based on the application/task at hand. For instance, counting the occurrence of each word, known as term frequency, is one such measure.

A detailed discussion related to the BoW model is beyond the scope of this chapter. We are presenting a high-level overview as a primer before more complex topics are introduced.

The BoW model is called a “bag” to highlight the simplicity and the fact that we overlook any ordering of the occurrences. In other words, the BoW model discards any order or structure-related information of the words in a given text. This might sound like a big issue, but until recently, the BoW model remained quite a popular and effective choice for representing textual data.

Let’s have a quick look at a few examples to understand how this simple method works.

“Some say the world will end in fire,
Some say in ice.
From what I have tasted of desire
I hold with those who favor fire.”

The preceding snippet is a short excerpt from the poem Fire and Ice by Robert Frost. We’ll use these few lines of text to understand how the BoW model works. The following is a step-by-step approach:

  1. Define a vocabulary:

The first and foremost step is to define a list of known words from our corpus. For ease of understanding and practical reasons, we can ignore the case and punctuation marks for now. The vocabulary, or unique words, thus are {some, say, the, world, will, end, in, fire, ice, from, what, I, have, tasted, of, desire, hold, with, those, who, favor}.

This vocabulary is a set of 21 unique words in a corpus of 26 words.
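
The following is a rough sketch of this step in Python; the lines list and the tokenize helper are names chosen here purely for illustration. It lowercases the excerpt, strips punctuation, and collects the unique words:

```python
import string

# The four lines from the excerpt of "Fire and Ice"
lines = [
    "Some say the world will end in fire,",
    "Some say in ice.",
    "From what I have tasted of desire",
    "I hold with those who favor fire.",
]

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

tokens = [word for line in lines for word in tokenize(line)]
vocabulary = sorted(set(tokens))

print(len(tokens))      # 26 words in the corpus
print(len(vocabulary))  # 21 unique words
```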

  2. Define a metric of occurrence:

Once we have the vocabulary set, we need to define how we will measure the occurrence of each word from the vocabulary. As we mentioned earlier, there are a number of ways to do so. One such metric is simply checking if a specific word is present or absent. We use a 0 if the word is absent or a 1 if it is present.

An example of defining a metric of occurrence
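
Continuing the sketch from the previous step (it reuses the lines, tokenize, and vocabulary names defined there), the presence/absence metric could be computed as follows:

```python
def binary_vector(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary terms appear in the text."""
    words = set(tokenize(text))
    return [1 if term in words else 0 for term in vocabulary]

for line in lines:
    print(line, "->", binary_vector(line, vocabulary))
```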

There are a few other metrics that have been developed over the years. The most widely used metrics are:

  • Term frequency (TF)

  • Term frequency-inverse document frequency (TF-IDF)

  • Hashing

These steps provide a high-level glimpse into how the BoW model helps us represent textual data as numbers or vectors. The overall vector representation of our excerpt from the poem is depicted in the following table:

BoW representation

Each row in the matrix corresponds to one line from the poem, while the unique words from the vocabulary form the columns. Therefore, each row is simply the vector representation of the text under consideration.
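
As a quick, hedged illustration of how such a table can be produced in practice (assuming scikit-learn is installed; class and method names follow recent versions), CountVectorizer computes term frequencies, TfidfVectorizer computes TF-IDF weights, and HashingVectorizer (not shown) implements the hashing trick. Note that scikit-learn’s default tokenizer drops single-character words such as “I,” so the resulting vocabulary may differ slightly from the one we built by hand:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

lines = [
    "Some say the world will end in fire,",
    "Some say in ice.",
    "From what I have tasted of desire",
    "I hold with those who favor fire.",
]

tf = CountVectorizer()               # term frequency (TF)
tf_matrix = tf.fit_transform(lines)  # sparse matrix: one row per line

tfidf = TfidfVectorizer()            # TF-IDF weights
tfidf_matrix = tfidf.fit_transform(lines)

print(tf.get_feature_names_out())    # the learned vocabulary (the table's columns)
print(tf_matrix.toarray())           # dense view of the BoW table
```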

There are a few additional steps involved in improving the outcome of this method. The refinements are related to vocabulary and scoring aspects. Managing the vocabulary is very important; often, a corpus of text can increase in size quite rapidly. A few common methods of handling vocabularies are:

  • Ignoring punctuation marks

  • Ignoring case

  • Removing frequently occurring words (or stopwords) like 'a', 'an', 'the', 'this', and so on

  • Methods to use the root form of words, such as stop in place of stopping. Stemming and lemmatization are two such methods

  • Handling spelling mistakes
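
A minimal sketch of such cleanup might look as follows; the STOPWORDS set and the strip_suffix helper are deliberately crude stand-ins for a real stopword list and a proper stemmer or lemmatizer:

```python
import string

# Illustrative stopword list; real lists (e.g., from NLTK or spaCy) are much larger.
STOPWORDS = {"a", "an", "the", "this", "in", "of", "i"}

def strip_suffix(word):
    """Very rough stand-in for stemming: drop a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop punctuation, remove stopwords, and crudely stem."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [strip_suffix(w) for w in text.split() if w not in STOPWORDS]

print(preprocess("Some say the world will end in fire,"))
# ['some', 'say', 'world', 'will', 'end', 'fire']
```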

We have already discussed different scoring methods and how they help in capturing certain important features. BoW is a simple yet effective tool that serves as a good starting point for most NLP tasks. Still, there are a few issues, which can be summarized as follows:

  • Missing context: As we mentioned earlier, the BoW model does not consider the ordering or structure of the text. By simply discarding information related to ordering, the vectors lose out on capturing the context in which the underlying text was used. For instance, the sentences “I am sure about it” and “Am I sure about it?” would have identical vector representations, yet they express different thoughts. Expanding BoW models to include n-grams (contiguous terms) instead of singular terms does help in capturing some context, but in a very limited way (see the sketch after this list).

  • Vocabulary and sparse vectors: As the corpus size increases, so does the vocabulary. The steps required to manage vocabulary size require a lot of oversight and manual effort. Due to the way this model works, a large vocabulary leads to very sparse vectors. Sparse vectors pose issues with modeling and computation requirements (space and time). Aggressive pruning and vocabulary management steps do help to a certain extent but can lead to the loss of important features as well.
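
The following sketch illustrates the first issue using scikit-learn’s CountVectorizer (the token pattern is widened here so that single-character words like “I” are kept): the two sentences get identical unigram vectors, while adding bigrams tells them apart.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I am sure about it", "Am I sure about it"]

# Keep single-character tokens such as "I" (scikit-learn drops them by default).
pattern = r"(?u)\b\w+\b"

uni = CountVectorizer(ngram_range=(1, 1), token_pattern=pattern).fit_transform(sentences).toarray()
bi = CountVectorizer(ngram_range=(1, 2), token_pattern=pattern).fit_transform(sentences).toarray()

print((uni[0] == uni[1]).all())  # True: unigram BoW cannot tell the sentences apart
print((bi[0] == bi[1]).all())    # False: bigrams capture some of the word order
```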

Here, we discussed how the BoW model helps transform text into vector form and a few issues with this setup. In the next section, we will move on to a few more involved representation methods that alleviate some of these issues.

Distributed representation

The Bag of Words model is an easy-to-understand way of transforming words into vector form. This process is generally termed vectorization. While it is a useful method, the BoW model has its limitations when it comes to capturing context, along with sparsity-related issues. Since deep learning architectures are becoming de facto state-of-the-art systems in most spaces, it is obvious that we should be leveraging them for NLP tasks as well. Apart from the issues mentioned earlier, the sparse and large (wide) vectors from the BoW model are another aspect that can be tackled using neural networks.

A simple alternative that handles the sparsity issue can be implemented by encoding each word as a unique number. Continuing with the example from the previous section, “some say ice,” we could assign 1 to “some,” 2 to “say,” 3 to “ice,” and so on. This would result in a dense vector [1, 2, 3]. This is an efficient utilization of space, and we end up with vectors where every element holds a value. However, the limitation of missing context still remains. Since the numbers are arbitrary, they hardly capture any context on their own. Moreover, arbitrarily mapping numbers to words is not very interpretable.
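
A minimal sketch of this integer encoding (the word_to_id mapping here is an arbitrary, illustrative choice):

```python
words = ["some", "say", "ice"]

# Assign each unique word an arbitrary id (starting at 1; 0 is often reserved for padding).
word_to_id = {word: idx + 1 for idx, word in enumerate(words)}

encoded = [word_to_id[w] for w in ["some", "say", "ice"]]
print(encoded)  # [1, 2, 3] -- dense, but the ids carry no meaning or context of their own
```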

Interpretability is an important requirement when it comes to NLP tasks. For computer vision use cases, visual cues are good enough indicators for understanding how a model is perceiving or generating outputs (though quantification is also a problem there, we can skip it for now). For NLP tasks, since the textual data is first required to be transformed into a vector, it is important to understand what those vectors capture and how they are used by the models.

In the coming sections, we'll cover some of the popular vectorization techniques that try to capture context while limiting the sparsity of the vectors as well.

Please note that there are a number of other methods that help in vectorizing textual data, such as co-occurrence matrices and methods based on SVD (singular value decomposition, a matrix factorization technique that decomposes any matrix into three constituent matrices). In this section, we'll cover only those that are helpful in understanding later sections of this chapter.

word2vec

The Oxford English Dictionary contains about 600k unique words and is growing year on year. Yet those words are not independent terms; they have some relationship to each other. The premise of the word2vec model is to learn high-quality vector representations that capture context. This is better summarized by the famous quote by J.R. Firth:

“You shall know a word by the company it keeps.”

In their work titled “Efficient Estimation of Word Representations in Vector Space” (Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean, 2013, https://arxiv.org/abs/1301.3781), Mikolov et al. present two different models that learn vector representations of words from a large corpus. Word2Vec is a software implementation of these models, which is classified as an iterative approach to learning such embeddings. Instead of taking the whole corpus into account in one go, this approach tries to iteratively learn to encode each word’s representation, along with its context. This idea of learning word representations as dense context vectors is not a new one. It was proposed much earlier by Rumelhart and colleagues in their work on distributed representations (Hinton, G., J. McClelland, and D. Rumelhart, “Distributed Representations,” https://web.stanford.edu/~jlmcc/papers/PDP/Chapter3.pdf). They presented how a neural network is able to learn representations with similar words ending up in the same clusters. The ability to have vector forms of words that capture some notion of similarity is quite a powerful one. Let’s see in detail how the word2vec models achieve this.
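
For reference, here is a hedged sketch of training such a model with the gensim library (assuming gensim 4.x; parameter names such as vector_size follow that version, and the tiny corpus is purely illustrative, since useful embeddings need a large corpus):

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
corpus = [
    ["some", "say", "the", "world", "will", "end", "in", "fire"],
    ["some", "say", "in", "ice"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of the learned word vectors
    window=2,         # context words considered on each side of the target
    min_count=1,      # keep every word, even if it appears only once
    sg=0,             # 0 = CBOW, 1 = skip-gram
)

print(model.wv["fire"].shape)                 # (50,)
print(model.wv.most_similar("fire", topn=2))  # nearest words by cosine similarity
```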

Continuous Bag of Words (CBOW) model

The Continuous Bag of Words (CBOW) model is an extension of the Bag of Words model we discussed in the previous section. The key aspect of this model is the context window. A context window is defined as a sliding window of a fixed size moving along a sentence. The word in the middle is termed the target, and the terms to its left and right within the window are the context terms. The CBOW model works by predicting the target term, given its context terms.
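
As a small sketch of how such (context, target) pairs can be extracted (context_target_pairs is a helper name chosen here; a window size of 4 means two context words on each side of the target):

```python
def context_target_pairs(tokens, window=4):
    """Yield (context terms, target) pairs using a sliding window over the tokens."""
    half = window // 2
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - half):i]
        right = tokens[i + 1:i + 1 + half]
        yield (left + right, target)

sentence = "some say the world will end in fire".split()
for context, target in context_target_pairs(sentence):
    print(context, "->", target)
# e.g. ['say', 'the', 'will', 'end'] -> world
```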

For instance, let’s consider a reference sentence, “some say the world will end in fire.” If we have a window size of 4 and a target term of “world,” the context terms would be {say, the} and {will, end}. The model inputs are tuples of the form (context terms, ...