Representing Text

Learn and compare the different methods for representing text.

Language is one of the most complex aspects of our existence. We use language to communicate our thoughts and choices. Every language is defined by a list of characters called the alphabet, a vocabulary, and a set of rules called grammar. Yet, it is not a trivial task to understand and learn a language. Languages are complex and have fuzzy grammatical rules and structures.

Text is a representation of language that helps us communicate and share. This makes it a perfect area of research to expand the horizons of what artificial intelligence can achieve. Text is a type of unstructured data that most algorithms cannot consume directly. Machine learning and deep learning algorithms generally work with numbers, vectors, and matrices. This, in turn, raises the question: how can we represent text for different language-related tasks?

Bag of Words

As we mentioned earlier, every language consists of a defined list of characters (alphabet), which are combined to form words (vocabulary). Traditionally, the Bag of Words (BoW) model has been one of the most popular methods for representing textual information.

BoW is a simple and flexible approach to transforming text into vector form. This transformation helps not only in extracting features from raw text, but also in making it fit for consumption by different algorithms and architectures. As the name suggests, the BoW model of representation utilizes each word as a basic unit of measurement. A BoW model describes the occurrence of words within a given corpus of text. To build a BoW model for representation, we require two major things:

  • Vocabulary: A collection of known words from the corpus of text to be analyzed.

  • Measure of occurrence: Something that we choose based on the application/task at hand. For instance, counting the occurrence of each word, known as term frequency, is one such measure.
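As a minimal sketch of these two ingredients, the following Python snippet builds a vocabulary and a term-frequency count from a toy corpus. The sample sentence and the helper name `tokenize` are assumptions made for illustration, and lowercasing plus stripping punctuation is only one possible preprocessing choice.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its runs of letters (hypothetical helper)."""
    return re.findall(r"[a-z]+", text.lower())

# Toy corpus chosen for illustration; it is not part of the lesson.
corpus = "The cat sat on the mat. The mat was warm."
tokens = tokenize(corpus)

# Vocabulary: the collection of known (unique) words in the corpus.
vocabulary = sorted(set(tokens))

# Measure of occurrence: term frequency, i.e. how often each word appears.
term_frequency = Counter(tokens)

print(vocabulary)
print(term_frequency["the"], term_frequency["mat"])  # 3 2
```

Other measures of occurrence can be swapped in without changing the vocabulary step, which is what makes the approach flexible.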

A detailed discussion related to the BoW model is beyond the scope of this chapter. We are presenting a high-level overview as a primer before more complex topics are introduced.

The BoW model is called a “bag” to highlight the simplicity and the fact that we overlook any ordering of the occurrences. In other words, the BoW model discards any order or structure-related information of the words in a given text. This might sound like a big issue, but until recently, the BoW model remained quite a popular and effective choice for representing textual data.
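To make the loss of ordering concrete, here is a small sketch in Python. The two sentences and the helper name `bag_of_words` are illustrative assumptions. The sentences describe opposite events, yet they map to the same bag because only the word counts are kept.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a lowercased, whitespace-tokenized text (hypothetical helper)."""
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

# Word order differs, but the counts are identical.
print(a == b)  # True
```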

Let’s have a quick look at a few examples to understand how this simple method works.

“Some say the world will end in fire,
Some say in ice.
From what I have tasted of desire
I hold with those who favor fire.”

The preceding snippet is a short excerpt from the poem Fire and Ice by Robert Frost. We’ll use these few lines of text to understand how the BoW model works. The following is a step-by-step approach:

  1. Define a vocabulary:

The first step is to define a list of known words from our corpus. For ease of understanding and practical reasons, we can ignore the case and punctuation marks for now. The vocabulary, that is, the set of unique words, is therefore {some, say, the, world, will, end, in, fire, ice, from, what, I, have, tasted, of, desire, hold, with, those, who, favor}.

This vocabulary is a set of 21 unique words in a corpus of 26 words.

  2. Define a metric of occurrence:

Once we have the vocabulary set, we need to define how we will measure the occurrence of each word from the vocabulary. As we mentioned earlier, there are a number of ways to do so. One such metric is simply checking if a specific word is present or absent. We use a 0 if the word is absent or a 1 if it is present.
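The two steps so far can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the helper names `tokenize` and `presence_vector` are assumptions, and the regular expression is one simple way to ignore case and punctuation.

```python
import re

# The four lines of the excerpt, as quoted in this lesson.
lines = [
    "Some say the world will end in fire,",
    "Some say in ice.",
    "From what I have tasted of desire",
    "I hold with those who favor fire.",
]

def tokenize(text):
    """Lowercase the text and keep only runs of letters (hypothetical helper)."""
    return re.findall(r"[a-z]+", text.lower())

# Step 1: the vocabulary, in first-seen order with duplicates removed.
tokens = [tok for line in lines for tok in tokenize(line)]
vocabulary = list(dict.fromkeys(tokens))
print(len(tokens), len(vocabulary))  # 26 21

def presence_vector(text, vocabulary):
    """Step 2: 1 if a vocabulary word occurs in the text, 0 otherwise."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]

# Encode each line of the poem as a 21-dimensional binary vector.
for line in lines:
    print(presence_vector(line, vocabulary))
```

Each line of the poem becomes a vector with one position per vocabulary word. For example, "Some say in ice." has a 1 only at the positions for some, say, in, and ice.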
