Vectorizing Language

Explore how transforming sparse text into dense word embeddings revolutionized NLP and powered modern GenAI.

We saw how traditional methods—rule-based systems, bag-of-words, TF-IDF, and n-gram models—represented text using word frequencies or counts. While these methods enabled machines to perform tasks like text classification or next-word prediction, they treated words as isolated tokens. This meant synonyms such as “cat” and “feline” remained completely distinct despite their similar meanings. Frequency-based techniques might notice that both often appear near words like “purr” or “whiskers,” but they couldn’t unify them conceptually. Similarly, synonyms such as “great,” “terrific,” and “awesome” were not recognized as sharing similar sentiments.

This gap—capturing the meaning and relationships between words—led researchers to develop word embeddings. Word embeddings represent words as vectors in a continuous, high-dimensional space, where the geometry of that space reflects semantic relationships. For example, embeddings can capture analogies like "king" - "man" + "woman" ≈ "queen," something that frequency-based methods can’t achieve.
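
To make the analogy concrete, here is a minimal sketch with hand-picked toy vectors (the four dimensions and their values are hypothetical, chosen purely for illustration; real embeddings are learned from data and have hundreds of dimensions):

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, chosen by hand for illustration only.
embeddings = {
    "king":  np.array([0.8, 0.9, 0.1, 0.7]),
    "man":   np.array([0.7, 0.1, 0.1, 0.6]),
    "woman": np.array([0.7, 0.1, 0.9, 0.6]),
    "queen": np.array([0.8, 0.9, 0.9, 0.7]),
}

def cosine_similarity(a, b):
    """How closely two vectors point in the same direction (1.0 = identical direction)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# "king" - "man" + "woman" lands on a vector very close to "queen".
result = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(cosine_similarity(result, embeddings["queen"]))  # close to 1.0
```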

How have word embeddings changed NLP?

Traditional methods like bag-of-words and TF-IDF yield sparse representations—imagine a library catalog with thousands of categories where most entries are empty. In contrast, word embeddings map each word to a dense vector (often 100–300 dimensions). Think of each dimension like an ingredient in a recipe: if you have 300 different ingredients, each contributes a unique flavor. In practice, these dimensions emerge during training, capturing hidden linguistic features—like topic or sentiment—so having many dimensions lets the model distinguish subtle differences among words. That’s why words with similar meanings—like “cat” and “feline”—end up close together in this high-dimensional space.
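
As a rough sketch of that size difference (the vocabulary size, word indices, and the random stand-in for a learned vector below are made up for illustration):

```python
import numpy as np

# Sparse bag-of-words vector: one slot per vocabulary word, almost all zeros.
vocab_size = 10_000
sparse_doc = np.zeros(vocab_size)
sparse_doc[[42, 317, 4056]] = 1  # hypothetical indices for "the", "cat", "mat"
print(np.count_nonzero(sparse_doc), "of", vocab_size, "entries are nonzero")  # 3 of 10000

# Dense embedding: every one of the (say) 300 dimensions carries some signal.
dense_cat = np.random.default_rng(0).normal(size=300)  # stand-in for a learned vector
print(dense_cat.shape)  # (300,)
```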


Another way to look at it is to imagine that you’re organizing books in a library. Instead of alphabetizing books, you group them based on themes—mystery novels near detective stories and sci-fi books near space exploration guides. Word embeddings do something similar: they place words with related meanings close together in a vast, invisible semantic space. This extra resolution allows the model to encode complex relationships between words.

Note: You don’t need an in-depth understanding of machine learning (ML) and deep learning (DL). We’re mentioning them here to provide historical context. Detailed explanations will be provided later in the course as needed.

The shift to embeddings coincided with major advancements in ML and DL. As algorithms began to learn patterns from large amounts of data, techniques like backpropagation (imagine a teacher grading a test and returning it with corrections for the student to learn from) helped neural networks refine these embeddings. Don’t worry if this training process sounds mysterious; we’ll dive deeper into how neural networks learn later in the course. For now, think of it as a feedback loop that repeatedly nudges the model toward a better understanding of language nuances. This leap set the stage for advanced NLP techniques such as Word2Vec and GloVe.

What is Word2Vec?

One of the landmark breakthroughs in word embeddings came with Word2Vec, introduced by a team at Google led by Tomas Mikolov in 2013. Unlike earlier approaches that focused on “here’s how often words appear,” Word2Vec learns embeddings by training a neural network to predict a missing word from its context (CBOW) or the surrounding context from a word (Skip-Gram), capturing patterns of word co-occurrence as geometry in a low-dimensional vector space. Because similar words end up with similar vector directions, we can even perform arithmetic-like operations—for example, subtracting "man" from "king" and adding "woman" to get a vector near "queen." The model learns these complex semantic relationships from millions of training examples, going well beyond simple co-occurrence counts.
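
As a quick illustration of how such a model can be trained in practice, here is a minimal sketch using the gensim library (the toy corpus, window size, and other parameters are arbitrary choices for this example; with so little data, the learned vectors are only illustrative):

```python
from gensim.models import Word2Vec

# A tiny toy corpus: each "sentence" is a list of tokens.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "feline", "sat", "on", "the", "rug"],
    ["the", "dog", "chased", "the", "cat"],
]

# sg=0 selects the CBOW architecture; sg=1 would select Skip-Gram.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=100)

print(model.wv["cat"].shape)                 # (50,) dense vector for "cat"
print(model.wv.most_similar("cat", topn=3))  # nearest neighbors in this toy space
```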

This shift was significant because older frequency-based methods (including n-gram models) treated words as isolated tokens or short sequences. In contrast, Word2Vec uses a neural network to learn word representations, adjusting weights during training so that words with similar contexts receive similar embeddings.

Word2Vec is trained using one of two architectures:

Continuous bag of words (CBOW)

It predicts the center word in a sequence based on its surrounding context. For example, in the sentence “The cat sat on the ___,” the model uses the context words [“The,” “cat,” “sat,” “on”] to predict the center word “mat.” Each word is represented as an embedding—a vector in a high-dimensional space—and these embeddings are averaged or summed to form a context vector. This context vector is then passed through a neural network, which outputs a probability distribution over the vocabulary, selecting the most likely center word.


The following CBOW implementation is provided for demonstration purposes ...
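
The sketch below uses PyTorch (assumed available) and a tiny toy corpus; the window size, embedding dimension, and training loop are minimal, illustrative choices rather than a tuned implementation:

```python
import torch
import torch.nn as nn

# Toy corpus and vocabulary.
tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(tokens))
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Build (context, center) training pairs with a window of 2 words on each side.
window = 2
pairs = []
for i in range(window, len(tokens) - window):
    context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
    pairs.append(([word_to_idx[w] for w in context], word_to_idx[tokens[i]]))

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context_idxs):
        # Average the context word embeddings, then score every vocabulary word.
        context_vec = self.embeddings(context_idxs).mean(dim=0)
        return self.linear(context_vec)

model = CBOW(len(vocab), embedding_dim=16)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Each step nudges the embeddings so the context better predicts its center word.
for epoch in range(100):
    for context, center in pairs:
        logits = model(torch.tensor(context))
        loss = loss_fn(logits.unsqueeze(0), torch.tensor([center]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# After training, the embedding table holds one dense vector per word.
print(model.embeddings.weight[word_to_idx["cat"]].detach())
```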
