Vectorizing Language
Explore how transforming sparse text into dense word embeddings revolutionized NLP and powered modern GenAI.
Traditional NLP methods—such as rule-based systems, bag of words (BoW), TF-IDF, and n-gram models—represent text using word frequencies or counts. While these methods enable machines to perform tasks like text classification or next-word prediction, they have a fundamental flaw: they treat words as isolated tokens without capturing their relationships or meanings.
For example, synonyms such as “cat” and “feline” remain completely distinct, even though they describe the same animal. Frequency-based techniques might notice that both often appear near words like “purr” or “whiskers”, but they fail to unify them conceptually. Similarly, words like “great”, “terrific”, and “awesome” are not recognized as expressing similar sentiments.
This limitation exists because traditional methods treat words as independent entities, much like assigning ID numbers to people. Imagine a guest list at an event where each attendee has a unique badge number. The system can confirm whether a guest is present, but it has no way of knowing if two guests are friends, relatives, or complete strangers. Similarly, frequency-based methods count how often words appear but fail to capture any deeper relationships between them.

This approach also suffers from the curse of dimensionality. As the vocabulary grows, these representations become huge and sparse, making computations inefficient. More importantly, because each word is treated as an isolated unit, the model loses the ability to generalize—meaning it requires massive amounts of data to learn even simple language patterns.
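To make the sparsity concrete, here is a minimal sketch using scikit-learn’s CountVectorizer (the two-sentence corpus is invented purely for illustration and is not part of the lesson’s examples):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, just to show the shape of a bag-of-words matrix.
corpus = [
    "the cat sat on the mat",
    "a feline purred on the sofa",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)  # rows = documents, columns = vocabulary words

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # mostly zeros, one count per word per document
```

Even in this tiny example, most entries are zero, and “cat” and “feline” occupy unrelated columns, so nothing in the representation hints that they refer to the same animal. With a realistic vocabulary of tens of thousands of words, the matrix becomes enormous and almost entirely sparse.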
To overcome this limitation, researchers developed word embeddings—a way to represent words as dense vectors in a continuous space with far fewer dimensions than the vocabulary has words. Unlike frequency-based methods, embeddings capture the relationships between words by positioning them in a shared space, where similar words end up closer together.
How have word embeddings changed NLP?
Traditional methods like BoW and TF-IDF create sparse representations—imagine a library catalog with thousands of categories, where most entries are empty. In contrast, word embeddings assign each word a dense vector (often 100–300 dimensions).
Think of each dimension like an ingredient in a recipe: if you have 300 different ingredients, each one adds a unique flavor. Similarly, the dimensions in a word embedding capture different linguistic features—such as topic, sentiment, or syntax. This enables embeddings to distinguish subtle differences among words and recognize that “cat” and “feline” belong close together in vector space.
Another way to think about this is by imagining organizing books in a library. Instead of simply alphabetizing titles, you group books by theme—placing mystery novels near detective stories and sci-fi books near space exploration guides. Word embeddings do something similar: they cluster words with related meanings in a vast, invisible semantic space. This extra resolution allows models to encode complex relationships between words, making NLP systems far more effective.
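As a rough illustration of “closer together,” here is a small sketch with made-up vectors (real embeddings have hundreds of dimensions and are learned from data, not written by hand); cosine similarity is a common way to measure how aligned two vectors are:

```python
import numpy as np

# Invented 4-dimensional "embeddings", purely for illustration.
embeddings = {
    "cat":    np.array([0.80, 0.10, 0.70, 0.20]),
    "feline": np.array([0.75, 0.15, 0.65, 0.25]),
    "car":    np.array([0.10, 0.90, 0.20, 0.85]),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["feline"]))  # high: close in space
print(cosine_similarity(embeddings["cat"], embeddings["car"]))     # noticeably lower
```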
Note: You don’t need an in-depth understanding of machine learning (ML) or deep learning (DL) to follow this lesson. We’re mentioning them here to provide historical context. Detailed explanations will be provided later in the course as needed.
The shift to embeddings coincided with major advancements in ML and DL. As algorithms began to learn patterns from large amounts of data, techniques like backpropagation (imagine a teacher grading a test and returning it with corrections for the student to learn from) helped neural networks refine these embeddings. Don’t worry if this training process sounds mysterious; we’ll dive deeper into how neural networks learn later in the course. For now, just know that a feedback loop repeatedly nudges the model toward a better understanding of language nuances. This leap set the stage for advanced NLP techniques such as Word2Vec and GloVe.
What is Word2Vec?
One of the landmark breakthroughs in word embeddings came with Word2Vec, introduced by a team at Google led by Tomas Mikolov in 2013. Unlike earlier approaches that simply counted how often words appear, Word2Vec learns embeddings by training a shallow neural network to predict a word from its surrounding context (CBOW) or the surrounding context from a word (Skip-Gram), encoding co-occurrence patterns as dense, relatively low-dimensional vectors. Because similar words end up pointing in similar directions, we can even perform arithmetic-like operations—for example, subtracting “man” from “king” and adding “woman” yields a vector near “queen.” The model learns these complex semantic relationships from millions of training examples, going well beyond simple co-occurrence counts.
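As a rough sketch of what this looks like in practice, the snippet below trains a tiny Word2Vec model with the gensim library (not covered in this lesson); the toy corpus is far too small to reproduce the king/queen analogy, but it shows the training and query API:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; real Word2Vec models are trained on millions of sentences.
sentences = [
    ["the", "king", "ruled", "the", "kingdom"],
    ["the", "queen", "ruled", "the", "kingdom"],
    ["the", "man", "walked", "to", "the", "castle"],
    ["the", "woman", "walked", "to", "the", "castle"],
]

# sg=1 selects the Skip-Gram architecture; sg=0 (the default) selects CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

# Arithmetic-like query: vector("king") - vector("man") + vector("woman").
# With embeddings trained on a large corpus, "queen" appears near the top of the results.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```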
This shift was significant because older frequency-based methods (including n-gram models) treated words as isolated tokens or short sequences. In contrast, Word2Vec uses a neural network to learn word representations, adjusting weights during training so that words with similar contexts receive similar embeddings.
Word2Vec is trained using one of two architectures: