Vectorizing Language
Explore how moving from sparse text representations to dense word embeddings revolutionized NLP and powered modern GenAI.
We saw how traditional methods such as rule-based systems, bag-of-words, TF-IDF, and n-gram models represented text using word frequencies or counts. While these methods enabled machines to perform tasks like text classification or next-word prediction, they treated words as isolated tokens. This meant synonyms such as “cat” and “feline” remained completely distinct despite their similar meanings. Frequency-based techniques might notice that both often appear near words like “purr” or “whiskers,” but they couldn’t unify them conceptually. Similarly, synonyms such as “great,” “terrific,” and “awesome” were not recognized as expressing similar sentiment.
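To see this limitation concretely, here is a minimal sketch, assuming scikit-learn is available, in which two sentences differ only in the word “cat” versus “feline.” TF-IDF gives each surface form its own column, so nothing in the representation links the two synonyms.

```python
# A minimal sketch (assuming scikit-learn is installed) of how a frequency-based
# representation keeps synonyms completely separate.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat purred by the window",
    "the feline purred by the window",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # sparse document-term matrix

# "cat" and "feline" occupy separate, unrelated columns in the vocabulary.
print(vectorizer.get_feature_names_out())

# The two documents only look similar because of the other shared words;
# the synonym columns themselves never overlap.
print(cosine_similarity(tfidf[0], tfidf[1]))
```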
This gap in capturing the meaning of words and the relationships between them led researchers to develop word embeddings. Word embeddings represent words as vectors in a continuous, high-dimensional space, where the geometry of that space reflects semantic relationships. For example, embeddings can capture analogies like “king” - “man” + “woman” ≈ “queen,” something that frequency-based methods can’t achieve.
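As a rough illustration of that vector arithmetic, the sketch below assumes gensim is installed and uses its downloadable “glove-wiki-gigaword-100” pretrained vectors; the exact neighbors and scores depend on which embedding model you load.

```python
# A sketch of the classic analogy test, assuming gensim and its downloadable
# "glove-wiki-gigaword-100" pretrained vectors (fetched on first use).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# Vector arithmetic: king - man + woman should land near "queen".
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)   # typically [('queen', ...)] with a high similarity score
```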
How have word embeddings changed NLP?
Traditional methods like bag-of-words and TF-IDF yield sparse representations: imagine a library catalog with thousands of categories where most entries are empty. In contrast, word embeddings map each word to a dense vector, often with 100–300 dimensions. Think of each dimension as an ingredient in a recipe: with 300 different ingredients, each contributes its own flavor. In practice, these dimensions emerge during training and capture hidden linguistic features, such as topic or sentiment, so having many dimensions lets the model distinguish subtle differences among words. That’s why words with similar meanings, like “cat” and “feline,” end up close together in this space.
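Continuing with the same pretrained vectors as in the earlier sketch, the snippet below inspects one dense vector and compares a few cosine similarities. Exact values vary by model, but related words such as “cat” and “feline” should score noticeably higher than an unrelated pair.

```python
# Continuing the sketch above: inspect the dense vectors directly.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # cached after the first download

# Each word maps to a dense 100-dimensional vector, in contrast to a sparse
# bag-of-words vector with one slot per vocabulary word (almost all zeros).
cat, feline, car = vectors["cat"], vectors["feline"], vectors["car"]
print(cat.shape)    # (100,)

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(cat, feline))   # noticeably higher: related meanings
print(cosine(cat, car))      # lower: unrelated meanings
```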