Pretraining Paradigms: How Do Models Learn?
Explore how models learn, what paradigms they follow, and why it’s crucial for building foundation models.
Our previous lesson discussed how foundation models have revolutionized AI by offering a single, flexible platform for various tasks. Now, the big question is: how do we get from raw data to a trained system capable of all these amazing feats? How do these models go from a blank slate to the impressive general-purpose tools we see today? That’s exactly what we’ll explore in this lesson.
How does a model learn?
Before we begin, let’s try to understand what it means to train a model. Training a model means teaching a computer to recognize patterns by showing it many examples. It starts with random guesses and then improves by adjusting its brain (weights) to make better predictions over time.
Imagine teaching a kid to recognize cats:
You show them many pictures of cats and tell them, “This is a cat.”
At first, they might guess wrong—maybe they think a dog is a cat.
Every time they make a mistake, you correct them.
Over time, they get better and can recognize cats independently.
A computer does the same thing using math and data instead of human intuition! There is a catch, though. Notice that in the example above (point 3), every time the kid makes a mistake, we correct them. This can mean one of two things:
Either we already know how to recognize cats ourselves.
Or the pictures of the animals have labels that tell us what each image is about.
In an ideal world, we would have infinite time to teach everyone how to recognize cats, but that approach would not scale. An easier approach would be to collect many pictures of cats labeled “cat” and show them to the kid. The kid could first make a guess and then read the label to check. This would allow the kid to learn well, provided we have labeled data. This technique is called supervised learning, where the model learns by looking at labeled examples (where both the input and the correct output are known).
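The guess-check-correct loop described above can be sketched as a toy perceptron in plain Python. The two features (ear pointiness, whisker length) and all the numbers here are invented for illustration; real image models learn far richer features, but the core idea is the same: predict, compare against the label, and nudge the weights after every mistake.

```python
# Toy supervised learner: a one-neuron classifier that starts with zero
# weights and adjusts them every time its guess disagrees with the label.
# Features and data are made up for illustration.

def train(examples, labels, epochs=20, lr=0.1):
    """Perceptron-style training: guess, check the label, correct."""
    w = [0.0, 0.0]   # the "brain": one weight per feature
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            guess = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            error = y - guess          # 0 when the guess was right
            # Every mistake produces a small correction to the weights.
            w[0] += lr * error * x[0]
            w[1] += lr * error * x[1]
            b += lr * error
    return w, b

def predict(x, w, b):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Two made-up features per animal: (ear pointiness, whisker length).
cats = [(0.9, 0.8), (0.8, 0.9), (0.95, 0.7)]   # labeled 1 ("cat")
dogs = [(0.2, 0.3), (0.1, 0.2), (0.3, 0.1)]    # labeled 0 ("not cat")
w, b = train(cats + dogs, [1, 1, 1, 0, 0, 0])

print(predict((0.85, 0.85), w, b))  # cat-like features
print(predict((0.15, 0.20), w, b))  # dog-like features
```

Note that the labels do the teaching: without the `labels` list, the loop has no way to compute `error`, which is exactly the limitation the next section addresses.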
Dataset bias: If a model only sees well-lit cat images during training, it might fail to recognize cats in shadowy or low-light conditions. Diversifying the training dataset is key for robust image recognition.
Working with unlabeled data
Here’s another reality check: most of the available data is not labeled. Unsupervised learning is a technique in which a model tries to find hidden patterns or groupings in data without labels. This is useful for discovering natural clusters, like grouping customers by purchasing habits without knowing their demographics. However, this technique mainly finds clusters; it doesn’t learn rich representations. A rich representation means the model has learned deep, useful, and generalizable features that allow it to understand concepts rather than memorize examples. Let’s understand this with an example.
Imagine teaching another kid to recognize cats:
The kid looks at a big pile of animal pictures—some cats, dogs, and rabbits, but no labels.
They start grouping similar-looking animals (all animals with pointy ears and whiskers, all animals with long ears and hops, and ones who bark and have floppy ears).
They form natural clusters—one group for cats, one for rabbits, and one for dogs.
Later, someone tells the kid, “Oh, this group is called cats!” Now, the kid recognizes cats, even without someone explicitly teaching the kid.
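The grouping the kid performs can be sketched with a minimal k-means clustering loop. The two “appearance” features per animal are invented for illustration, and note that the algorithm never sees a label; it only groups similar-looking points, just as the kid did before anyone named the groups.

```python
# Unsupervised grouping: a minimal k-means in plain Python.
# No labels are given; the algorithm only clusters similar points.

def kmeans(points, k, iters=10):
    """Assign each point to its nearest centroid, then move each
    centroid to the mean of its cluster, and repeat."""
    centroids = points[:k]  # naive init: the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign the point to the closest centroid (squared distance).
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        # Move each centroid to the mean of its assigned points.
        centroids = [
            tuple(sum(vals) / len(vals) for vals in zip(*c)) if c
            else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Unlabeled animal pictures, described by (ear pointiness, ear length).
animals = [(0.9, 0.2),   # cat-like
           (0.2, 0.9),   # rabbit-like
           (0.5, 0.6),   # dog-like
           (0.85, 0.25), (0.25, 0.85), (0.55, 0.55)]
groups = kmeans(animals, k=3)
```

Afterward, a human can look at each cluster and name it (“this group is cats”), which is exactly the step the kid needed someone else to provide.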
Until someone told the kid what each group was, they did not know. Furthermore, perhaps they only ...