Pretraining Paradigms
Explore how pretraining shapes foundation models, covering masked, causal, and contrastive learning techniques.
Modern foundation models like GPT can feel a bit mysterious at first. You might hear terms like masked language modeling, autoregressive next-token prediction, or contrastive learning and wonder: are these about how the model is trained, or about how it makes predictions? The truth is that the pretraining task shapes both:
The training process (how the model’s parameters are optimized)
The final behavior (how the trained model naturally generates text)
We’ll dissect these two facets, explaining why the pretraining objective is the core of how the model is trained and how that, in turn, shapes the model’s capabilities at inference.
How does pretraining define both training and inference?
One of the clearest illustrations of the connection between pretraining and the model’s behavior comes from the GPT series, described in Language Models Are Unsupervised Multitask Learners (Radford et al., 2019). During pretraining, GPT models ingest massive amounts of text (the authors used a dataset called WebText, consisting of high-quality websites). Rather than receiving explicit labels, GPT is tasked with next-token prediction, which means that, given all the tokens so far, it must guess the next one.
This objective, often called autoregressive or causal language modeling, is how GPT learns. The model sees huge sequences of text, tries to predict each upcoming token, compares its guess to the actual next token, and adjusts its parameters to reduce the discrepancy. The entire training loop is anchored in this objective, repeated billions of times on billions of tokens. By the end of training, GPT has gained a deep statistical sense of language.
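To make the objective concrete, here is a minimal sketch (in PyTorch, with made-up token IDs, not GPT’s actual pipeline) of how a single sequence becomes a set of next-token prediction examples: every prefix of the sequence is paired with the token that actually follows it.

```python
import torch

# Hypothetical token IDs standing in for the sentence "the cat sat on the mat"
tokens = torch.tensor([464, 3797, 3332, 319, 262, 2603])

inputs = tokens[:-1]   # what the model is shown
targets = tokens[1:]   # what it must predict at each position

# Each prefix of the sequence yields one prediction task.
for pos in range(len(inputs)):
    context = inputs[: pos + 1].tolist()
    print(f"given {context} -> predict token {targets[pos].item()}")
```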
However, this same next-token task also characterizes how GPT behaves at inference. When you type a question or start a sentence, GPT continues it with the most likely subsequent words according to the patterns it gleaned from pretraining. If you give GPT the prompt, “Translate this sentence into French: I love cats,” the model is still predicting the next token; it has simply learned from its broad training data that a likely continuation is the French translation of “I love cats.” At inference, the model applies the same mechanism that governs every training step: predict what should come next.
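The sketch below illustrates this inference-time loop under some assumptions: `model` is a placeholder for any network that maps token IDs to next-token logits, and `tokenizer` is a placeholder with `encode`/`decode` methods; neither refers to a specific library’s API. Real systems usually sample from the predicted distribution rather than always taking the single most likely token, but greedy selection keeps the idea visible.

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=20):
    # Encode the prompt into token IDs; shape (1, seq_len) for a batch of one.
    ids = torch.tensor([tokenizer.encode(prompt)])
    for _ in range(max_new_tokens):
        logits = model(ids)                        # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1)  # greedy: pick the most likely next token
        ids = torch.cat([ids, next_id.unsqueeze(-1)], dim=-1)
    return tokenizer.decode(ids[0].tolist())
```

Generation is just the training objective run forward repeatedly: predict the next token, append it, and predict again.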
Now that we understand how an autoregressive objective drives GPT’s core behavior, it helps to see how this next-token approach gets enforced in practice. That’s where loss functions and optimization come in.
What is the role of loss functions and optimization during the pretraining stage?
A large model like GPT doesn’t magically know when it’s right or wrong; it needs a loss function to measure each mistake. In GPT’s next-token scenario, every mismatch between the model’s prediction and the actual token in the text increases the loss. This ...
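As a rough illustration of how this measurement and correction typically look in code (a minimal sketch under assumptions, not GPT’s actual training loop), the next-token objective is commonly implemented as a cross-entropy loss over shifted targets, followed by a gradient step. Here `model` is assumed to return per-position logits and `optimizer` is a standard PyTorch optimizer.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, batch_tokens):
    # batch_tokens: (batch, seq_len) integer token IDs from the pretraining corpus
    inputs, targets = batch_tokens[:, :-1], batch_tokens[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab_size)

    # Cross-entropy compares the predicted distribution at each position with the
    # token that actually came next; every mismatch pushes the loss up.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

    # Gradient descent nudges the parameters to make that loss a little smaller.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```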