...


Overview of Large Language Models (LLMs)

Learn about foundation models, their downstream tasks, how they scale, and why they hallucinate.

Transformer architecture in LLMs

The transformer architecture revolutionized the way we train natural language processing models: it processes the input words, or tokens, in parallel using attention mechanisms, which makes training far more efficient. Transformer models are trained on large amounts of data, on the order of billions of tokens rather than thousands, so scale is an essential factor. In principle, the more data the model is trained on, the better it performs. Because these models are trained on such large amounts of data, we refer to them as large language models (LLMs). LLMs are pretrained on large corpora and can then be fine-tuned for many specific and specialized downstream tasks with a relatively small dataset.
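To make the pretrain-then-fine-tune idea concrete, here is a minimal sketch of fine-tuning a pretrained BERT checkpoint on a tiny, made-up sentiment dataset. It assumes the Hugging Face transformers and datasets libraries are available; the checkpoint name, toy examples, and training settings are illustrative only, not prescribed by this lesson.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# A toy "small downstream dataset" for binary sentiment classification (illustrative).
data = Dataset.from_dict({
    "text": ["I loved this movie", "Terrible acting and a dull plot"],
    "label": [1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # pretrained encoder + a fresh classification head

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=32)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()  # updates the pretrained weights using the small labeled dataset
```

The key point is that the pretrained weights do most of the work; the fine-tuning step only adapts them to the new task with a fraction of the data used for pretraining.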

Because LLMs can transfer what they learn to downstream tasks, and because of their versatility, we also refer to them as foundation models. We can distinguish between two types of transformer-based LLMs and their downstream tasks: models such as BERT, which use the encoder part of the transformer, and models such as GPT, which use the decoder part. BERT is designed to understand the context of text bidirectionally, which makes it ideal for tasks such as sentiment analysis, text classification, and topic modeling. In healthcare, for example, BERT has been used to analyze electronic health records for insights that help in the early diagnosis of diseases such as diabetes or heart conditions. GPT, on the other hand, is a decoder-focused model that excels at generating coherent and contextually relevant text, making it ideal for applications in language generation and chatbots. GPT-like models have also advanced the field of language translation, enabling real-time, context-aware translation services that break down language barriers. A short code contrast between the two families follows below.
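The sketch below illustrates the encoder/decoder split using the Hugging Face pipeline API; the choice of tooling and checkpoints here is an assumption for illustration (the default sentiment checkpoint is a DistilBERT variant, and gpt2 stands in for a decoder-only generator).

```python
from transformers import pipeline

# Encoder-style (BERT-like): reads the whole sentence bidirectionally
# and classifies it, e.g., for sentiment analysis.
classifier = pipeline("sentiment-analysis")  # default checkpoint is a DistilBERT variant
print(classifier("The early screening results look promising."))

# Decoder-style (GPT-like): generates text autoregressively, one token at a time.
generator = pipeline("text-generation", model="gpt2")
print(generator("Large language models can", max_new_tokens=20))
```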

We increasingly see LLMs, such as GPT, built using only the decoder part of the transformer, as in the image below.

Transformer architecture with decoder-only component

The decoder-only design implies that the model has already been pretrained; as a consequence, we can use the decoder part alone at inference time to generate text, such as predicting the next word in a sentence.
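As a sketch of what predicting the next word looks like in practice, the snippet below runs a prompt through GPT-2 (used here only as a convenient public stand-in for a pretrained decoder-only LLM) and reads off the most likely next token from the logits at the final position.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The transformer architecture revolutionized"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits   # shape: (1, sequence_length, vocab_size)

next_token_id = int(logits[0, -1].argmax())   # greedy pick of the most likely next token
print(prompt + tokenizer.decode(next_token_id))
```

Text generation simply repeats this step: the chosen token is appended to the prompt and the longer sequence is fed back through the decoder.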

Scaling LLMs

Through experiments, it was ...
