Introduction to Large Language Models (LLMs)
Learn about large language models (LLMs), their architecture, applications, and the challenges faced while training them.
What are LLMs?
LLMs are statistical models designed for tasks such as generating text, understanding and translating human language, and extracting meaningful information from large amounts of data. They model the probability of word sequences and, based on these probabilities, generate text.
LLMs are AI-based models built from deep neural networks that contain an enormous number of parameters. These parameters are trained on very large datasets of text, images, audio, or video so that the model learns to generate the desired output.
In AI, training prepares the model to identify patterns in the input data in order to make predictions. In generative AI, these predictions are used to generate output for the user. LLMs are often trained using self-supervised learning techniques, where the data itself provides the supervision. In self-supervised learning, the model learns to predict parts of the input from other parts, essentially generating its own labels from the data. For instance, in the context of language models, the model might predict the next word in a sentence given the previous words. This differs from supervised learning, where the model is trained on labeled datasets with explicit input-output pairs provided by humans. Self-supervised learning allows LLMs to leverage vast amounts of unlabeled data, making it a powerful approach for training large-scale models.
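To make the self-supervised objective concrete, the toy sketch below (plain Python, no real model; the whitespace tokenizer and context window size are illustrative assumptions, not how production LLMs tokenize) shows how (context, next-word) training pairs can be derived from raw text alone, with no human-provided labels.

```python
# Toy illustration of self-supervised next-word prediction:
# the "labels" are simply the next word in the running text.

text = "the model might predict the next word in a sentence"
tokens = text.split()  # naive whitespace tokenization, for illustration only

context_size = 3  # hypothetical context window
pairs = []
for i in range(context_size, len(tokens)):
    context = tokens[i - context_size:i]  # input: the previous words
    target = tokens[i]                    # label: the word that follows
    pairs.append((context, target))

for context, target in pairs[:3]:
    print(context, "->", target)
# ['the', 'model', 'might'] -> predict
# ['model', 'might', 'predict'] -> the
# ['might', 'predict', 'the'] -> next
```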
Examples of LLMs
The GPT and BERT series are among the most widely used LLMs available today. Let's discuss them one by one.
GPT models
GPT is a series of generative pre-trained transformer models developed by OpenAI over the past few years. The objective behind these models is to learn from large text corpora and produce human-like text.
GPT-1, GPT-2, GPT-3, and GPT-4 are variants of the GPT series models. The table below summarizes their differences.
Comparison of GPT models
| Model | Year | Architecture | Parameters | Training Data |
|---|---|---|---|---|
| GPT-1 | 2018 | Transformer, 12-layer | 117 million | BooksCorpus (7,000 unpublished books) |
| GPT-2 | 2019 | Transformer, 48-layer | 1.5 billion | 8 million web documents |
| GPT-3 | 2020 | Transformer, 96-layer | 175 billion | 570 GB of internet data |
| GPT-4 | 2023 | Multimodal transformer | Not disclosed | Larger dataset, including text and images |
BERT models
The BERT language model is an open-source machine learning framework for natural language processing (NLP). BERT is intended to assist computers in understanding the meaning of ambiguous words in text by establishing context from the surrounding text.
Here's a list of BERT models along with their key attributes and features:
Comparison of BERT models
| Model | Developed by | Year | Architecture | Parameters | Training Data |
|---|---|---|---|---|---|
| BERT | Google | 2018 | 12 layers (base), 24 layers (large) | 110 million – 340 million | BooksCorpus (800 million words) + English Wikipedia (2.5 billion words) |
| RoBERTa | Meta AI | 2019 | 12 layers (base), 24 layers (large) | 125 million – 355 million | 160 GB of text from sources like Common Crawl and Wikipedia |
| DistilBERT | Hugging Face | 2019 | 6 layers | 66 million | Same as the BERT dataset |
| ALBERT | Google and TTIC | 2020 | 12 layers | 11 million – 223 million | Same as the BERT dataset |
| BioBERT | Korea University and Clova Research | 2019 | 12 layers (base), 24 layers (large) | 110 million – 345 million | 4.5 billion words from PubMed abstracts and 13.5 billion words from PMC full-text articles |
| ClinicalBERT | MIT and Harvard | 2019 | 12 layers | 110 million | MIMIC-III clinical notes (approx. 880 million – 1.2 billion words) |
Architecture of LLMs
LLMs like BERT and GPT are based on the transformer architecture, which uses self-attention mechanisms to process input sequences in parallel, allowing for efficient training on large datasets. The general architecture of LLMs comprises the following layers:
Embedding layers: These layers transform the input text into numerical representations, known as vectors (embeddings), that the model can process during training. Through these layers, LLMs represent words and the connections between them. The key distinction between standalone embedding methods and the contextual embeddings used by LLMs is whether a word's context is taken into account: models such as ELMo and BERT consider the surrounding sentence to determine a word's actual meaning. By capturing both the general meaning of words and how they're used in context, embedding layers are essential for LLMs to "speak" human language fluently.
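As a minimal sketch of the idea, the PyTorch snippet below maps token IDs from a made-up toy vocabulary to learnable vectors with nn.Embedding; real LLMs use learned subword tokenizers, much larger vocabularies, and also add positional information.

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary; real models use learned subword tokenizers.
vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

token_ids = torch.tensor([[vocab["the"], vocab["cat"], vocab["sat"]]])
vectors = embedding(token_ids)   # one 8-dimensional vector per token
print(vectors.shape)             # torch.Size([1, 3, 8])
```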
Feedforward layers: These fully connected layers process the transformed data by applying weights and activation functions to capture patterns and relationships within the input text. Activation functions such as ReLU play a vital role here: they introduce non-linearity, allowing the model to relate words in more complex ways. It's no longer about single words; these layers capture how words work together to express bigger ideas and even emotions in the text.
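The position-wise feedforward block in a transformer is typically two linear layers with a non-linearity in between. The sketch below is a minimal PyTorch version; the dimensions (512 and 2048) are illustrative assumptions, not values from any specific model.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feedforward block: Linear -> ReLU -> Linear."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),               # the non-linearity mentioned in the text
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)           # applied independently at each position

x = torch.randn(1, 3, 512)           # (batch, sequence length, d_model)
print(FeedForward()(x).shape)        # torch.Size([1, 3, 512])
```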
Attention layers: These layers focus on the significant part(s) of the input. The attention mechanism improves deep learning models by concentrating on the most important parts of the sequence, which boosts accuracy and efficiency. There are multiple types of attention mechanisms, including but not limited to self-attention and multi-head attention (a small code sketch follows the two points below).
Self-attention mechanisms allow the model to focus on different parts of the input sequence to understand the relationships between words.
Multi-head attention involves using multiple self-attention mechanisms in parallel to capture diverse aspects of the input.
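To make this concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch. The tensor shapes are illustrative; production models wrap this in multi-head attention with learned query, key, and value projection matrices.

```python
import math
import torch
import torch.nn.functional as F

def self_attention(q, k, v):
    """Scaled dot-product attention: each position attends to every position."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # query-key similarity
    weights = F.softmax(scores, dim=-1)                       # attention weights sum to 1
    return weights @ v                                        # weighted mix of the values

# Toy input: a batch of 1 sequence with 3 tokens and 8-dimensional vectors.
x = torch.randn(1, 3, 8)
out = self_attention(x, x, x)  # self-attention: queries, keys, values from the same input
print(out.shape)               # torch.Size([1, 3, 8])
```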
Normalization layers: These layers standardize the input data or intermediate outputs of the network to a consistent range or distribution. Their primary goals are to accelerate training, improve convergence, and help prevent issues like exploding or vanishing gradients, which in turn improves the model's performance and training stability.
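As a small sketch, PyTorch's nn.LayerNorm (the kind of normalization commonly used in transformers) standardizes each token's feature vector to roughly zero mean and unit variance before a learned scale and shift are applied; the feature size of 8 below is an arbitrary illustrative choice.

```python
import torch
import torch.nn as nn

layer_norm = nn.LayerNorm(normalized_shape=8)   # normalize over the feature dimension

x = torch.randn(1, 3, 8) * 10 + 5               # deliberately off-scale activations
y = layer_norm(x)

print(y.mean(dim=-1))  # each token's features now have roughly zero mean
print(y.std(dim=-1))   # ...and roughly unit standard deviation
```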
Output layer: The output layer is the final layer that transforms the model's internal representations into human-readable text. In generation tasks (like GPT), it predicts the probability of the next word in the sequence. For models like BERT, the output layer is used to generate predictions for tasks such as classification or named entity recognition. Named Entity Recognition (NER) is a fundamental NLP task that involves identifying and classifying entities in text into predefined categories.
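A minimal sketch of a generation-style output layer: a linear projection from the model's hidden size to the vocabulary size, followed by a softmax over possible next tokens. The sizes below are illustrative, and greedy argmax is just one of several decoding strategies.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512            # illustrative sizes
output_layer = nn.Linear(d_model, vocab_size)

hidden = torch.randn(1, d_model)             # hidden state of the last position
logits = output_layer(hidden)                # one score per vocabulary token
probs = torch.softmax(logits, dim=-1)        # probability of each possible next token

next_token_id = torch.argmax(probs, dim=-1)  # greedy choice of the next token
print(next_token_id.shape)                   # torch.Size([1])
```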
Note: All these layers work together, step-by-step, to unlock the meaning in language for LLMs.
Applications of LLMs
LLM applications are widely used around the world. A few of the most common ones are as follows:
Text generation: This is one of the most common applications of LLMs today. Large language models are extensively used to generate human-like text. For instance, authors and storytellers can use them to generate stories and plot outlines, and content creators use them to draft articles (a short usage sketch follows this list).
Machine translation: LLMs are powerful machine translation tools. Machine translation is the process of transforming text (or audio) from one language to another.
Sentiment analysis: LLMs can predict a user's sentiment or emotions. Sentiment analysis is usually applied to short texts such as SMS messages, tweets, comments on a post, or reviews of a product or app.
Summarization: LLMs are very useful for summarization, allowing the user to grasp the main idea of a large amount of text within a few moments.
Image generation from text: LLMs can be combined with image models to generate images from provided text. Such models are trained on text–image pairs using various machine learning algorithms. The model examines the input text, extracts the useful elements, works out the relationships among them, and generates the resulting image. This application can be used in many fields, such as entertainment, marketing, and education.
Chatbots: LLMs power chatbots, which are AI-embedded applications designed to converse with humans.
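As a quick, hedged illustration of the text generation, sentiment analysis, and summarization applications above, the Hugging Face transformers pipelines below wrap pretrained models behind one-line APIs; where no model is named, the library picks a default checkpoint, and the exact outputs will vary from run to run.

```python
from transformers import pipeline

# Text generation with a small GPT-style model.
generator = pipeline("text-generation", model="gpt2")
print(generator("Once upon a time", max_new_tokens=20)[0]["generated_text"])

# Sentiment analysis on a short review-style text.
classifier = pipeline("sentiment-analysis")
print(classifier("The app is fast and easy to use."))

# Summarization of a longer passage.
summarizer = pipeline("summarization")
print(summarizer("Large language models are trained on vast corpora of text "
                 "and are used for translation, summarization, and chat.",
                 max_length=30, min_length=5)[0]["summary_text"])
```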
Challenges in training of LLMs
Below are a few challenges that organizations usually face during the training process of LLMs.
High budget: Training an LLM can be expensive, even for large organizations. It requires costly computing resources, continuous power and network availability, and the installation of damage and disaster control systems. Hiring specialized personnel adds further cost.
Giant corpus: LLMs are trained on large datasets that usually occupy gigabytes or even terabytes of storage.
Enormous training time: Due to the large number of parameters and the vast dataset, fully training an LLM can take several weeks or even months.
Data quality and diversity: We need to ensure that the training data is of high quality and diverse enough to cover various use cases. Poor quality data can introduce biases and inaccuracies, while a lack of diversity can lead to models that perform poorly on underrepresented scenarios.
Model interpretability and explainability: LLMs are often considered "black boxes" due to their complex architectures. Understanding and explaining their decision-making processes can be difficult, which is a barrier to their adoption in sensitive or high-stakes applications.
Fortunately, advancements in cloud computing have made it more feasible for organizations to afford the computational resources needed to train LLMs, helping them cope with large corpora and long training times.