Quantization: Reducing the Model Size

Learn about the quantization techniques and understand how they reduce the size of the model.

Generative AI has undergone a revolution in recent years, with LLMs evolving rapidly to become more powerful than ever before. These models can now understand and respond to user queries in a human-like manner and can perform complex tasks such as question answering, text generation, sentiment analysis, code generation, image generation, and much more. Along with this growing intelligence, these models are also getting bigger and more complex in terms of the number of parameters. For example, the table below lists the parameter counts of some widely used large language models:

| Model | Number of Parameters |
| --- | --- |
| GPT-4 | 1.7 trillion |
| Llama 3.1 | 405 billion |
| Gemini 1.0 Ultra | 175 billion |
| Mistral | 7 billion |
| BLOOM | 176 billion |

This growing complexity brings challenges, such as the memory required to train and deploy these large-scale models. As models expand, so does the demand for computational resources, making them difficult to manage and deploy efficiently. This raises a crucial question: how, then, can we fine-tune large-scale models like GPT and Llama on task-specific data?

This is where quantization comes into play, offering a solution to these challenges. Let’s get into the details of the quantization process and how it facilitates fine-tuning.

Quantization

Quantization is a technique for reducing model size by compressing the model weights from a high-precision representation to a low-precision one. The weights of a language model are numerical values that can be stored in different data types, depending on the available computational resources and the required precision. The default data type of most models is the 32-bit floating-point number (float32), which means each weight of such a model takes 4 bytes (1 byte = 8 bits) of memory.
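To make the 4-bytes-per-weight arithmetic concrete, here is a minimal sketch that estimates the memory needed just to store a model's weights at different precisions. The byte sizes per data type are exact, but the parameter counts are taken from the table above, and real deployments need additional memory for activations, the KV cache, and (during training) optimizer state:

```python
# Bytes occupied by one weight in common precisions (1 byte = 8 bits).
BYTES_PER_DTYPE = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory required to store only the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_DTYPE[dtype] / 1e9


# Mistral's 7 billion parameters at full vs. reduced precision:
print(weight_memory_gb(7e9, "float32"))  # 28.0 GB
print(weight_memory_gb(7e9, "int8"))     # 7.0 GB
```

At float32, even the comparatively small Mistral model needs about 28 GB just for its weights, which already exceeds most consumer GPUs; quantizing to int8 cuts that to roughly 7 GB.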

Quantization reduces the number of bits required for each weight of the model by changing the data type ...