Quantization: Reducing the Model Size

Learn about the quantization techniques and understand how they reduce the size of the model.

Generative AI has undergone a revolution in recent years, with LLMs evolving rapidly to become more powerful than ever before. These models can now understand and respond to user queries in a human-like manner and can perform complex tasks such as question answering, text generation, sentiment analysis, code generation, image generation, and much more. Along with this growing intelligence, these models are also getting bigger and more complex in terms of the number of parameters. For example, the table below lists the parameter counts of some widely used large language models:

| Model | Number of Parameters |
| --- | --- |
| GPT-4 | 1.7 trillion |
| Llama 3.1 | 405 billion |
| Gemini 1.0 Ultra | 175 billion |
| Mistral | 7 billion |
| BLOOM | 176 billion |

This growing complexity brings challenges, such as the memory required to train and deploy these large-scale models. As models expand, so does the demand for computational resources, making them difficult to manage and deploy efficiently. This raises a crucial question: how, then, can we fine-tune large-scale models like GPT and Llama on task-specific data?

This is where quantization comes into play, offering a solution to these challenges. Let’s get into the details of the quantization process and how it facilitates fine-tuning.

Quantization

Quantization is a technique for reducing model size by compressing the model weights from a high-precision representation to a low-precision one. The weights of a language model are numerical values that can be stored in different data types, depending on the available computational resources and the required precision. The default data type of most models is the 32-bit floating-point number (float32), which means each weight of such a model takes 4 bytes (1 byte = 8 bits) of memory.
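To make the 4-bytes-per-weight arithmetic concrete, here is a minimal sketch that estimates the memory needed just to store a model's weights at different precisions. The byte sizes per data type are exact, but the parameter counts are taken from the table above, and real deployments need additional memory for activations, the KV cache, and (during training) optimizer state:

```python
# Bytes occupied by one weight in common precisions (1 byte = 8 bits).
BYTES_PER_DTYPE = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory required to store only the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_DTYPE[dtype] / 1e9


# Mistral's 7 billion parameters at full vs. reduced precision:
print(weight_memory_gb(7e9, "float32"))  # 28.0 GB
print(weight_memory_gb(7e9, "int8"))     # 7.0 GB
```

At float32, even the comparatively small Mistral model needs about 28 GB just for its weights, which already exceeds most consumer GPUs; quantizing to int8 cuts that to roughly 7 GB.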

Quantization reduces the number of bits required for each weight of the model by changing the data type ...