Quantized Low-Rank Adaptation (QLoRA)
Learn about the components and working of the Quantized Low-Rank Adaptation (QLoRA) technique.
Quantized Low-Rank Adaptation (QLoRA), as the name suggests, combines two widely used techniques for efficient fine-tuning: LoRA and quantization. Where LoRA uses low-rank matrices to reduce the number of trainable parameters, QLoRA extends it by also quantizing the base model’s weights, further reducing the model’s memory footprint.
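To make this concrete, here is a minimal sketch of how QLoRA is typically set up with the Hugging Face transformers, peft, and bitsandbytes libraries. The model checkpoint and hyperparameter values below are illustrative choices, not part of this lesson’s running example.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# The "Q" in QLoRA: load the base model with 4-bit NF4 weights,
# double quantization, and bfloat16 compute for forward/backward passes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",            # illustrative checkpoint; any causal LM works
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# The "LoRA" in QLoRA: attach small trainable low-rank adapters
# on top of the frozen, quantized base weights.
lora_config = LoraConfig(
    r=16,                            # rank of the low-rank update (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the LoRA adapters are trainable
```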
Components of QLoRA
The following are the three main components of QLoRA:
4-bit NormalFloat quantization
Double quantization
Paged optimizers
Let’s dive into the details of each component.
4-bit NormalFloat quantization
The NormalFloat (NF) data type is a theoretically optimal data type that uses quantile quantization, assigning an equal number of input values to each quantization bin.
QLoRA uses a special form of it called 4-bit NormalFloat (NF4) quantization, which compresses the model’s weights from a 32-bit floating-point format to a 4-bit format. Model weights, which tend to follow a normal distribution (most values are near zero), are first scaled to fit within the range of [-1, 1] and then mapped to the nearest of the 16 NF4 quantization levels.
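As an illustration of the idea (not the exact NF4 codebook that bitsandbytes ships), the following sketch builds a 16-level codebook from normal-distribution quantiles, absmax-scales a block of weights into [-1, 1], and snaps each value to its nearest level. The function names and block size are assumptions made for this example.

```python
import numpy as np
from scipy.stats import norm

def nf4_style_levels(num_levels: int = 16) -> np.ndarray:
    """Illustrative 4-bit codebook from quantiles of a standard normal
    distribution, rescaled to [-1, 1]. Levels are denser near zero,
    where most weights live (the real NF4 codebook differs slightly)."""
    probs = np.linspace(0.01, 0.99, num_levels)   # avoid the infinite tails
    levels = norm.ppf(probs)
    return levels / np.abs(levels).max()          # normalize to [-1, 1]

def quantize_nf4_style(weights: np.ndarray, levels: np.ndarray):
    """Absmax-scale a weight block to [-1, 1], then snap each value to the
    nearest codebook level. Returns 4-bit indices plus the scale factor
    needed to dequantize later."""
    scale = np.abs(weights).max()
    normalized = weights / scale                  # values now in [-1, 1]
    indices = np.abs(normalized[:, None] - levels[None, :]).argmin(axis=1)
    return indices.astype(np.uint8), scale

def dequantize(indices: np.ndarray, scale: float, levels: np.ndarray) -> np.ndarray:
    """Recover approximate weights: look up each level and undo the scaling."""
    return levels[indices] * scale

# Example: quantize a block of normally distributed "weights"
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=64)
levels = nf4_style_levels()
idx, scale = quantize_nf4_style(w, levels)
w_hat = dequantize(idx, scale, levels)
print("max abs error:", np.abs(w - w_hat).max())
```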