Let’s begin by implementing quantization on Meta’s Llama 3.1 model, which has 8 billion parameters.

First, we need to set up an account on Hugging Face to access the model. Hugging Face is an open-source machine learning platform where users can explore, share, deploy, and collaborate on thousands of models, datasets, and applications. Users can download models and fine-tune them for downstream tasks such as text summarization and question answering. Meta AI has made all variants of its Llama models, including Llama 3.1, available on Hugging Face.

Let’s create an account and get a token for Hugging Face:

  • Visit the Hugging Face website and create an account.

  • Visit the Access Token page and create a token using the “New token” button.

Request Llama 3.1 from Meta

Meta requires users to submit an access request form before downloading the model weights. Go to the models page of Hugging Face and choose the meta-llama/Meta-Llama-3.1-8B-Instruct model. Fill out the access request form for the selected model. Once Meta grants access to the model, we can download it.

Note: The repository owner (Meta, in this case) usually grants access to the model within an hour.
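Once access is granted, the download itself can be sketched with the `transformers` Auto classes. The `device_map` and dtype settings below are illustrative assumptions, and the function is deliberately not called here because the checkpoint is gated and several gigabytes in size:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"


def download_model(model_id: str = MODEL_ID):
    """Fetch (or reuse the local cache of) the gated model and its tokenizer.

    Requires a prior Hugging Face login with a token whose access
    request Meta has approved.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",   # spread layers across available devices (needs accelerate)
        torch_dtype="auto",  # keep the checkpoint's native dtype
    )
    return tokenizer, model
```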

Now we are all set for quantizing the model.

Install the dependencies

First, let’s install the following libraries, which we need to implement quantization. We are installing the latest versions available at the time of writing.
