In the previous lesson, we covered the training and evaluation of the Llama 3.2 GenAI model, resulting in a fully trained, production-ready model. With the model prepared, the next critical step is deployment—making it accessible to users at scale. Cloud service providers like AWS, GCP, and Microsoft Azure offer intuitive platforms that simplify deployment for third-party users. However, we will build the system from scratch to gain a solid understanding of the various design decisions involved and to ensure the design meets our specific demands.

Deploying large-scale models demands a robust System Design that covers resource estimation, scaling, high availability, fault tolerance, and reliability. These systems require diverse resources, including compute power, storage capacity, and servers capable of performing complex and varied tasks such as load balancing, content filtering, and inference processing.

In this lesson, we’ll build the System Design for deploying a text-to-text (conversational) model, beginning with estimating the resources required for system deployment. From there, we’ll explore how different components are integrated into a robust, efficient, and scalable architecture that can support real-world use cases.

We will perform some back-of-the-envelope calculations for the following resources (a brief sketch of this kind of arithmetic follows the list):

  • Storage required for the model to ensure fast and reliable access for inference

  • Storage needed for users’ metadata and their interactions with the system

  • Compute or inference servers required to handle a large number of user requests

  • Network bandwidth/capacity for seamless communication
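
Before diving in, here is a minimal sketch of how such estimates are typically derived. All of the numbers below (request volume, per-server throughput, tokens per response, bytes per token) are hypothetical placeholders, not figures from this lesson:

```python
# Hypothetical workload assumptions (placeholders, not lesson figures)
daily_requests = 10_000_000   # requests per day
avg_response_tokens = 500     # tokens generated per response
bytes_per_token = 4           # rough average size of a text token

seconds_per_day = 24 * 3600
qps = daily_requests / seconds_per_day  # average queries per second

# If one inference server sustains ~10 requests/second (hypothetical),
# the fleet size is average load divided by per-server throughput.
server_throughput = 10
servers_needed = qps / server_throughput

# Outbound bandwidth for the generated text alone (bits per second)
bandwidth_bps = qps * avg_response_tokens * bytes_per_token * 8

print(f"~{qps:.0f} QPS, ~{servers_needed:.0f} servers, "
      f"~{bandwidth_bps / 1e6:.1f} Mbps outbound")
```

In practice, we would provision for peak load rather than the daily average, but the arithmetic stays the same.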

Let’s start with the model size estimation!

Model size estimation

We established in the previous lesson that we will be using a model similar to Llama 3.2 3B to design a text-to-text generation system. To estimate the size of the model, we’ll consider the precision and parameter count as follows:
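As a quick sketch (assuming roughly 3 billion parameters and the standard per-parameter storage cost of each precision), the raw weight storage is simply the parameter count multiplied by the bytes per parameter:

```python
BYTES_PER_GB = 1e9  # using decimal gigabytes

def model_size_gb(num_params: float, bytes_per_param: float) -> float:
    """Raw weight storage: parameters x bytes per parameter."""
    return num_params * bytes_per_param / BYTES_PER_GB

params = 3e9  # Llama 3.2 3B: ~3 billion parameters

for precision, nbytes in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
    print(f"{precision:>10}: ~{model_size_gb(params, nbytes):.0f} GB")
# FP32: ~12 GB, FP16/BF16: ~6 GB, INT8: ~3 GB
```

Keep in mind that these figures cover only the weights; serving the model also requires memory for activations and the KV cache, so the actual inference footprint is higher.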
