In the previous lesson, we covered the training and evaluation of the Llama 3.2 GenAI model, resulting in a fully trained, production-ready model. With the model prepared, the next critical step is deployment—making it accessible to users at scale. Cloud service providers like AWS, GCP, and Microsoft Azure offer intuitive platforms that simplify deployment for third-party users. However, we will build the system from scratch to gain a solid understanding of the various design decisions involved and to ensure the design meets our specific demands.

Deploying large-scale models demands a robust System Design that covers resource estimation, scaling, high availability, fault tolerance, and reliability. These systems require diverse resources, including compute power, storage capacity, and servers capable of performing complex and varied tasks such as load balancing, content filtering, and inference processing.

In this lesson, we’ll build the System Design for deploying a text-to-text (conversational) model, beginning with estimating the resources required for system deployment. From there, we’ll explore how different components are integrated into a robust, efficient, and scalable architecture that can support real-world use cases.

We will perform some back-of-the-envelope calculations for the following resources (a brief sketch of this kind of arithmetic follows the list):

  • Storage required for the model to ensure fast and reliable access for inference

  • Storage needed for users’ metadata and their interactions with the system

  • Compute or inference servers required to handle a large number of user requests

  • Network bandwidth/capacity for seamless communication
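
Before diving in, here is a minimal sketch of how such estimates are typically derived. All of the numbers below (request volume, per-server throughput, tokens per response, bytes per token) are hypothetical placeholders, not figures from this lesson:

```python
# Hypothetical workload assumptions (placeholders, not lesson figures)
daily_requests = 10_000_000   # requests per day
avg_response_tokens = 500     # tokens generated per response
bytes_per_token = 4           # rough average size of a text token

seconds_per_day = 24 * 3600
qps = daily_requests / seconds_per_day  # average queries per second

# If one inference server sustains ~10 requests/second (hypothetical),
# the fleet size is average load divided by per-server throughput.
server_throughput = 10
servers_needed = qps / server_throughput

# Outbound bandwidth for the generated text alone (bits per second)
bandwidth_bps = qps * avg_response_tokens * bytes_per_token * 8

print(f"~{qps:.0f} QPS, ~{servers_needed:.0f} servers, "
      f"~{bandwidth_bps / 1e6:.1f} Mbps outbound")
```

In practice, we would provision for peak load rather than the daily average, but the arithmetic stays the same.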

Let’s start with the model size estimation!

Model size estimation

We established in the previous lesson that we will be using a model similar to Llama 3.2 3B to design a text-to-text generation system. To estimate the size of the model, we’ll consider the precision and parameter count as follows:
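As a quick sketch (assuming roughly 3 billion parameters and the standard per-parameter storage cost of each precision), the raw weight storage is simply the parameter count multiplied by the bytes per parameter:

```python
BYTES_PER_GB = 1e9  # using decimal gigabytes

def model_size_gb(num_params: float, bytes_per_param: float) -> float:
    """Raw weight storage: parameters x bytes per parameter."""
    return num_params * bytes_per_param / BYTES_PER_GB

params = 3e9  # Llama 3.2 3B: ~3 billion parameters

for precision, nbytes in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
    print(f"{precision:>10}: ~{model_size_gb(params, nbytes):.0f} GB")
# FP32: ~12 GB, FP16/BF16: ~6 GB, INT8: ~3 GB
```

Keep in mind that these figures cover only the weights; serving the model also requires memory for activations and the KV cache, so the actual inference footprint is higher.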
