Deploying the System Design of a Text-to-Text Generation System

Understand the System Design for deploying a text-to-text generation model like ChatGPT.

In the previous lesson, we covered the training process and the evaluation of the Llama 3.2 GenAI model, resulting in a fully trained, production-ready model. With the model prepared, the next critical step is deployment: making it accessible to users at scale. Cloud service providers like AWS, GCP, and Microsoft Azure offer intuitive platforms that simplify deployment for third-party users. However, we will build the system from scratch to develop a solid understanding of the design decisions involved while ensuring it meets our specific requirements.

Deploying large-scale models demands a robust System Design that covers resource estimation, scaling, high availability, fault tolerance, and reliability. These systems require diverse resources, including compute power, storage capacity, and servers capable of performing complex and varied tasks such as load balancing, content filtering, and inference processing.

In this lesson, we’ll build the System Design for deploying a text-to-text (conversational) model, beginning with estimating the resources required for system deployment. From there, we’ll explore how different components are integrated into a robust, efficient, and scalable architecture that can support real-world use cases.

We will perform some back-of-the-envelope calculations for the following resources:

  • Storage required for the model to ensure fast and reliable access for inference

  • Storage needed for users’ metadata and their interactions with the system

  • Compute or inference servers for handling a large number of requests from users

  • Network bandwidth/capacity for seamless communication

Let’s start with the model size estimations!

Model size estimation

We established in the previous lesson that we will be using a model similar to Llama 3.2 3B to design a text-to-text generation system. To estimate the size of the model, we'll consider the parameter count and the precision of each parameter as follows:

Model size = Number of parameters × Size of one parameter

Data precision can be FP64 (64-bit, double precision), FP32 (32-bit, single precision), FP16 (16-bit, half precision), or quantized 8-bit, where the precision of weights and activations is reduced (e.g., from 32-bit to 8-bit) to minimize memory and computation requirements while largely maintaining accuracy. The precision determines the size of one parameter. For example, if we use FP16 (16-bit) precision, the size of one parameter will be 2 bytes. So, according to the formulation above, the 3B model size would be:

3 × 10⁹ parameters × 2 bytes/parameter = 6 × 10⁹ bytes ≈ 6 GB

Note: We will consider FP16 precision for a model with 3 billion parameters, translating to approximately 6 GB of memory. While higher precision ensures greater accuracy, it increases the model’s size, potentially slowing down performance. We can trade off some accuracy for improved performance to optimize our system.

The following table shows the model size using FP32, FP16, and quantized 8-bit data precision for the Llama 3.2 3B model:

| Precision | Size of one parameter | Model size (3B parameters) |
|-----------|-----------------------|----------------------------|
| FP32 | 4 bytes | 12 GB |
| FP16 | 2 bytes | 6 GB |
| 8-bit (quantized) | 1 byte | 3 GB |
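As a quick sanity check, here is a minimal Python sketch that reproduces these numbers. The 3-billion parameter count and the byte width per precision come from the lesson; the rest is just arithmetic.

```python
# Back-of-the-envelope model size estimation for a 3B-parameter model.
# Byte widths correspond to FP32, FP16, and 8-bit quantized precision.

PARAMS = 3e9  # 3 billion parameters (Llama 3.2 3B, per the lesson)

BYTES_PER_PARAM = {
    "FP32": 4,
    "FP16": 2,
    "8-bit (quantized)": 1,
}

for precision, nbytes in BYTES_PER_PARAM.items():
    size_gb = PARAMS * nbytes / 1e9  # decimal gigabytes
    print(f"{precision}: {size_gb:.0f} GB")
```

Running it prints 12 GB, 6 GB, and 3 GB, matching the table above.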

Let’s approximate other resources before moving to the infrastructure for deployment.

Resource estimation

Resource estimation ensures the infrastructure can handle expected workloads efficiently while remaining cost-effective. It helps determine the computing power, storage, and network requirements for scalability, reliability, and performance. Accurate estimation prevents over-provisioning, which increases costs, and under-provisioning, which can lead to system failures or poor user experiences.

Estimating resources for deploying models, such as a 3B parameter model, requires making some initial assumptions. For instance, assuming 100 million daily active users (DAU) allows us to estimate key factors like storage, server capacity for inference, and bandwidth requirements, though these numbers will be refined as the system scales.

Storage estimation

In addition to storing the model itself, we must store users’ data, interactions with the model, and model metadata. Other elements like context information, logs, prompt templates, or configuration files may also need to be stored depending on the system. While there are many components to consider for a fully operational system, we will focus on the key aspects for brevity.

  • User profile data: For storing users’ metadata, assume that each user’s data takes approximately 10 KB. For 100 million users, this translates to:

100 M users × 10 KB/user = 1 TB

Note: The model’s size (6 GB) and user profile data (1 TB) will remain constant unless the number of users increases. Therefore, we don’t include them in the subsequent storage estimation.

  • User interaction data: If we store each user interaction, the required storage will depend on the size of each interaction. Assume that each user interacts 10 times daily with the system, consuming 2 KB of space per interaction. For 100 M users, the storage requirement per day would be:

100 M users × 10 interactions/user × 2 KB/interaction = 2 TB per day

  • Indexing storage: We would need additional storage for indexing the user interaction data for fast retrieval. Let’s assume an average storage increase of 25% for indexing.

So, the total storage requirement for user interaction data per day would be:

2 TB + (25% × 2 TB) = 2 TB + 0.5 TB = 2.5 TB per day
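To make these estimates easy to adjust, here is a minimal Python sketch that recomputes them from the stated assumptions (100 M daily active users, 10 KB of profile data per user, 10 interactions per user per day at 2 KB each, and 25% indexing overhead; decimal units are used throughout, i.e., 1 KB = 1,000 bytes):

```python
# Daily storage estimation under the lesson's assumptions.

DAU = 100e6                # daily active users (assumption)
PROFILE_KB = 10            # profile metadata per user, in KB
INTERACTIONS_PER_DAY = 10  # interactions per user per day
INTERACTION_KB = 2         # size of each interaction, in KB
INDEX_OVERHEAD = 0.25      # extra storage for indexing (25%)

KB, TB = 1e3, 1e12  # decimal units, matching the back-of-the-envelope style

profile_tb = DAU * PROFILE_KB * KB / TB  # constant, not a daily cost
interactions_tb = DAU * INTERACTIONS_PER_DAY * INTERACTION_KB * KB / TB
indexing_tb = interactions_tb * INDEX_OVERHEAD

print(f"User profile data:     {profile_tb:.1f} TB (constant)")
print(f"Interaction data/day:  {interactions_tb:.1f} TB")
print(f"Indexing overhead/day: {indexing_tb:.1f} TB")
print(f"Total new storage/day: {interactions_tb + indexing_tb:.1f} TB")
```

This prints 1.0 TB of (constant) profile data and a total of 2.5 TB of new storage per day, matching the calculations above.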

Typically, production-grade setups require redundant storage to ensure high availability and reliability. Depending on the data’s sensitivity, multiple copies may be saved in different locations across distributed systems, so the storage requirements for real systems may be several times higher. For example, with a replication factor of 3, the 2.5 TB of new data per day would grow to 7.5 TB.

The following table summarizes the storage estimates for the different types of data, based on the assumptions above.

Storage Estimation for Different Types of Data

| Item | Value |
|------|-------|
| No. of users | 100 million |
| Size of each user’s data | 10 KB |
| User profile data (computed) | 1 TB |
| No. of user interactions per day | 10 |
| Space taken by each interaction | 2 KB |
| Total user interaction data per day (computed) | 2 TB |
| Additional indexing storage | 25% |
| Model size | 6 GB |
| Total estimated storage per day (computed) | 2.5 TB |

According to the above estimates, the system accumulates roughly 2.5 TB of new data per day from user interactions alone, and this figure grows as the user base and activity increase.
