Deploying the System Design of a Text-to-Image Generation System
Understand the System Design for a text-to-image generation model, focusing on detailed components like prompt processing, enhancement systems, and dynamic contextualization to align outputs with user intent.
Deploying a powerful text-to-image generation model like Stable Diffusion 3.5 Large requires careful consideration of the infrastructure and resources needed to make it available to the public. Before designing the deployment infrastructure, estimating the storage capacity, necessary number of servers, and bandwidth is crucial. Let’s start with the storage estimation:
Storage estimation
Storage estimation includes model size, user profile and interaction data, along with the indexing storage. Let’s estimate all these resources considering 100 million daily active users:
- Model size estimation: Stable Diffusion 3.5 Large has significant memory and computational requirements. We’ll estimate the model’s size in the half-precision floating-point format (FP16), which takes 16 bits (2 bytes) per parameter. With 8.1 billion parameters, this yields the following size: 8.1 billion x 2 bytes = 16.2 GB.
- User profile data: Assume that each user’s profile data takes approximately 10 KB, translating to: 100 million users x 10 KB = 1 TB.
Note: The model’s size (16.2 GB) is fixed, and the user profile data (1 TB) grows only when the number of users increases. Therefore, we don’t include either in the storage required per day or per month.
- User interaction data: If we store every user interaction, the storage per interaction depends on the size of the interaction itself and of the image it produces. Assume that each user interacts with the system 10 times per day and that each interaction generates one 512x512 image (262,144 pixels). Each colored pixel takes 3 bytes, so an uncompressed image takes 262,144 x 3 B = 786,432 B (about 0.79 MB), which we round up to 1 MB. Taking the image as the dominant cost per interaction, this gives: 100 million users x 10 interactions x 1 MB = 1 PB per day.
- Indexing storage: To make user interactions searchable and efficiently accessible, additional storage for indexing and metadata is required. We assume this adds another 25% on top of the interaction data: 25% x (100 million x 10 x 1 MB) = 0.25 PB per day.
So, the total storage requirement for user interactions per day would be: 1 PB + 0.25 PB = 1.25 PB.
According to the above estimates, the monthly storage requirement (assuming a 30-day month) is: 1.25 PB x 30 = 37.5 PB.
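The storage arithmetic above can be reproduced with a short script. This is a minimal sketch under the lesson's stated assumptions; the variable names, the decimal units (1 KB = 1,000 bytes), the 30-day month, and the choice to count only the 1 MB image per interaction are assumptions made for illustration.

```python
# Decimal storage units, chosen to match the round figures in the text.
KB, MB, GB, TB, PB = 10**3, 10**6, 10**9, 10**12, 10**15

PARAMS = 8_100_000_000          # 8.1 billion parameters
BYTES_PER_PARAM = 2             # FP16: 16 bits per parameter
DAU = 100_000_000               # daily active users
PROFILE_BYTES = 10 * KB         # per-user profile data
INTERACTIONS_PER_USER = 10      # interactions per user per day
IMAGE_BYTES = 1 * MB            # 512x512 RGB image, rounded up
INDEX_OVERHEAD_PERCENT = 25     # indexing and metadata overhead
DAYS_PER_MONTH = 30             # assumption: 30-day month

# One-time (non-recurring) storage.
model_bytes = PARAMS * BYTES_PER_PARAM
profile_bytes = DAU * PROFILE_BYTES

# Uncompressed size of one 512x512 image at 3 bytes per pixel.
raw_image_bytes = 512 * 512 * 3

# Recurring storage per day and per month.
interaction_bytes_per_day = DAU * INTERACTIONS_PER_USER * IMAGE_BYTES
index_bytes_per_day = interaction_bytes_per_day * INDEX_OVERHEAD_PERCENT // 100
daily_bytes = interaction_bytes_per_day + index_bytes_per_day
monthly_bytes = daily_bytes * DAYS_PER_MONTH

print(f"Model size:         {model_bytes / GB:.1f} GB")
print(f"User profile data:  {profile_bytes / TB:.1f} TB")
print(f"Raw image size:     {raw_image_bytes:,} bytes")
print(f"Interaction data:   {interaction_bytes_per_day / PB:.2f} PB/day")
print(f"Indexing storage:   {index_bytes_per_day / PB:.2f} PB/day")
print(f"Total per day:      {daily_bytes / PB:.2f} PB")
print(f"Total per month:    {monthly_bytes / PB:.1f} PB")
```

Changing `DAU`, `INTERACTIONS_PER_USER`, or `IMAGE_BYTES` shows how sensitive the recurring storage is to each assumption; the image size dominates everything else.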
Let’s move on to the estimation of inference servers:
Inference servers estimation
At 100 million DAUs, with each user generating around 10 images daily, the estimated Total Requests Per Second (TRPS) is approximately 11574 (1 billion requests spread over 86,400 seconds). Similarly, using our proposed inference formula, an average query’s inference time for an 8.1-billion-parameter model (for 100 iterations) is approximately 5.2 milliseconds. This time is estimated using FP16 precision on an NVIDIA A100 GPU. According to this estimation, a server with one A100 GPU can handle approximately 1 / 5.2 ms ≈ 192 queries per second (QPS), which yields the following number of GPUs: 11574 / 192 ≈ 60.3.
We need 61 GPUs to handle 11574 text-to-image requests per second.
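The GPU count follows directly from the request rate and the per-GPU throughput. The sketch below reproduces it; it assumes requests are spread evenly over the day, each GPU serves one request at a time with no batching, and the 5.2 ms inference time from the text holds. The variable names are illustrative.

```python
import math

DAU = 100_000_000                 # daily active users
REQUESTS_PER_USER_PER_DAY = 10    # images generated per user per day
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 seconds
INFERENCE_SECONDS = 0.0052        # 5.2 ms per request (figure from the text)

# Average request rate, assuming uniform load across the day.
trps = round(DAU * REQUESTS_PER_USER_PER_DAY / SECONDS_PER_DAY)

# Requests one GPU can serve per second, rounded down to whole requests.
qps_per_gpu = math.floor(1 / INFERENCE_SECONDS)

# Round up: a fractional GPU still requires a whole device.
gpus_needed = math.ceil(trps / qps_per_gpu)

print(f"TRPS:        {trps}")
print(f"QPS per GPU: {qps_per_gpu}")
print(f"GPUs needed: {gpus_needed}")
```

Because the estimate uses the daily average, any peak-to-average traffic ratio or redundancy requirement would scale `gpus_needed` upward.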
Bandwidth estimation
Assuming that each request is approximately 2 KB in size, for 11574 TRPS, the ingress bandwidth is estimated as: 11574 x 2 KB ≈ 23.1 MB/s ≈ 185 Mbps.
Assume the response size is 1 MB for an image request with dimensions of 512x512, including the associated metadata. This gives us the following egress bandwidth: 11574 x 1 MB ≈ 11.6 GB/s ≈ 92.6 Gbps.
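Both bandwidth figures are the request rate multiplied by the payload size, converted from bytes to bits. A minimal sketch, assuming decimal units (1 KB = 1,000 bytes) and illustrative variable names:

```python
TRPS = 11_574                 # total requests per second (from the estimate above)
REQUEST_BYTES = 2_000         # ~2 KB per incoming request
RESPONSE_BYTES = 1_000_000    # ~1 MB per generated image, including metadata
BITS_PER_BYTE = 8

# Bytes moved per second in each direction.
ingress_bytes_per_s = TRPS * REQUEST_BYTES
egress_bytes_per_s = TRPS * RESPONSE_BYTES

# Network links are usually rated in bits per second.
ingress_mbps = ingress_bytes_per_s * BITS_PER_BYTE / 10**6
egress_gbps = egress_bytes_per_s * BITS_PER_BYTE / 10**9

print(f"Ingress: {ingress_bytes_per_s / 10**6:.1f} MB/s ({ingress_mbps:.0f} Mbps)")
print(f"Egress:  {egress_bytes_per_s / 10**9:.1f} GB/s ({egress_gbps:.1f} Gbps)")
```

Egress exceeds ingress by a factor of 500, the ratio of response size to request size, so outbound capacity is the constraint to plan around.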
So, we have the following resource estimates:
- Storage required: 1.25 PB per day (37.5 PB per month)
- Inference servers with GPUs: 61
- Ingress bandwidth: approximately 185 Mbps
- Egress bandwidth: approximately 92.6 Gbps
Until now, we have estimated the required resources for deploying an image ...