Deploying the System Design of a Text-to-Image Generation System
Understand the System Design for a text-to-image generation model, focusing on detailed components like prompt processing, enhancement systems, and dynamic contextualization to align outputs with user intent.
Deploying a powerful text-to-image generation model like Stable Diffusion 3.5 Large requires careful consideration of the infrastructure and resources needed to make it available to the public. Before designing the deployment infrastructure, it is crucial to estimate the required storage capacity, number of inference servers, and bandwidth. Let’s start with the storage estimation:
Storage estimation
Storage estimation includes model size, user profile and interaction data, along with the indexing storage. Let’s estimate all these resources considering 100 million daily active users:
Model size estimation: The Stable Diffusion 3.5 Large model has significant memory and computational requirements. We’ll consider the half-precision floating-point format (FP16) to estimate the model’s size, which takes 16 bits (2 bytes) per parameter. With 8.1 billion parameters, this yields the following size: 8.1 billion parameters × 2 bytes/parameter ≈ 16.2 GB.
User profile data: For storing users’ data, assume that each user’s data takes approximately 10 KB, translating to: 100 million users × 10 KB/user = 1 TB.
Note: The model’s size (16.2 GB) and user profile data (1 TB) will remain constant unless the number of users increases. Therefore, we don’t include them in the storage required per day or month.
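These two fixed costs can be verified with a quick back-of-the-envelope calculation. The constants below are the assumptions stated above (8.1 billion parameters, FP16, 100 million daily active users, 10 KB of profile data per user):

```python
# Fixed storage costs: model weights and user profile data.
# Assumptions come from the text above.

PARAMS = 8.1e9          # Stable Diffusion 3.5 Large parameter count
BYTES_PER_PARAM = 2     # FP16 = 16 bits = 2 bytes
DAU = 100e6             # daily active users
PROFILE_KB = 10         # per-user profile data

model_size_gb = PARAMS * BYTES_PER_PARAM / 1e9   # bytes -> GB
profile_tb = DAU * PROFILE_KB * 1e3 / 1e12       # bytes -> TB

print(f"Model size:        {model_size_gb:.1f} GB")   # 16.2 GB
print(f"User profile data: {profile_tb:.0f} TB")      # 1 TB
```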
User interaction data: If we store each user interaction, the storage per interaction depends on the size of the interaction and the image produced during it. Assume that each user interacts 10 times per day with the system and that the model generates a 1 MB image for the user; this gives us the following: 100 million users × 10 interactions/day × 1 MB/interaction ≈ 1 PB/day.
Note: Assume that the image generated by the model is 512 × 512 = 262,144 pixels. Each colored pixel takes 3 bytes, so an image’s total storage would be approximately 262,144 × 3 B ≈ 1 MB.
Indexing storage: To make user interactions searchable and efficiently accessible, additional storage for indexing and metadata would be required. This might add another 25% of storage.
So, the total storage requirement for user interaction data per day would be: 1 PB/day × (1 + 0.25) = 1.25 PB/day.
According to the above estimates, the monthly storage requirement is: 1.25 PB/day × 30 days = 37.5 PB/month.
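The interaction-storage arithmetic can be sketched the same way, using the assumptions above (10 interactions per user per day, roughly 1 MB per generated image, 25% indexing overhead, a 30-day month):

```python
# Per-day and per-month storage for user interaction data.
# Assumptions come from the text above.

DAU = 100e6                # daily active users
INTERACTIONS_PER_DAY = 10  # interactions per user per day
IMAGE_MB = 1               # 512 x 512 pixels x 3 B/pixel ~= 1 MB per image
INDEX_OVERHEAD = 0.25      # extra storage for indexing and metadata

raw_pb_per_day = DAU * INTERACTIONS_PER_DAY * IMAGE_MB * 1e6 / 1e15
daily_pb = raw_pb_per_day * (1 + INDEX_OVERHEAD)
monthly_pb = daily_pb * 30

print(f"Raw interactions: {raw_pb_per_day:.2f} PB/day")  # 1.00 PB/day
print(f"With indexing:    {daily_pb:.2f} PB/day")        # 1.25 PB/day
print(f"Monthly:          {monthly_pb:.1f} PB/month")    # 37.5 PB/month
```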
Let’s move on to the estimation of inference servers:
Inference servers estimation
At 100 million DAUs, with each user generating around 10 images daily, the estimated Total Requests Per Second (TRPS) is approximately 11,574. Similarly, using our proposed inference formula, an average query’s inference time for an 8.1-billion-parameter model (for 100 iterations) is approximately 5.2 milliseconds. This time is estimated using FP16 precision on an NVIDIA A100 GPU. According to this estimation, the QPS for an NVIDIA server with an A100 GPU will be approximately 192, which yields the following number of GPUs: 11,574 / 192 ≈ 61.
We need 61 GPUs to handle 11,574 text-to-image requests per second.
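The GPU count follows from dividing the total request rate by the per-GPU throughput. A small sketch under the stated assumptions (5.2 ms per query at FP16 on an A100), rounding up because partial GPUs cannot serve traffic:

```python
import math

# Inference-server estimation from the figures in the text.
DAU = 100e6              # daily active users
REQUESTS_PER_USER = 10   # images generated per user per day
SECONDS_PER_DAY = 86_400

trps = DAU * REQUESTS_PER_USER / SECONDS_PER_DAY  # total requests per second
inference_time_s = 5.2e-3        # per-query inference time (FP16, A100)
qps_per_gpu = 1 / inference_time_s

gpus = math.ceil(trps / qps_per_gpu)

print(f"TRPS:        {trps:,.0f}")        # 11,574
print(f"QPS per GPU: {qps_per_gpu:.0f}")  # 192
print(f"GPUs needed: {gpus}")             # 61
```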
Bandwidth estimation
Assuming that each request is approximately 2 KB in size, for 11,574 TRPS, the ingress bandwidth is estimated as: 11,574 requests/s × 2 KB/request ≈ 23.1 MB/s.
Assume the response size is 1 MB for an image request with dimensions of 512 × 512, including the associated metadata. This gives us the following egress bandwidth: 11,574 responses/s × 1 MB/response ≈ 11.6 GB/s.
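Both bandwidth figures can be checked with the same style of calculation, using the request and response sizes assumed above (2 KB in, roughly 1 MB out):

```python
# Ingress and egress bandwidth from the request rate estimated earlier.
TRPS = 11_574        # total requests per second
REQUEST_KB = 2       # size of each incoming request
RESPONSE_MB = 1      # 512 x 512 image plus metadata

ingress_mb_s = TRPS * REQUEST_KB * 1e3 / 1e6   # bytes/s -> MB/s
egress_gb_s = TRPS * RESPONSE_MB * 1e6 / 1e9   # bytes/s -> GB/s

print(f"Ingress: {ingress_mb_s:.1f} MB/s")  # 23.1 MB/s
print(f"Egress:  {egress_gb_s:.1f} GB/s")   # 11.6 GB/s
```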
So, we have the following number of resource estimations:
Storage required: ≈ 1.25 PB/day (≈ 37.5 PB/month), plus the fixed 16.2 GB model and 1 TB of profile data
Inference servers with GPUs: 61 NVIDIA A100 GPUs
Ingress bandwidth: ≈ 23.1 MB/s
Egress bandwidth: ≈ 11.6 GB/s
Until now, we have estimated the required resources for deploying an image ...