...


Deploying the System Design of a Text-to-Video Generation System

Understand the System Design of a text-to-video generation system.

In the previous lesson, we chose a model similar to Mochi 1 (https://huggingface.co/genmo/mochi-1-preview) for the text-to-video generation system and presented the training process and the required resources. In this lesson, our focus is on the deployment infrastructure for such a model. We estimate various resources, followed by design considerations and a detailed System Design.

Let’s start with the storage estimation:

Storage estimation

Storage estimation includes model size, user profile and related data, and indexing storage. Let’s estimate all these resources considering 100 million daily active users:

  • Model size estimation: We are considering a model similar to Mochi 1, which has approximately 10 billion parameters. For FP16 floating-point precision (2 bytes per parameter), the model size becomes 10 x 10^9 x 2 bytes = 20 GB.

Mochi 1 uses the T5-XXL encoder for text encoding, which has approximately 4.7 billion parameters. Assuming the same FP16 precision, its size becomes 4.7 x 10^9 x 2 bytes = 9.4 GB. So overall, we need 29.4 GB to store both the T5-XXL and Mochi 1 models.

  • User profile data: Assume that each user’s data takes approximately 10 KB. For 100 million users, this translates to 100 x 10^6 x 10 KB = 1 TB.

Note: The model and the encoder sizes (29.4 GB) and user profile data (1 TB) will remain constant unless we upgrade the model or the number of users increases. Therefore, we don’t include them in the subsequent storage estimation.
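The fixed-cost estimates above can be sketched in a few lines of Python. The parameter counts and the 10 KB profile size are the lesson's assumptions; decimal units (1 GB = 10^9 bytes) are used throughout.

```python
# Back-of-the-envelope sketch of the fixed storage costs: model weights
# at FP16 precision (2 bytes per parameter) and user profile data.
GB = 10**9
TB = 10**12

def fp16_size_bytes(num_params: int) -> int:
    """Size of a model stored at FP16 precision (2 bytes per parameter)."""
    return num_params * 2

mochi_bytes = fp16_size_bytes(10 * 10**9)         # ~10B parameters
t5_xxl_bytes = fp16_size_bytes(int(4.7 * 10**9))  # ~4.7B parameters

print(f"Mochi 1:  {mochi_bytes / GB:.1f} GB")                   # 20.0 GB
print(f"T5-XXL:   {t5_xxl_bytes / GB:.1f} GB")                  # 9.4 GB
print(f"Combined: {(mochi_bytes + t5_xxl_bytes) / GB:.1f} GB")  # 29.4 GB

users = 100 * 10**6
profile_bytes = 10 * 1000  # 10 KB per user, as assumed in this lesson
print(f"Profiles: {users * profile_bytes / TB:.1f} TB")         # 1.0 TB
```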

  • User interaction data: If we store each user interaction, the storage per interaction depends on the interaction’s size and the video produced. Assume that the model is set to generate a 5-second video at 480p resolution and a frame rate of 30 frames per second for a single request. A single 480p frame at the standard widescreen aspect ratio of 16:9 contains around 854 x 480 = 409,920 pixels. This results in a size of approximately 1.25 MB for a single generated video. Suppose that each user interacts 10 times per day with the system; for 100 million users, the storage required would be 100 x 10^6 x 10 x 1.25 MB = 1.25 PB per day.

  • Indexing storage: To make the user interactions searchable and efficiently accessible, additional storage for indexing and metadata would be required, adding roughly 25% on top of the interaction storage: 0.25 x 1.25 PB ≈ 0.31 PB per day.

Note: Other than indexing, redundant storage is required for improved availability, low latency reading, and maybe even to serve videos of varied quality. Real-world systems tend to have much higher storage needs.

So, the total storage requirement per day would be 1.25 PB + 0.31 PB ≈ 1.56 PB.

According to the estimates given above, the monthly storage requirement is 1.56 PB/day x 30 days ≈ 46.9 PB.
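The daily and monthly figures above follow from a short calculation, sketched here under the lesson's assumptions (100M DAUs, 10 interactions per user per day, ~1.25 MB per generated video, 25% extra for indexing, decimal units):

```python
# Sketch of the daily and monthly storage estimate for user interactions,
# indexing, and metadata. All figures are the lesson's assumptions.
MB = 10**6
PB = 10**15

users = 100 * 10**6
interactions_per_user = 10   # interactions per user per day
video_bytes = 1.25 * MB      # ~1.25 MB per generated 5-second video

interaction_storage = users * interactions_per_user * video_bytes
indexing = 0.25 * interaction_storage     # +25% for indexing/metadata
daily_total = interaction_storage + indexing
monthly_total = daily_total * 30

print(f"Interactions/day: {interaction_storage / PB:.2f} PB")  # 1.25 PB
print(f"Indexing/day:     {indexing / PB:.2f} PB")             # 0.31 PB
print(f"Daily total:      {daily_total / PB:.2f} PB")          # 1.56 PB
print(f"Monthly total:    {monthly_total / PB:.1f} PB")        # 46.9 PB
```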

Inference servers estimation

At 100 million DAUs, with each user generating around 10 videos daily, the estimated Total Requests Per Second (TRPS) is approximately 100 x 10^6 x 10 / 86,400 ≈ 11,574. Using our proposed inference formula for generating a 5-second video at 30 frames per second, the video contains 150 frames; assuming the model takes 50 iterations per frame, this results in C = 7,500 iterations, and an average query’s inference time is approximately 0.71 seconds. This time is estimated for a 14.7-billion-parameter model (Mochi 1 and T5-XXL) at FP16 precision on the NVIDIA A100 GPU, which delivers 312 TFLOPS at FP16.
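The inference-time estimate can be sketched as follows. It approximates the work per iteration as 2 FLOPs per model parameter and divides the total by the GPU's peak FP16 throughput, which is a rough compute-bound model that ignores memory-bandwidth and batching effects:

```python
# Sketch of the inference-time and server-count estimate, using the
# lesson's assumptions: 14.7B total parameters, 2 FLOPs per parameter
# per iteration, and the A100's 312 TFLOPS peak FP16 throughput.
params = 14.7 * 10**9       # Mochi 1 (10B) + T5-XXL (4.7B)
gpu_flops = 312 * 10**12    # NVIDIA A100 peak FP16 throughput (FLOPS)
frames = 5 * 30             # 5-second video at 30 fps
iterations = 50 * frames    # 50 iterations per frame -> C = 7,500

inference_time = 2 * params * iterations / gpu_flops
print(f"Inference time: {inference_time:.2f} s")   # ~0.71 s

trps = 100 * 10**6 * 10 / 86_400   # 100M DAUs, 10 requests each per day
qps_per_gpu = 1 / inference_time   # ~1.41 queries per second per GPU
servers = trps / qps_per_gpu
print(f"TRPS: {trps:.0f}, A100 servers needed: {servers:.0f}")
```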

According to this estimation, the QPS for an NVIDIA server with an A100 GPU will be 1/0.71 ≈ 1.41, which yields us the following number of ...