...


Deploying the System Design of a Text-to-Video Generation System

Understand the System Design of a text-to-video generation system.

In the previous lesson, we chose a model similar to Mochi 1 (https://huggingface.co/genmo/mochi-1-preview) for the text-to-video generation system and presented the training process and the required resources. In this lesson, our focus is on the deployment infrastructure for such a model. We estimate various resources, followed by design considerations and a detailed System Design.

Let’s start with the storage estimation:

Storage estimation

Storage estimation includes model size, user profile and related data, and indexing storage. Let’s estimate all these resources considering 100 million daily active users:

  • Model size estimation: We are considering a model similar to Mochi 1, which has approximately 10 billion parameters. For FP16 floating-point precision (2 bytes per parameter), the model size becomes 10 x 10^9 x 2 bytes = 20 GB.

Mochi 1 uses the T5-XXL encoder for text encoding, which has approximately 4.7 billion parameters. Assuming the same FP16 precision, its size becomes 4.7 x 10^9 x 2 bytes = 9.4 GB. So overall, we need 29.4 GB to store both the T5-XXL and Mochi 1 models.

  • User profile data: Assume that each user’s data takes approximately 10 KB. For 100 million users, this translates to 100 x 10^6 x 10 KB = 1 TB.

Note: The model and the encoder sizes (29.4 GB) and user profile data (1 TB) will remain constant unless we upgrade the model or the number of users increases. Therefore, we don’t include them in the subsequent storage estimation.
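The fixed-cost estimates above can be sketched in a few lines of Python. The parameter counts and the 10 KB profile size are the lesson's assumptions; decimal units (1 GB = 10^9 bytes) are used throughout.

```python
# Back-of-the-envelope sketch of the fixed storage costs: model weights
# at FP16 precision (2 bytes per parameter) and user profile data.
GB = 10**9
TB = 10**12

def fp16_size_bytes(num_params: int) -> int:
    """Size of a model stored at FP16 precision (2 bytes per parameter)."""
    return num_params * 2

mochi_bytes = fp16_size_bytes(10 * 10**9)         # ~10B parameters
t5_xxl_bytes = fp16_size_bytes(int(4.7 * 10**9))  # ~4.7B parameters

print(f"Mochi 1:  {mochi_bytes / GB:.1f} GB")                   # 20.0 GB
print(f"T5-XXL:   {t5_xxl_bytes / GB:.1f} GB")                  # 9.4 GB
print(f"Combined: {(mochi_bytes + t5_xxl_bytes) / GB:.1f} GB")  # 29.4 GB

users = 100 * 10**6
profile_bytes = 10 * 1000  # 10 KB per user, as assumed in this lesson
print(f"Profiles: {users * profile_bytes / TB:.1f} TB")         # 1.0 TB
```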

  • User interaction data: If we store each user interaction, the storage per interaction depends on the interaction’s size and the video produced. Assume that the model is set to generate a 5-second video at 480p resolution and a frame rate of 30 frames per second for a single request. A single 480p frame at the standard widescreen aspect ratio of 16:9 contains around 854 x 480 = 409,920 pixels. This results in a size of approximately 1.25 MB for a single generated video. Suppose that each user interacts 10 times per day with the system; for 100 million users, the storage required would be 100 x 10^6 x 10 x 1.25 MB = 1.25 PB per day.

  • Indexing storage: To make the user interactions searchable and efficiently accessible, additional storage for indexing and metadata would be required, adding roughly 25% on top of the interaction storage: 0.25 x 1.25 PB ≈ 0.31 PB per day.

Note: Other than indexing, redundant storage is required for improved availability, low latency reading, and maybe even to serve videos of varied quality. Real-world systems tend to have much higher storage needs.

So, the total storage requirement per day would be 1.25 PB + 0.31 PB ≈ 1.56 PB.

According to the estimates given above, the monthly storage requirement is 1.56 PB/day x 30 days ≈ 46.9 PB.
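The daily and monthly figures above follow from a short calculation, sketched here under the lesson's assumptions (100M DAUs, 10 interactions per user per day, ~1.25 MB per generated video, 25% extra for indexing, decimal units):

```python
# Sketch of the daily and monthly storage estimate for user interactions,
# indexing, and metadata. All figures are the lesson's assumptions.
MB = 10**6
PB = 10**15

users = 100 * 10**6
interactions_per_user = 10   # interactions per user per day
video_bytes = 1.25 * MB      # ~1.25 MB per generated 5-second video

interaction_storage = users * interactions_per_user * video_bytes
indexing = 0.25 * interaction_storage     # +25% for indexing/metadata
daily_total = interaction_storage + indexing
monthly_total = daily_total * 30

print(f"Interactions/day: {interaction_storage / PB:.2f} PB")  # 1.25 PB
print(f"Indexing/day:     {indexing / PB:.2f} PB")             # 0.31 PB
print(f"Daily total:      {daily_total / PB:.2f} PB")          # 1.56 PB
print(f"Monthly total:    {monthly_total / PB:.1f} PB")        # 46.9 PB
```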

Inference servers estimation

At 100 million DAUs, with each user generating around 10 videos daily, the estimated Total Requests Per Second (TRPS) is approximately 100 x 10^6 x 10 / 86,400 ≈ 11,574. Using our proposed inference formula for generating a 5-second video at 30 frames per second, the video contains 150 frames; assuming the model takes 50 iterations per frame, this results in C = 7,500 iterations, and an average query’s inference time is approximately 0.71 seconds. This time is estimated for a 14.7-billion-parameter model (Mochi 1 and T5-XXL) at FP16 precision on the NVIDIA A100 GPU, which delivers 312 TFLOPS at FP16.
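The inference-time estimate can be sketched as follows. It approximates the work per iteration as 2 FLOPs per model parameter and divides the total by the GPU's peak FP16 throughput, which is a rough compute-bound model that ignores memory-bandwidth and batching effects:

```python
# Sketch of the inference-time and server-count estimate, using the
# lesson's assumptions: 14.7B total parameters, 2 FLOPs per parameter
# per iteration, and the A100's 312 TFLOPS peak FP16 throughput.
params = 14.7 * 10**9       # Mochi 1 (10B) + T5-XXL (4.7B)
gpu_flops = 312 * 10**12    # NVIDIA A100 peak FP16 throughput (FLOPS)
frames = 5 * 30             # 5-second video at 30 fps
iterations = 50 * frames    # 50 iterations per frame -> C = 7,500

inference_time = 2 * params * iterations / gpu_flops
print(f"Inference time: {inference_time:.2f} s")   # ~0.71 s

trps = 100 * 10**6 * 10 / 86_400   # 100M DAUs, 10 requests each per day
qps_per_gpu = 1 / inference_time   # ~1.41 queries per second per GPU
servers = trps / qps_per_gpu
print(f"TRPS: {trps:.0f}, A100 servers needed: {servers:.0f}")
```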

According to this estimation, the QPS for an NVIDIA server with an A100 GPU will be 1/0.71 ≈ 1.41, which yields us the following number of ...