Deploying the System Design of a Text-to-Video Generation System
Understand the System Design of a text-to-video generation system.
In the previous lesson, we chose a model similar to Mochi 1.
Let’s start with the storage estimation:
Storage estimation
Storage estimation includes model size, user profile and related data, and indexing storage. Let’s estimate all these resources considering 100 million daily active users:
Model size estimation: We are considering a model similar to Mochi 1, which has approximately 10 billion parameters. At FP16 floating-point precision (2 bytes per parameter), the model size becomes:

10 × 10^9 parameters × 2 bytes = 20 GB
Mochi 1 uses the T5-XXL encoder for text encoding, which has approximately 4.7 billion parameters. At the same FP16 precision, its size becomes:

4.7 × 10^9 parameters × 2 bytes = 9.4 GB
User profile data: Assume that each user’s data takes approximately 10 KB, translating to the following:

100 × 10^6 users × 10 KB = 1 TB
Note: The model and the encoder sizes (29.4 GB) and user profile data (1 TB) will remain constant unless we upgrade the model or the number of users increases. Therefore, we don’t include them in the subsequent storage estimation.
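These fixed costs can be double-checked with a few lines of Python (a minimal sketch; sizes use decimal GB/TB, and the parameter counts are the estimates above):

```python
BYTES_FP16 = 2  # bytes per parameter at FP16 precision

model_gb = 10e9 * BYTES_FP16 / 1e9      # Mochi 1-like model, ~10 B params -> 20 GB
encoder_gb = 4.7e9 * BYTES_FP16 / 1e9   # T5-XXL encoder, ~4.7 B params -> 9.4 GB
profile_tb = 100e6 * 10e3 / 1e12        # 100 M users x 10 KB each -> 1 TB

print(f"Model + encoder: {model_gb + encoder_gb:.1f} GB, profiles: {profile_tb:.0f} TB")
```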
User interaction data: If we store each user interaction, the storage per interaction depends on the interaction’s size and the video produced. Assume that the model is set to generate a 5-second video with a resolution of 480p at a frame rate of 30 frames per second for a single request. A single 480p frame at a standard widescreen aspect ratio of 16:9 (854 × 480 pixels at 3 bytes per pixel) translates to around 1.25 MB. This results in a size of 150 frames × 1.25 MB = 187.5 MB for a single generated video. Suppose that each user interacts 10 times per day with the system; for 100 million users, the storage required would be:

100 × 10^6 users × 10 interactions × 187.5 MB = 187.5 PB per day
Indexing storage: To make the user interactions searchable and efficiently accessible, additional storage for indexing and metadata would be required, adding up to 25% of the interaction storage, i.e., 0.25 × 187.5 PB ≈ 46.9 PB per day.
Note: Other than indexing, redundant storage is required for improved availability, low-latency reads, and possibly to serve videos at varied quality levels. Real-world systems tend to have much higher storage needs.
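The interaction-storage arithmetic can be reproduced in a short script (a sketch that, like the estimate above, assumes raw uncompressed frames rounded to 1.25 MB each):

```python
frame_mb_raw = 854 * 480 * 3 / 1e6    # 480p at 16:9, 3 bytes/pixel -> ~1.23 MB
frame_mb = 1.25                       # rounded figure used in the estimate

frames_per_video = 5 * 30             # 5-second clip at 30 fps -> 150 frames
video_mb = frames_per_video * frame_mb                   # 187.5 MB per video

interactions_per_day = 100e6 * 10     # 100 M DAU x 10 requests each
interaction_pb = interactions_per_day * video_mb / 1e9   # 187.5 PB/day
indexing_pb = 0.25 * interaction_pb                      # ~46.9 PB/day

print(f"{video_mb} MB/video, {interaction_pb} PB/day video, {indexing_pb} PB/day indexing")
```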
So, the total storage requirement per day would be:

187.5 PB + 46.9 PB ≈ 234.4 PB per day
According to the estimates given above, the monthly storage requirement is:

234.4 PB/day × 30 days ≈ 7,031 PB ≈ 7 EB per month
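Putting the daily and monthly totals together (a sketch using the rounded figures above):

```python
interaction_pb = 187.5               # raw video storage generated per day
daily_pb = interaction_pb * 1.25     # +25% indexing overhead -> 234.375 PB/day
monthly_pb = daily_pb * 30           # 7,031.25 PB -> ~7 EB per month
print(f"{daily_pb} PB/day, {monthly_pb / 1000:.2f} EB/month")
```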
Inference servers estimation
At 100 million DAUs, with each user generating around 10 videos daily, the estimated Total Requests Per Second (TRPS) is approximately:

TRPS = (100 × 10^6 × 10) / 86,400 ≈ 11,574 requests per second

Using our proposed inference formula for generating a 5-second video at 30 frames per second, an average query’s inference time is 0.75 seconds. We assume the model takes 50 iterations per frame, resulting in 7,500 iterations (150 frames × 50 iterations).
According to this estimation, the QPS for an NVIDIA server with an A100 GPU will be:

QPS = 1 / 0.75 ≈ 1.33 queries per second

Number of servers = TRPS / QPS = 11,574 / 1.33 ≈ 8,681 A100 servers
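The whole chain, from request rate to server count, can be sketched in a few lines (the 0.75-second per-query inference time and 50 iterations per frame are the lesson’s figures; single-GPU A100 servers handling one query at a time are an assumption):

```python
dau = 100e6
videos_per_user = 10
trps = dau * videos_per_user / 86_400      # ~11,574 requests/s across the system

frames = 5 * 30                            # 150 frames per 5-second clip
iterations = frames * 50                   # 50 iterations/frame -> 7,500 total
inference_s = 0.75                         # average per-query inference time

qps_per_server = 1 / inference_s           # ~1.33 queries/s on one A100 server
servers = trps / qps_per_server            # ~8,681 servers to absorb the load
print(round(trps), iterations, round(servers))
```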