...
Deploying the System Design of a Text-to-Video Generation System
Understand the System Design of a text-to-video generation system.
In the previous lesson, we chose a model similar to Mochi 1 for text-to-video generation.
Let’s start with the model size estimation:
Text-to-video model size estimation
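We are considering a model similar to Mochi 1, which has approximately 10 billion parameters. At FP32 floating-point precision (4 bytes per parameter), the model size becomes:
10 billion parameters × 4 bytes = 40 GB
Mochi 1 uses the T5-XXL encoder for text encoding, which has approximately 4.7 billion parameters. Assuming the same FP32 precision, its size becomes:
4.7 billion parameters × 4 bytes = 18.8 GB
Together, the model and the encoder take about 40 GB + 18.8 GB = 58.8 GB.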
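The size arithmetic above can be checked with a short Python sketch. The helper name `model_size_gb` and the precision table are illustrative assumptions, not part of the lesson, and the sketch uses decimal gigabytes (1 GB = 10^9 bytes).

```python
# Bytes per parameter for common precisions; the lesson assumes FP32.
# The FP16 and INT8 entries are included only for comparison.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def model_size_gb(num_params: float, precision: str = "fp32") -> float:
    """Return the size of the weights in decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


video_model_gb = model_size_gb(10e9)    # ~10B-parameter video model
text_encoder_gb = model_size_gb(4.7e9)  # ~4.7B-parameter T5-XXL encoder
total_gb = video_model_gb + text_encoder_gb

print(f"{video_model_gb:.1f} GB + {text_encoder_gb:.1f} GB = {total_gb:.1f} GB")
```

Passing `"fp16"` instead of the default halves each figure, which is one way to see how strongly the chosen precision drives the memory footprint.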
Now, let’s estimate the resources needed to deploy the text-to-video generation model before discussing the System Design.
Resource estimation
Estimating resources for deploying models requires making some initial assumptions. For instance, assuming 100 million daily active users (DAU) allows us to estimate key factors like storage, server capacity for inference, and bandwidth requirements. These numbers will change as the system scales.
Storage estimation
Let’s estimate the basic storage required for user profile and interaction data:
User profile data: Assume that each user’s data takes approximately 10 KB, translating to the following:
100 million users × 10 KB = 1 TB
Note: The model and encoder sizes (58.8 GB) and the user profile data (1 TB) remain constant unless we upgrade the model or the number of users increases. Therefore, we don’t include them in the subsequent storage estimation.
User interaction data: If we store every user interaction, the storage per interaction depends on the size of the request and the video produced. Assume that the model generates a 5-second video at 480p resolution and a frame rate of 30 frames per second for a single request. A single 480p frame at the standard widescreen aspect ratio of 16:9 is around 854 × 480 pixels. This results in a file size of about 1.25 MB for a single generated video. Suppose that each user interacts with the system 10 times per day, so for 100 million users, the storage required would be:
100 million users × 10 interactions per day × 1.25 MB = 1,250 TB per day
Additional storage: To make the user interactions searchable and efficiently accessible, additional storage for indexing and metadata is required, adding 25% on top of the interaction data:
25% × 1,250 TB = 312.5 TB per day
Note: Beyond indexing, redundant storage is required for improved availability, low-latency reads, and possibly to serve videos at different quality levels. Real-world systems tend to have much higher storage needs.
So, the total storage requirement per day would be:
1,250 TB + 312.5 TB = 1,562.5 TB
The following calculator estimates storage for different data types. Changing any input value changes the final storage requirement.
Storage Estimation for Different Types of Data
Row | Parameter | Value | Unit
1 | No. of users | 100 | Million
2 | Size of each user’s data | 10 | KB
3 | User profile data (computed) | 1 | TB
4 | No. of interactions per day for each user | 10 | Per day
5 | Space taken by each interaction | 1.25 | MB
6 | Size of the total user interaction data (computed) | 1250 | TB
7 | Indexing storage percentage | 25 | %
8 | Model size | 58.8 | GB
9 | Total required storage (computed) | 1562.5 | TB
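The calculator's arithmetic can be reproduced with a minimal Python sketch. The function names `profile_storage_tb` and `daily_storage_tb` are hypothetical, and the sketch assumes decimal units (1 TB = 10^6 MB = 10^9 KB). As in the lesson, the total covers only interaction data plus indexing, not the model or profile data.

```python
def profile_storage_tb(users: float, kb_per_user: float) -> float:
    """Storage for user profiles, in decimal terabytes."""
    return users * kb_per_user / 1e9  # KB -> TB


def daily_storage_tb(users: float, interactions_per_user: float,
                     mb_per_interaction: float, index_pct: float):
    """Return (interaction_tb, index_tb, total_tb) for one day."""
    interaction_tb = users * interactions_per_user * mb_per_interaction / 1e6  # MB -> TB
    index_tb = interaction_tb * index_pct / 100  # indexing and metadata overhead
    return interaction_tb, index_tb, interaction_tb + index_tb


# Inputs taken from the calculator rows.
profile_tb = profile_storage_tb(users=100e6, kb_per_user=10)
interaction_tb, index_tb, total_tb = daily_storage_tb(
    users=100e6, interactions_per_user=10, mb_per_interaction=1.25, index_pct=25
)

print(f"User profile data:     {profile_tb:,.1f} TB")
print(f"Interaction data/day:  {interaction_tb:,.1f} TB")
print(f"Indexing overhead/day: {index_tb:,.1f} TB")
print(f"Total per day:         {total_tb:,.1f} TB")
```

Keeping the inputs as function arguments mirrors the calculator: changing the number of users or the per-video size immediately shows the effect on the daily total.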
According to the estimates given above, and assuming a 30-day month, the monthly storage requirement is:
1,562.5 TB per day × 30 days = 46,875 TB (about 46.9 PB)
Inference servers estimation
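With 100 million DAU, each generating around 10 videos daily, the estimated total requests per second (TRPS) is approximately 11,574.
Total requests per day: 100 million users × 10 requests = 1 billion
TRPS = 1 billion requests / 86,400 seconds per day ≈ 11,574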
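The request-rate figure can be verified with a few lines of Python. This is a sketch under the lesson's assumptions (100 million DAU, 10 requests per user per day, load spread evenly across the day); the function name `requests_per_second` is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_second(dau: int, requests_per_user_per_day: int) -> float:
    """Average request rate, assuming load is spread evenly over the day."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY


requests_per_day = 100_000_000 * 10
trps = requests_per_second(dau=100_000_000, requests_per_user_per_day=10)

print(f"Requests per day: {requests_per_day:,}")
print(f"TRPS: {round(trps):,}")
```

This is an average rate. Real traffic has peaks, so provisioning for a multiple of this number is a common precaution.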
...