
Deploying the System Design of a Text-to-Video Generation System


Understand the System Design of a text-to-video generation system.

In the previous lesson, we chose a model similar to Mochi 1 (https://huggingface.co/genmo/mochi-1-preview) for the text-to-video generation system and presented the training process and the required resources. In this lesson, our focus is on the deployment infrastructure for such a model. We estimate various resources, followed by design considerations and a detailed System Design.

Let’s start with the model size estimation:

Text-to-video model size estimation

We are considering a model similar to Mochi 1, which has approximately 10 billion parameters. For FP32 floating-point precision (4 bytes per parameter), the model size becomes 10 × 10^9 × 4 bytes = 40 GB.

Mochi 1 uses the T5-XXL encoder for text encoding, which has approximately 4.7 billion parameters. Assuming the same FP32 precision, its size becomes 4.7 × 10^9 × 4 bytes ≈ 18.8 GB. So overall, we need 58.8 GB to store both the T5-XXL and Mochi 1 models.
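As a quick sanity check, the sizes above can be reproduced with a short script. This is a sketch of the lesson's arithmetic, taking 1 GB as 10^9 bytes; the parameter counts come from the lesson.

```python
# Model size = parameter count × bytes per parameter.
BYTES_PER_PARAM_FP32 = 4

def model_size_gb(params_billions: float) -> float:
    """Size in GB, with 1 GB = 10^9 bytes (matching the lesson's math)."""
    return params_billions * 1e9 * BYTES_PER_PARAM_FP32 / 1e9

mochi_gb = model_size_gb(10)   # Mochi 1: ~10 B parameters
t5_gb = model_size_gb(4.7)     # T5-XXL encoder: ~4.7 B parameters

print(f"{mochi_gb:.1f} {t5_gb:.1f} {mochi_gb + t5_gb:.1f}")  # 40.0 18.8 58.8
```

Switching the precision (e.g., 2 bytes per parameter for FP16) would halve these figures, which is a common deployment optimization.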

Now, let’s estimate resources for deploying the text-to-video generation model before discussing the System Design.

Resource estimation

Estimating resources for deploying models requires making some initial assumptions. For instance, assuming 100 million daily active users (DAU) allows us to estimate key factors like storage, server capacity for inference, and bandwidth requirements, though these numbers will update as the system scales.

Storage estimation

Let’s estimate the basic storage required for user profile and interaction data:

  • User profile data: Assume that each user’s data takes approximately 10 KB, translating to the following: 100 million × 10 KB = 1 TB.

Note: The model and the encoder sizes (58.8 GB) and user profile data (1 TB) will remain constant unless we upgrade the model or the number of users increases. Therefore, we don’t include them in the subsequent storage estimation.

  • User interaction data: If we store each user’s interaction data, the storage per interaction will depend on the interaction’s size and the video produced. Assume that the model is set to generate a 5-second video with a resolution of 480p at a frame rate of 30 frames per second for a single request. A single 480p frame at a standard widescreen aspect ratio of 16:9 contains around 854 × 480 = 409,920 pixels. After compression, this results in a size of about 1.25 MB for a single generated video. Suppose that each user interacts 10 times per day with the system; for 100 million users, the storage required would be 100 million × 10 × 1.25 MB = 1,250 TB per day.

  • Additional storage: To make the user interactions searchable and efficiently accessible, additional storage for indexing and metadata would be required, adding roughly 25% on top of the interaction data.

Note: Other than indexing, redundant storage is required for improved availability, low latency reading, and maybe even to serve videos of varied quality. Real-world systems tend to have much higher storage needs.

So, the total storage requirement per day would be: 1,250 TB × 1.25 = 1,562.5 TB.

The following calculator estimates storage for different data types. We can change the numbers and see the change in the final storage requirement.

Storage Estimation for Different Types of Data

1. No. of users: 100 million
2. Size of each user’s data: 10 KB
3. User profile data: 1 TB
4. No. of interactions per day for each user: 10 per day
5. Space taken by each interaction: 1.25 MB
6. Size of the total user interaction data: 1,250 TB
7. Indexing storage percentage: 25%
8. Model size: 58.8 GB
9. Total required storage: 1,562.5 TB

According to the estimates given above, the monthly storage requirement is: 1,562.5 TB × 30 ≈ 46,875 TB ≈ 46.9 PB.
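The calculator’s arithmetic can be reproduced in a short script. All inputs are the lesson’s assumptions; note that 1.25 MB per 5-second video is consistent with an assumed compressed bitrate of about 2 Mbps for 480p (2 Mbps × 5 s ÷ 8 = 1.25 MB).

```python
# Storage estimator mirroring the calculator above (names are illustrative).
USERS = 100_000_000          # daily active users
PROFILE_KB = 10              # per-user profile data
INTERACTIONS_PER_DAY = 10    # videos generated per user per day
VIDEO_MB = 1.25              # size of one generated 5 s, 480p video
INDEX_OVERHEAD = 0.25        # indexing/metadata on top of interaction data

profile_tb = USERS * PROFILE_KB * 1e3 / 1e12                            # constant
interactions_tb = USERS * INTERACTIONS_PER_DAY * VIDEO_MB * 1e6 / 1e12  # per day
daily_tb = interactions_tb * (1 + INDEX_OVERHEAD)                       # per day
monthly_pb = daily_tb * 30 / 1e3                                        # per month

print(f"{profile_tb:.1f} TB profile, {interactions_tb:.1f} TB/day interactions, "
      f"{daily_tb:.1f} TB/day total, {monthly_pb:.1f} PB/month")
# 1.0 TB profile, 1250.0 TB/day interactions, 1562.5 TB/day total, 46.9 PB/month
```

Keeping the inputs as named constants makes it easy to re-run the estimate as the user base or video settings change.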

Inference servers estimation

For 100 million DAUs, each user generates around 10 videos daily, so the estimated total requests per second (TRPS) is 11,574.

  • Total requests per day: 100 M × 10 = 1 billion

  • TRPS: 1 billion ÷ 86,400 seconds ≈ 11,574
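The TRPS figure follows directly from the daily volume, as this small sketch shows (86,400 is the number of seconds in a day):

```python
# Requests per second from the daily request volume.
daily_requests = 100_000_000 * 10   # 100 M users × 10 interactions = 1 billion/day
trps = daily_requests / 86_400      # seconds in a day
print(round(trps))                  # 11574
```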
