
Deploying the System Design of a Text-to-Speech Generation Model

Understand the System Design of a text-to-speech generation system, focusing on detailed components like prompt embedding, dynamic contextualizer, and model host management system.

In this lesson, we will focus on the System Design of the text-to-speech generation system, which should take textual input and produce audio speech in different styles aligned with the user’s intent. We will base our System Design on the model selection and training infrastructure discussed in the previous lesson. Specifically, we will estimate the necessary resources for deploying a model similar to the Fish Speech model and design the system architecture required to efficiently run the model in a production environment.

Let’s start the journey with storage estimation:

Storage estimation

Storage estimation includes the model size, user profile and interaction data, and indexing storage. Let’s estimate these resources assuming 100 million daily active users:

  • Model size estimation: Because of the nature of their tasks and the data they process, text-to-speech models generally have fewer parameters than text-to-image generation models. Therefore, we assume the model has 2 billion parameters. (Note: The number of parameters in Fish Speech 1.4 has not been publicly revealed; however, many modern large text-to-speech models have parameter counts on the order of a few billion.) Using the FP16 floating-point format for the model, which takes 2 bytes per parameter, we get a size of 4 GB, as calculated below:
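One way to lay out this arithmetic, using the 2 billion parameters and 2 bytes per FP16 value assumed above:

$$\text{Model size} = 2\times10^{9}\ \text{parameters} \times 2\ \frac{\text{bytes}}{\text{parameter}} = 4\times10^{9}\ \text{bytes} = 4\ \text{GB}$$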

  • User profile data: For storing users’ data, assume that each user’s data takes approximately 10 KB, translating to:
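With 100 million users at roughly 10 KB each, this works out to:

$$\text{User profile storage} = 100\times10^{6}\ \text{users} \times 10\ \text{KB} = 10^{9}\ \text{KB} = 1\ \text{TB}$$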

Note: The model's size (4 GB) and user profile data (1 TB) will remain constant unless the number of users increases. Therefore, we don't include them in the storage required per day or month.

  • User interaction data: If we store each user interaction, the storage per interaction depends on the size of the request and the size of the audio generated by the model. Assume each user interacts with the system 10 times daily and the model generates 10 seconds of audio per interaction, with an audio size of 0.2 MB per interaction. (Note: Audio produced at 128 kbps uses around 1 MB per minute, which translates into approximately 0.02 MB per second, or 0.2 MB for a 10-second clip.) The storage required for user interaction data is calculated as follows:
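Under these assumptions (10 interactions per user per day, 0.2 MB of audio per interaction), the daily interaction storage is:

$$\text{Interaction storage} = 100\times10^{6}\ \text{users} \times 10\ \frac{\text{interactions}}{\text{day}} \times 0.2\ \text{MB} = 2\times10^{8}\ \text{MB} = 200\ \text{TB/day}$$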

Note: The 200 TB of storage is required when the audio is produced at 128 kbps. This number is reduced to 80 TB if the audio is produced at 64 kbps (0.08 MB per 10-second clip).

  • Indexing storage: To make user interactions searchable and efficiently accessible, additional storage for indexing would be required. This might add up to 25% of storage.

So, the total storage requirement per day would be:
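Adding the 25% indexing overhead to the 200 TB of daily interaction data gives:

$$\text{Total storage per day} = 200\ \text{TB} + (0.25 \times 200\ \text{TB}) = 250\ \text{TB/day}$$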

According to the above estimates, the monthly storage requirement is:
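Assuming 30 days per month:

$$\text{Monthly storage} = 250\ \frac{\text{TB}}{\text{day}} \times 30\ \text{days} = 7{,}500\ \text{TB} = 7.5\ \text{PB/month}$$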

Inference servers estimation

At 100 million DAUs, with each user generating around 10 audio speeches daily (1 billion requests per day in total), the estimated Total Requests Per Second (TRPS) is approximately 11,574. We assume that the model generates 160 samples per iteration at a sampling rate of 16 kHz, resulting in 1,000 iterations (C) per 10-second audio clip. Based on our proposed inference formula, the estimated inference time for generating 10 seconds of audio using a 2 billion parameter model is approximately 12.8 milliseconds. This is the estimated inference time when using FP16 precision on an NVIDIA A100 GPU, which delivers 312 TFLOPS at FP16. The calculation steps are shown below:
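Under these assumptions (roughly 2 FLOPs per parameter per iteration, 1,000 iterations, and 312 TFLOPS of FP16 throughput on an A100), the arithmetic works out as:

$$\text{TRPS} = \frac{100\times10^{6}\ \text{users} \times 10\ \text{requests}}{86{,}400\ \text{s}} \approx 11{,}574\ \text{requests/s}$$

$$\text{Inference time} = \frac{2 \times 2\times10^{9}\ \frac{\text{FLOPs}}{\text{iteration}} \times 1{,}000\ \text{iterations}}{312\times10^{12}\ \text{FLOPS}} \approx 0.0128\ \text{s} = 12.8\ \text{ms}$$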

According to this estimation, the QPS for an NVIDIA server with an A100 GPU will be 78.125 (QPS = 1/0.0128), which yields the following number of GPUs:
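Dividing the total request rate by the per-GPU throughput:

$$\text{GPUs required} = \frac{\text{TRPS}}{\text{QPS per GPU}} = \frac{11{,}574}{78.125} \approx 148$$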

We need approximately 148 GPUs to serve 11,574 text-to-speech requests per second.

Bandwidth estimation

To serve 100 million users without interruption, we need to estimate the required ingress and egress bandwidths. The ingress bandwidth depends on the size of the user request. Assuming that each request is approximately 2 KB in size, the ingress bandwidth for 11,574 requests per second can be estimated as follows:
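With 2 KB per request:

$$\text{Ingress bandwidth} = 11{,}574\ \frac{\text{requests}}{\text{s}} \times 2\ \text{KB} \approx 23.1\ \text{MB/s} \approx 0.185\ \text{Gbps}$$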

The egress bandwidth depends on the size of the response. For a 10-second audio clip of 0.2 MB, the total response size is estimated at 0.3 MB, including the associated metadata. This gives us the following egress bandwidth for 11,574 responses per second:
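With 0.3 MB per response:

$$\text{Egress bandwidth} = 11{,}574\ \frac{\text{responses}}{\text{s}} \times 0.3\ \text{MB} \approx 3.47\ \text{GB/s} \approx 27.8\ \text{Gbps}$$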

So, we have the following number of resource estimations for inference:

  • Storage required: 250 TB/day

  • GPUs required to process 11,574 ...
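The estimates in this lesson can be recomputed in a few lines of code. The sketch below is illustrative, not part of any real system: the variable names are made up, and the constants are exactly the assumptions stated above (100 million DAUs, 10 interactions per user per day, 0.2 MB per 10-second clip, a 2-billion-parameter model at roughly 2 FLOPs per parameter per iteration, and 312 TFLOPS of FP16 throughput per A100).

```python
# Back-of-the-envelope recomputation of the lesson's resource estimates.
# All constants are the assumptions stated in the lesson; the variable
# names are illustrative only.

DAU = 100_000_000                 # daily active users
INTERACTIONS_PER_USER = 10        # requests per user per day
AUDIO_MB_PER_REQUEST = 0.2        # 10 s of audio at ~128 kbps
SECONDS_PER_DAY = 86_400

# Storage per day: interaction audio plus 25% indexing overhead.
interaction_tb = DAU * INTERACTIONS_PER_USER * AUDIO_MB_PER_REQUEST / 1e6  # MB -> TB
total_storage_tb = interaction_tb * 1.25

# Inference: ~2 FLOPs per parameter per iteration, 1,000 iterations
# (160 samples/iteration at 16 kHz for a 10-second clip).
PARAMS = 2e9
ITERATIONS = 1_000
A100_FP16_FLOPS = 312e12
inference_s = round(2 * PARAMS * ITERATIONS / A100_FP16_FLOPS, 4)  # ~0.0128 s
qps_per_gpu = 1 / inference_s                                      # ~78.125
trps = DAU * INTERACTIONS_PER_USER / SECONDS_PER_DAY               # ~11,574
gpus = trps / qps_per_gpu                                          # ~148 (rounded, as in the lesson)

# Bandwidth: 2 KB ingress per request, 0.3 MB egress per response.
ingress_mb_s = trps * 2 / 1e3
egress_gb_s = trps * 0.3 / 1e3

print(f"storage: {total_storage_tb:.0f} TB/day, TRPS: {trps:.0f}, "
      f"GPUs: {gpus:.0f}, ingress: {ingress_mb_s:.1f} MB/s, "
      f"egress: {egress_gb_s:.2f} GB/s")
```

Running it reproduces the headline numbers above: 250 TB/day of storage, roughly 11,574 requests per second, about 148 GPUs, and ingress/egress bandwidths of about 23 MB/s and 3.5 GB/s.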