
Deploying the System Design of a Text-to-Speech Generation Model

Understand the System Design of a text-to-speech generation system, focusing on detailed components like prompt embedding, dynamic contextualizer, and model host management system.

In this lesson, we will focus on the System Design of the text-to-speech generation system, which should take textual input and produce audio speech in different styles aligned with the user’s intent. We will base our System Design on the model selection and training infrastructure discussed in the previous lesson. Specifically, we will estimate the necessary resources for deploying a model similar to the Fish Speech model and design the system architecture required to efficiently run the model in a production environment.

Let’s start the journey with storage estimation:

Storage estimation

Storage estimation includes the model size, user profile and interaction data, and indexing storage. Let’s estimate these resources assuming 100 million daily active users:

  • Model size estimation: Because of the nature of their tasks and the data they process, text-to-speech models generally have fewer parameters than text-to-image generation models. Therefore, we assume the model has 2 billion parameters. (Note: The number of parameters in Fish Speech 1.4 has not been publicly revealed; however, many modern large text-to-speech models have parameter counts on the order of a few billion.) Using the FP16 floating-point format for the model, which takes 2 bytes per parameter, we get a size of 4 GB, as calculated below:
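One way to lay out this arithmetic, using the 2 billion parameters and 2 bytes per FP16 value assumed above:

$$\text{Model size} = 2\times10^{9}\ \text{parameters} \times 2\ \frac{\text{bytes}}{\text{parameter}} = 4\times10^{9}\ \text{bytes} = 4\ \text{GB}$$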

  • User profile data: For storing users’ data, assume that each user’s data takes approximately 10 KB, translating to:
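With 100 million users at roughly 10 KB each, this works out to:

$$\text{User profile storage} = 100\times10^{6}\ \text{users} \times 10\ \text{KB} = 10^{9}\ \text{KB} = 1\ \text{TB}$$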

Note: The model's size (4 GB) and user profile data (1 TB) will remain constant unless the number of users increases. Therefore, we don't include them in the storage required per day or month.

  • User interaction data: If we store each user interaction, the storage per interaction depends on the size of the request and the size of the audio generated by the model. Assume each user interacts with the system 10 times daily and the model generates 10 seconds of audio per interaction, with an audio size of 0.2 MB per interaction. (Note: Audio produced at 128 kbps uses around 1 MB per minute, which translates into approximately 0.02 MB per second, or 0.2 MB for a 10-second clip.) The storage required for user interaction data is calculated as follows:
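Under these assumptions (10 interactions per user per day, 0.2 MB of audio per interaction), the daily interaction storage is:

$$\text{Interaction storage} = 100\times10^{6}\ \text{users} \times 10\ \frac{\text{interactions}}{\text{day}} \times 0.2\ \text{MB} = 2\times10^{8}\ \text{MB} = 200\ \text{TB/day}$$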

Note: The 200 TB of storage is required when the audio is produced at 128 kbps. This number is reduced to 80 TB if the audio is produced at 64 kbps (0.08 MB per 10-second clip).

  • Indexing storage: To make user interactions searchable and efficiently accessible, additional storage for indexing would be required. This might add up to 25% of storage.

So, the total storage requirement per day would be:
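Adding the 25% indexing overhead to the 200 TB of daily interaction data gives:

$$\text{Total storage per day} = 200\ \text{TB} + (0.25 \times 200\ \text{TB}) = 250\ \text{TB/day}$$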

According to the above estimates, the monthly storage requirement is:
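Assuming 30 days per month:

$$\text{Monthly storage} = 250\ \frac{\text{TB}}{\text{day}} \times 30\ \text{days} = 7{,}500\ \text{TB} = 7.5\ \text{PB/month}$$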

Inference servers estimation

At 100 million DAUs, with each user generating around 10 audio speeches daily (1 billion requests per day in total), the estimated Total Requests Per Second (TRPS) is approximately 11,574. We assume that the model generates 160 samples per iteration at a sampling rate of 16 kHz, resulting in 1,000 iterations (C) per 10-second audio clip. Based on our proposed inference formula, the estimated inference time for generating 10 seconds of audio using a 2 billion parameter model is approximately 12.8 milliseconds. This is the estimated inference time when using FP16 precision on an NVIDIA A100 GPU, which delivers 312 TFLOPS at FP16. The calculation steps are shown below:
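Under these assumptions (roughly 2 FLOPs per parameter per iteration, 1,000 iterations, and 312 TFLOPS of FP16 throughput on an A100), the arithmetic works out as:

$$\text{TRPS} = \frac{100\times10^{6}\ \text{users} \times 10\ \text{requests}}{86{,}400\ \text{s}} \approx 11{,}574\ \text{requests/s}$$

$$\text{Inference time} = \frac{2 \times 2\times10^{9}\ \frac{\text{FLOPs}}{\text{iteration}} \times 1{,}000\ \text{iterations}}{312\times10^{12}\ \text{FLOPS}} \approx 0.0128\ \text{s} = 12.8\ \text{ms}$$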

According to this estimation, the QPS for an NVIDIA server with an A100 GPU will be 78.125 (QPS = 1/0.0128), which yields the following number of GPUs:
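Dividing the total request rate by the per-GPU throughput:

$$\text{GPUs required} = \frac{\text{TRPS}}{\text{QPS per GPU}} = \frac{11{,}574}{78.125} \approx 148$$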

We need approximately 148 GPUs to serve 11,574 text-to-speech requests per second.

Bandwidth estimation

To serve 100 million users without interruption, we need to estimate the required ingress and egress bandwidths. The ingress bandwidth depends on the size of the user request. Assuming that each request is approximately 2 KB in size, the ingress bandwidth for 11,574 requests per second can be estimated as follows:
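With 2 KB per request:

$$\text{Ingress bandwidth} = 11{,}574\ \frac{\text{requests}}{\text{s}} \times 2\ \text{KB} \approx 23.1\ \text{MB/s} \approx 0.185\ \text{Gbps}$$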

The egress bandwidth depends on the size of the response. For a 10-second audio clip of 0.2 MB, the total response size is estimated at 0.3 MB, including the associated metadata. This gives us the following egress bandwidth for 11,574 responses per second:
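With 0.3 MB per response:

$$\text{Egress bandwidth} = 11{,}574\ \frac{\text{responses}}{\text{s}} \times 0.3\ \text{MB} \approx 3.47\ \text{GB/s} \approx 27.8\ \text{Gbps}$$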

So, we have the following number of resource estimations for inference:

  • Storage required: 250 TB/day

  • GPUs required to process 11,574 ...
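The estimates in this lesson can be recomputed in a few lines of code. The sketch below is illustrative, not part of any real system: the variable names are made up, and the constants are exactly the assumptions stated above (100 million DAUs, 10 interactions per user per day, 0.2 MB per 10-second clip, a 2-billion-parameter model at roughly 2 FLOPs per parameter per iteration, and 312 TFLOPS of FP16 throughput per A100).

```python
# Back-of-the-envelope recomputation of the lesson's resource estimates.
# All constants are the assumptions stated in the lesson; the variable
# names are illustrative only.

DAU = 100_000_000                 # daily active users
INTERACTIONS_PER_USER = 10        # requests per user per day
AUDIO_MB_PER_REQUEST = 0.2        # 10 s of audio at ~128 kbps
SECONDS_PER_DAY = 86_400

# Storage per day: interaction audio plus 25% indexing overhead.
interaction_tb = DAU * INTERACTIONS_PER_USER * AUDIO_MB_PER_REQUEST / 1e6  # MB -> TB
total_storage_tb = interaction_tb * 1.25

# Inference: ~2 FLOPs per parameter per iteration, 1,000 iterations
# (160 samples/iteration at 16 kHz for a 10-second clip).
PARAMS = 2e9
ITERATIONS = 1_000
A100_FP16_FLOPS = 312e12
inference_s = round(2 * PARAMS * ITERATIONS / A100_FP16_FLOPS, 4)  # ~0.0128 s
qps_per_gpu = 1 / inference_s                                      # ~78.125
trps = DAU * INTERACTIONS_PER_USER / SECONDS_PER_DAY               # ~11,574
gpus = trps / qps_per_gpu                                          # ~148 (rounded, as in the lesson)

# Bandwidth: 2 KB ingress per request, 0.3 MB egress per response.
ingress_mb_s = trps * 2 / 1e3
egress_gb_s = trps * 0.3 / 1e3

print(f"storage: {total_storage_tb:.0f} TB/day, TRPS: {trps:.0f}, "
      f"GPUs: {gpus:.0f}, ingress: {ingress_mb_s:.1f} MB/s, "
      f"egress: {egress_gb_s:.2f} GB/s")
```

Running it reproduces the headline numbers above: 250 TB/day of storage, roughly 11,574 requests per second, about 148 GPUs, and ingress/egress bandwidths of about 23 MB/s and 3.5 GB/s.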