In this lesson, we will focus on the System Design of the text-to-speech generation system, which should take textual input and produce audio speech in different styles aligned with the user’s intent. We will base our System Design on the model selection and training infrastructure discussed in the previous lesson. Specifically, we will estimate the resources needed to deploy a model similar to the Fish Speech model and design the system architecture required to run that model efficiently in a production environment. To reduce memory and processing requirements, we must use model quantization, pruning, and lower floating-point precision.

Let’s start the journey with the model size estimation.

Text-to-speech model size estimation

Due to the complexity of the data they process and the nature of their tasks, text-to-speech models generally have fewer parameters than text-to-image generation models. Therefore, we assume that the number of parameters is 2 billion. (Note: The number of parameters in Fish Speech 1.4 has not yet been revealed. However, many modern text-to-speech models, such as those based on Tacotron or FastSpeech, tend to have parameters in the range of 2 billion.)

Let’s consider the FP32 floating-point data format for the model. Each FP32 parameter occupies 4 bytes, so 2 billion parameters give a model size of 2 billion × 4 bytes = 8 GB.
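
The same estimate can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original lesson: the helper name `model_size_gb` and the FP16 and INT8 rows are illustrative assumptions, added to show how lower precision or quantization would shrink the footprint. It counts only the stored weights and takes 1 GB as 10^9 bytes.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}

# Assumed parameter count from this lesson (2 billion).
NUM_PARAMS = 2_000_000_000


def model_size_gb(num_params: int, precision: str = "FP32") -> float:
    """Return the weight storage in GB (1 GB = 10**9 bytes).

    Hypothetical helper for illustration; it ignores activations,
    caches, and any runtime overhead.
    """
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    size = model_size_gb(NUM_PARAMS, precision)
    print(f"{precision}: {size:.0f} GB")
# Prints:
# FP32: 8 GB
# FP16: 4 GB
# INT8: 2 GB
```

Under these assumptions, halving the precision halves the memory needed for the weights, which is why lower floating-point precision and quantization matter for deployment.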
