...
Back-of-the-envelope Calculations for Model Deployment
Understand the back-of-the-envelope calculations required for estimating various types of resources for model deployment.
Previously, we devised a formula to estimate the model training time and explored its application in inference tasks. Building on that foundation, this lesson takes a practical approach to resource estimation for deploying large language models (LLMs). We’ll perform quick, back-of-the-envelope calculations to assess the key resources required for deployment. Here’s how we’ll break it down:
Storage estimation, including:
Model storage
User profile data
User interaction data
Indexing storage
Inference server estimation
Network bandwidth estimation
With these calculations, we aim to estimate the resource requirements for deploying LLMs. Let’s start with resource estimation:
Resource estimation
Resource estimation ensures the infrastructure can handle expected workloads efficiently while remaining cost-effective. It helps determine the computing power, storage, and network requirements for scalability, reliability, and performance. Accurate estimation prevents over-provisioning, which increases costs, and under-provisioning, which can lead to system failures or poor user experiences.
Note: Estimating resources for deploying models requires making some initial assumptions. For this course, we assume 100 million daily active users (DAU), which allows us to estimate key factors like storage, server capacity for inference, and bandwidth requirements; these numbers will be refined as the system scales.
Storage requirements
The storage requirements include various elements, such as the model’s storage, user profile data, and user interaction data with the model. Depending on the system’s requirements, we may also need to store additional elements like model metadata, context information, logs, prompt templates, and configuration files. While a fully operational system involves managing numerous data components, we will focus on the most essential ones for simplicity.
Let’s start with estimating the model’s storage:
Model storage
Estimating the storage for an ML model is crucial for ensuring it fits within the resource constraints of deployment environments. It helps optimize storage costs, manage inference performance, and plan for model compression. Additionally, accurate estimation prevents issues during deployment due to insufficient storage. The model storage depends on the number of parameters and data precision (data types) used for parameters.
Data precision can be FP32 (4 bytes per parameter), FP16 (2 bytes per parameter), or a lower-precision format such as INT8 (1 byte per parameter). The model storage is therefore:

$$\text{Model storage} = \text{Number of parameters} \times \text{Bytes per parameter}$$

For example, for a 3-billion-parameter model using FP16 precision, the storage required would be 6 GB, as shown below:

$$3 \times 10^9 \ \text{parameters} \times 2 \ \text{bytes} = 6 \times 10^9 \ \text{bytes} = 6 \ \text{GB}$$
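As a quick sanity check, here is a minimal Python sketch of this calculation (the function name and the precision table are illustrative; the byte sizes per precision are the standard values stated above):

```python
# Bytes per parameter for common data precisions.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}

def model_storage_gb(num_params: float, precision: str) -> float:
    """Estimate model storage in GB: parameters x bytes per parameter."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# A 3B-parameter model:
print(model_storage_gb(3e9, "FP16"))  # 6.0 GB
print(model_storage_gb(3e9, "FP32"))  # 12.0 GB
```

Applying the same function to other parameter counts reproduces the FP16 and FP32 columns in the table below.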
In this course, we will use the above formula to estimate the storage required for different models. The following table details the sizes of different models tailored to different tasks. We compare two common data precisions, FP16 and FP32, which offer a balanced trade-off between model size and accuracy:
| Models | Storage Required for FP16 | Storage Required for FP32 |
| --- | --- | --- |
| Llama 3.2 (3B) | 6 GB | 12 GB |
| Stable Diffusion 3.5 Large (8.1B) | 16.2 GB | 32.4 GB |
| Fish Speech (2B) | 4 GB | 8 GB |
| Mochi 1 + T5-XXL Encoder (14.7B) | 29.4 GB | 58.8 GB |
The difference between model sizes using FP16 and FP32 can be visualized in the following chart:

[Chart: FP16 vs. FP32 storage for the models listed above]
Note: Higher precision ensures greater accuracy, but it increases the model’s size, potentially slowing down performance. We can trade off some accuracy to optimize our system for improved performance.
User profile data
The storage required for user profile data depends on the number of users and the storage required per user.
For instance, for storing users’ metadata, assume that each user’s data takes approximately 10 KB. For 100 million users, this translates to:

$$100 \times 10^6 \ \text{users} \times 10 \ \text{KB} = 1 \ \text{TB}$$
User interaction data
If we store data on each user interaction, the total storage will depend on the number of users, the number of daily interactions per user, and the size of each interaction.

Assume that each user interacts 10 times daily with the system, consuming 2 KB of space per interaction. For 100 M users, this storage requirement per day would be:

$$100 \times 10^6 \ \text{users} \times 10 \ \text{interactions} \times 2 \ \text{KB} = 2 \ \text{TB/day}$$
Indexing storage
We would need additional storage for indexing the user interaction data for fast retrieval. Let’s assume an average storage increase of 25% for indexing.
Note: The model’s size and user profile data remain relatively constant and comparatively small, so we don’t include them in the indexing storage estimation.

So, the total storage requirement for user interaction data per day would be:

$$2 \ \text{TB} + (0.25 \times 2 \ \text{TB}) = 2.5 \ \text{TB/day}$$
Typically, production-grade setups require redundant storage to ensure high availability and reliability. Depending on the data’s sensitivity, multiple variants may be saved in different locations across distributed systems. Therefore, the storage requirements for real systems may multiply.
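For instance, assuming a replication factor of 3 (a common default in distributed storage systems, though the actual factor depends on durability requirements), the daily requirement would triple:

$$2.5 \ \text{TB/day} \times 3 = 7.5 \ \text{TB/day}$$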
The following calculator estimates the storage for the different types of data. Changing the input values (rows 1, 2, 4, 5, and 7) changes the computed rows (3, 6, and 8) accordingly.
Storage Estimation for Different Data Types
| | A | B | C |
| --- | --- | --- | --- |
| 1 | No. of users | 100 | Million |
| 2 | Size of each user’s data | 10 | KB |
| 3 | User profile data | 1 | TB |
| 4 | No. of user interactions per user | 10 | Per day |
| 5 | Space taken by each interaction | 2 | KB |
| 6 | Size of the total user interaction data | 2 | TB |
| 7 | Additional indexing storage | 25 | % |
| 8 | Total estimated storage per day | 2.5 | TB |
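For convenience, here is a minimal Python sketch of the same calculator. The function name and the optional replication factor are illustrative assumptions; the defaults mirror the inputs above:

```python
def daily_storage_tb(
    users: float = 100e6,             # daily active users
    profile_kb: float = 10,           # per-user profile data (KB)
    interactions_per_user: int = 10,  # interactions per user per day
    interaction_kb: float = 2,        # size of each interaction (KB)
    indexing_pct: float = 25,         # indexing overhead (% of interaction data)
    replication_factor: int = 1,      # redundant copies (assumption; see note above)
) -> dict:
    """Back-of-the-envelope daily storage estimate in TB (1 TB = 1e9 KB)."""
    kb_per_tb = 1e9
    profile_tb = users * profile_kb / kb_per_tb
    interaction_tb = users * interactions_per_user * interaction_kb / kb_per_tb
    indexing_tb = interaction_tb * indexing_pct / 100
    total_tb = (interaction_tb + indexing_tb) * replication_factor
    return {
        "user_profile_tb": profile_tb,  # one-time cost, not accrued daily
        "interaction_tb": interaction_tb,
        "indexing_tb": indexing_tb,
        "total_daily_tb": total_tb,
    }

print(daily_storage_tb())
# {'user_profile_tb': 1.0, 'interaction_tb': 2.0,
#  'indexing_tb': 0.5, 'total_daily_tb': 2.5}
```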
According to the above estimates, the monthly storage requirement is:

$$2.5 \ \text{TB/day} \times 30 \ \text{days} = 75 \ \text{TB/month}$$
These calculations are not set in stone; they are rough estimates that serve as a starting point for the design process. The focus here is on outlining the overall approach to System Design, providing a high-level view of the key considerations. These numbers can change significantly as we dive deeper into the actual designs.
Inference server estimation
To ensure uninterrupted ...