Back-of-the-envelope Calculations for the Model Deployment

Understand the back-of-the-envelope calculations required for estimating various types of resources for model deployment.

Previously, we devised a formula to estimate the model training time and explored its application in inference tasks. Building on that foundation, this lesson takes a practical approach to resource estimation for deploying large language models (LLMs). We’ll perform quick, back-of-the-envelope calculations to assess the key resources required for deployment. Here’s how we’ll break it down:

  • Storage estimation, including:

    • Model storage

    • User profile data

    • User interaction data

    • Indexing storage

  • Inference server estimation

  • Network bandwidth estimation

With these calculations, we aim to estimate the resource requirements for deploying LLMs. Let’s start with resource estimation:

Resource estimation

Resource estimation ensures the infrastructure can handle expected workloads efficiently while remaining cost-effective. It helps determine the computing power, storage, and network requirements for scalability, reliability, and performance. Accurate estimation prevents over-provisioning, which increases costs, and under-provisioning, which can lead to system failures or poor user experiences.

Note: Estimating resources for deploying models requires making some initial assumptions. For this course, we assume 100 million daily active users (DAU), which allows us to estimate key factors like storage, server capacity for inference, and bandwidth requirements. These numbers will be refined as the system scales.

Storage requirements

The storage requirements include various elements, such as the model’s storage, user profile data, and user interaction data with the model. Depending on the system’s requirements, we may also need to store additional elements like model metadata, context information, logs, prompt templates, and configuration files. While a fully operational system involves managing numerous data components, we will focus on the most essential ones for simplicity.

Let’s start with estimating the model’s storage:

Model storage

Estimating the storage for an ML model is crucial for ensuring it fits within the resource constraints of deployment environments. It helps optimize storage costs, manage inference performance, and plan for model compression. Additionally, accurate estimation prevents deployment failures due to insufficient storage. A model’s storage footprint depends on its number of parameters and the data precision (data type) used to store them.

Common data precisions include:

  • FP64 (64-bit): Often known as double-precision floating point. It uses 64 bits per value, offering high precision and a wide range of values, and is primarily used in scientific computations and applications requiring extreme accuracy.

  • FP32 (32-bit): Single-precision floating point, widely used in machine learning and graphics. It balances precision and computational efficiency, is sufficient for most deep learning tasks, and is often the default format.

  • FP16 (16-bit): Also known as half-precision, this format reduces memory usage and speeds up computation by representing values with lower precision. It is commonly used for training and inference, especially on specialized hardware like GPUs.

  • Quantized (8-bit): Represents values using 8 bits, significantly reducing memory and computation requirements. While it sacrifices some precision, it is effective for deploying models in resource-constrained environments like mobile devices.

The precision determines the size of one parameter. For example, with FP16 (16-bit) precision, one parameter takes 2 bytes. The following formula estimates the storage required for a model:

Model storage = Number of parameters × Size of one parameter (in bytes)

For example, a 3-billion-parameter model using FP16 precision requires 6 GB of storage, as shown below:

Model storage = 3 × 10⁹ parameters × 2 bytes = 6 × 10⁹ bytes = 6 GB
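This formula is easy to script for quick checks. Below is a minimal Python sketch (the function name and the 1 GB = 10⁹ bytes convention are our assumptions) that reproduces the numbers in this lesson:

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP64": 8, "FP32": 4, "FP16": 2, "INT8": 1}

def model_storage_gb(num_params: float, precision: str) -> float:
    """Model storage in GB = number of parameters x bytes per parameter."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9  # 1 GB = 10^9 bytes

# A 3B-parameter model at FP16 needs about 6 GB, matching the example above.
print(model_storage_gb(3e9, "FP16"))    # 6.0
print(model_storage_gb(8.1e9, "FP32"))  # 32.4 (Stable Diffusion 3.5 Large)
```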

In this course, we will use the above formula to estimate the storage required for different models. The following table details the sizes of several models tailored for different tasks, comparing the intermediate precisions FP16 and FP32, which offer a balanced trade-off between model size and accuracy:

| Models | Storage Required for FP16 | Storage Required for FP32 |
|---|---|---|
| Llama 3.2 (3B) | 6 GB | 12 GB |
| Stable Diffusion 3.5 Large (8.1B) | 16.2 GB | 32.4 GB |
| Fish Speech (2B) | 4 GB | 8 GB |
| Mochi 1 + T5-XXL Encoder (14.7B) | 29.4 GB | 58.8 GB |

The difference between model sizes using FP16 and FP32 can be visualized in the following chart:

Note: Higher precision ensures greater accuracy, but it increases the model’s size, potentially slowing down performance. We can trade off some accuracy to optimize our system for improved performance.

User profile data

The storage required for user profile data depends on the number of users and the storage required per user.

For instance, for storing users’ metadata, assume that each user’s data takes approximately 10 KB. For 100 million users, this translates to:

User profile data = 100,000,000 × 10 KB = 10⁹ KB = 1 TB

User interaction data

If we store data on each user interaction, the total storage will depend on the number of users, the number of daily interactions per user, and the size of each interaction.

Assume that each user interacts with the system 10 times daily, consuming 2 KB of space per interaction. For 100 M users, the storage requirement per day would be:

Interaction data = 100,000,000 × 10 × 2 KB = 2 × 10⁹ KB = 2 TB per day
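Both of these estimates are straight multiplications. The short Python sketch below (the variable names are ours; the figures are this lesson’s assumptions) reproduces them:

```python
# Assumed inputs from this lesson: 100 M users, 10 KB per profile,
# 10 interactions per user per day, 2 KB per interaction.
USERS = 100e6
PROFILE_KB = 10
INTERACTIONS_PER_USER = 10
INTERACTION_KB = 2

KB = 1e3  # bytes per KB (decimal convention, matching the lesson)

profile_bytes = USERS * PROFILE_KB * KB
daily_interaction_bytes = USERS * INTERACTIONS_PER_USER * INTERACTION_KB * KB

print(profile_bytes / 1e12)            # 1.0 -> 1 TB of user profile data
print(daily_interaction_bytes / 1e12)  # 2.0 -> 2 TB of interaction data per day
```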

Indexing storage

We would need additional storage for indexing the user interaction data for fast retrieval. Let’s assume an average storage increase of 25% for indexing.

Note: The model’s size and the user profile data remain relatively static and small compared to the interaction data. Therefore, we don’t include them in the indexing storage estimation.

So, the total storage requirement for user interaction data per day would be:

Total daily storage = 2 TB + (25% × 2 TB) = 2.5 TB per day

Total storage consists of the user profile data, interaction data, and indexing storage

Typically, production-grade setups require redundant storage to ensure high availability and reliability. Depending on the data’s sensitivity, multiple variants may be saved in different locations across distributed systems. Therefore, the storage requirements for real systems may multiply.

The following calculator estimates the storage for the different types of data. You can change the input values and see the change reflected in the computed totals.

Storage Estimation for Different Data Types

| # | Item | Value | Unit |
|---|---|---|---|
| 1 | No. of users (input) | 100 | Million |
| 2 | Size of each user’s data (input) | 10 | KB |
| 3 | User profile data (computed) | 1 | TB |
| 4 | No. of user interactions per day (input) | 10 | Per day |
| 5 | Space taken by each interaction (input) | 2 | KB |
| 6 | Size of the total user interaction data (computed) | 2 | TB |
| 7 | Additional indexing storage (input) | 25 | % |
| 8 | Total estimated storage per day (computed) | 2.5 | TB |
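For readers who want to experiment offline, here is a minimal Python version of the calculator above. The replication comment at the end reflects the earlier point about redundant storage; the 3x factor is only an illustrative assumption:

```python
# Inputs: the input cells from the calculator above.
users = 100e6               # no. of users
profile_kb = 10             # size of each user's data, in KB
interactions_per_day = 10   # user interactions per day
interaction_kb = 2          # space taken by each interaction, in KB
indexing_overhead = 0.25    # additional indexing storage (25%)

KB, TB = 1e3, 1e12  # decimal units, matching the lesson

profile_tb = users * profile_kb * KB / TB
interaction_tb = users * interactions_per_day * interaction_kb * KB / TB
daily_total_tb = interaction_tb * (1 + indexing_overhead)

print(f"User profile data:        {profile_tb:.1f} TB")      # 1.0 TB
print(f"Interaction data per day: {interaction_tb:.1f} TB")  # 2.0 TB
print(f"Total storage per day:    {daily_total_tb:.1f} TB")  # 2.5 TB
# Note: production systems often replicate data (e.g., 3x) for redundancy,
# which would multiply these figures accordingly.
```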

According to the above estimates, the monthly storage requirement is:

Monthly storage = 2.5 TB/day × 30 days = 75 TB

These calculations are not set in stone; they are rough estimates that serve as a starting point for the design process. The focus here is on outlining the overall approach to System Design, providing a high-level view of the key considerations. These numbers can change significantly as we dive deeper into the actual designs.

Inference server estimation

To ensure uninterrupted ...