Back-of-the-envelope Calculations for the Model Deployment

Understand the back-of-the-envelope calculations required for estimating various types of resources for model deployment.

Previously, we devised a formula to estimate the model training time and explored its application in inference tasks. Building on that foundation, this lesson takes a practical approach to resource estimation for deploying large language models (LLMs). We’ll perform quick, back-of-the-envelope calculations to assess the key resources required for deployment. Here’s how we’ll break it down:

  • Storage estimation, including:

    • Model storage

    • User profile data

    • User interaction data

    • Indexing storage

  • Inference server estimation

  • Network bandwidth estimation

With these calculations, we aim to estimate the resource requirements for deploying LLMs. Let’s start with resource estimation:

Resource estimation

Resource estimation ensures the infrastructure can handle expected workloads efficiently while remaining cost-effective. It helps determine the computing power, storage, and network requirements for scalability, reliability, and performance. Accurate estimation prevents over-provisioning, which increases costs, and under-provisioning, which can lead to system failures or poor user experiences.

Note: Estimating resources for deploying models requires making some initial assumptions. For this course, we assume 100 million daily active users (DAU), which allows us to estimate key factors like storage, server capacity for inference, and bandwidth requirements. These numbers will be refined as the system scales.

Storage requirements

The storage requirements include various elements, such as the model’s storage, user profile data, and user interaction data with the model. Depending on the system’s requirements, we may also need to store additional elements like model metadata, context information, logs, prompt templates, and configuration files. While a fully operational system involves managing numerous data components, we will focus on the most essential ones for simplicity.

Let’s start with estimating the model’s storage:

Model storage

Estimating the storage for an ML model is crucial for ensuring it fits within the resource constraints of deployment environments. It helps optimize storage costs, manage inference performance, and plan for model compression. Additionally, accurate estimation prevents deployment failures due to insufficient storage. A model’s storage footprint depends on its number of parameters and the data precision (data type) used to store them.

Common data precisions include:

  • FP64 (64-bit): Often known as double-precision floating point. It uses 64 bits per value, offering high precision and a wide range of values, and is primarily used in scientific computations and applications requiring extreme accuracy.

  • FP32 (32-bit): Single-precision floating point, widely used in machine learning and graphics. It balances precision and computational efficiency, is sufficient for most deep learning tasks, and is often the default format.

  • FP16 (16-bit): Also known as half-precision, this format reduces memory usage and speeds up computation by representing values with lower precision. It is commonly used for training and inference, especially on specialized hardware like GPUs.

  • Quantized (8-bit): Represents values using 8 bits, significantly reducing memory and computation requirements. While it sacrifices some precision, it is effective for deploying models in resource-constrained environments like mobile devices.

The precision determines the size of one parameter. For example, with FP16 (16-bit) precision, one parameter takes 2 bytes. The following formula estimates the storage required for a model:

Model storage = Number of parameters × Size of one parameter (in bytes)

For example, a 3-billion-parameter model using FP16 precision requires 6 GB of storage, as shown below:

Model storage = 3 × 10⁹ parameters × 2 bytes = 6 × 10⁹ bytes = 6 GB
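This formula is easy to script for quick checks. Below is a minimal Python sketch (the function name and the 1 GB = 10⁹ bytes convention are our assumptions) that reproduces the numbers in this lesson:

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP64": 8, "FP32": 4, "FP16": 2, "INT8": 1}

def model_storage_gb(num_params: float, precision: str) -> float:
    """Model storage in GB = number of parameters x bytes per parameter."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9  # 1 GB = 10^9 bytes

# A 3B-parameter model at FP16 needs about 6 GB, matching the example above.
print(model_storage_gb(3e9, "FP16"))    # 6.0
print(model_storage_gb(8.1e9, "FP32"))  # 32.4 (Stable Diffusion 3.5 Large)
```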

In this course, we will use the above formula to estimate the storage required for different models. The following table details the sizes of several models tailored for different tasks, comparing the intermediate precisions FP16 and FP32, which offer a balanced trade-off between model size and accuracy:

| Models | Storage Required for FP16 | Storage Required for FP32 |
|---|---|---|
| Llama 3.2 (3B) | 6 GB | 12 GB |
| Stable Diffusion 3.5 Large (8.1B) | 16.2 GB | 32.4 GB |
| Fish Speech (2B) | 4 GB | 8 GB |
| Mochi 1 + T5-XXL Encoder (14.7B) | 29.4 GB | 58.8 GB |

The difference between model sizes using FP16 and FP32 can be visualized in the following chart:

Note: Higher precision ensures greater accuracy, but it increases the model’s size, potentially slowing down performance. We can trade off some accuracy to optimize our system for improved performance.

User profile data

The storage required for user profile data depends on the number of users and the storage required per user.

For instance, for storing users’ metadata, assume that each user’s data takes approximately 10 KB. For 100 million users, this translates to:

User profile data = 100,000,000 × 10 KB = 10⁹ KB = 1 TB

User interaction data

If we store data on each user interaction, the total storage will depend on the number of users, the number of daily interactions per user, and the size of each interaction.

Assume that each user interacts with the system 10 times daily, consuming 2 KB of space per interaction. For 100 M users, the storage requirement per day would be:

Interaction data = 100,000,000 × 10 × 2 KB = 2 × 10⁹ KB = 2 TB per day
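Both of these estimates are straight multiplications. The short Python sketch below (the variable names are ours; the figures are this lesson’s assumptions) reproduces them:

```python
# Assumed inputs from this lesson: 100 M users, 10 KB per profile,
# 10 interactions per user per day, 2 KB per interaction.
USERS = 100e6
PROFILE_KB = 10
INTERACTIONS_PER_USER = 10
INTERACTION_KB = 2

KB = 1e3  # bytes per KB (decimal convention, matching the lesson)

profile_bytes = USERS * PROFILE_KB * KB
daily_interaction_bytes = USERS * INTERACTIONS_PER_USER * INTERACTION_KB * KB

print(profile_bytes / 1e12)            # 1.0 -> 1 TB of user profile data
print(daily_interaction_bytes / 1e12)  # 2.0 -> 2 TB of interaction data per day
```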

Indexing storage

We would need additional storage for indexing the user interaction data for fast retrieval. Let’s assume an average storage increase of 25% for indexing.

Note: The model’s size and the user profile data remain relatively static and small compared to the interaction data. Therefore, we don’t include them in the indexing storage estimation.

So, the total storage requirement for user interaction data per day would be:

Total daily storage = 2 TB + (25% × 2 TB) = 2.5 TB per day

Total storage consists of the user profile data, interaction data, and indexing storage

Typically, production-grade setups require redundant storage to ensure high availability and reliability. Depending on the data’s sensitivity, multiple variants may be saved in different locations across distributed systems. Therefore, the storage requirements for real systems may multiply.

The following calculator estimates the storage for the different types of data. You can change the input values and see the change reflected in the computed totals.

Storage Estimation for Different Data Types

| # | Item | Value | Unit |
|---|---|---|---|
| 1 | No. of users (input) | 100 | Million |
| 2 | Size of each user’s data (input) | 10 | KB |
| 3 | User profile data (computed) | 1 | TB |
| 4 | No. of user interactions per day (input) | 10 | Per day |
| 5 | Space taken by each interaction (input) | 2 | KB |
| 6 | Size of the total user interaction data (computed) | 2 | TB |
| 7 | Additional indexing storage (input) | 25 | % |
| 8 | Total estimated storage per day (computed) | 2.5 | TB |
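For readers who want to experiment offline, here is a minimal Python version of the calculator above. The replication comment at the end reflects the earlier point about redundant storage; the 3x factor is only an illustrative assumption:

```python
# Inputs: the input cells from the calculator above.
users = 100e6               # no. of users
profile_kb = 10             # size of each user's data, in KB
interactions_per_day = 10   # user interactions per day
interaction_kb = 2          # space taken by each interaction, in KB
indexing_overhead = 0.25    # additional indexing storage (25%)

KB, TB = 1e3, 1e12  # decimal units, matching the lesson

profile_tb = users * profile_kb * KB / TB
interaction_tb = users * interactions_per_day * interaction_kb * KB / TB
daily_total_tb = interaction_tb * (1 + indexing_overhead)

print(f"User profile data:        {profile_tb:.1f} TB")      # 1.0 TB
print(f"Interaction data per day: {interaction_tb:.1f} TB")  # 2.0 TB
print(f"Total storage per day:    {daily_total_tb:.1f} TB")  # 2.5 TB
# Note: production systems often replicate data (e.g., 3x) for redundancy,
# which would multiply these figures accordingly.
```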

According to the above estimates, the monthly storage requirement is:

Monthly storage = 2.5 TB/day × 30 days = 75 TB

These calculations are not set in stone; they are rough estimates that serve as a starting point for the design process. The focus here is on outlining the overall approach to System Design, providing a high-level view of the key considerations. These numbers can change significantly as we dive deeper into the actual designs.

Inference server estimation

To ensure uninterrupted ...