Inference Optimization in GenAI Systems

Learn key inference optimization techniques, such as quantization, pruning, and batching, to build scalable and efficient generative AI systems for real-world applications.

Machine learning (ML) models are trained to make predictions and generate output from input data. Inference in ML refers to feeding live data to a trained model so it can recognize patterns, make predictions, or solve a task. Inference shows how well a model responds to new data after training, which may include measuring how quickly the model produces its output or evaluating the quality of that output.
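For example, here is a minimal PyTorch sketch of running inference on a trained model and timing it. The tiny model and the input shapes are placeholders, not part of any real system; in practice you would load your own trained weights:

```python
import time
import torch
import torch.nn as nn

# Placeholder for a trained model; in practice, load real trained weights.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()  # switch to inference mode (disables dropout, etc.)

batch = torch.randn(32, 128)  # a batch of 32 "live" inputs

with torch.no_grad():  # no gradients are needed at inference time
    start = time.perf_counter()
    outputs = model(batch)
    latency_ms = (time.perf_counter() - start) * 1000

print(f"Output shape: {tuple(outputs.shape)}, latency: {latency_ms:.2f} ms")
```

Timing a single forward pass like this gives a rough latency figure; production measurements would average over many requests and account for warm-up.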

Once you are satisfied with the output the model generates during inference, the next step is to scale it so users can experience how well it was developed and trained. That is where the challenge lies: users care not only about accuracy but also about overall performance, such as latency, availability, and scalability, while the service provider must also manage cost and energy consumption.

What and why of inference optimization

Inference optimization is the process of improving the speed, scale, and efficiency of an AI system's inference without compromising the accuracy of its results. It is essential when building services for production environments that must handle heavy user traffic, especially during peak hours.
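As a concrete taste of one such technique, the sketch below applies post-training dynamic quantization in PyTorch, which stores Linear-layer weights in int8 instead of float32. The toy model is again a placeholder, and actual size and speed gains depend on the model and hardware:

```python
import torch
import torch.nn as nn

# Placeholder for a trained model; assume the weights are already trained.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Dynamic quantization: Linear weights are stored as int8, while
# activations stay in float. This shrinks the model and often speeds
# up CPU inference with little accuracy loss.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized(x).shape)  # same interface, smaller and faster model
```

Quantization, pruning, and batching each trade a little model fidelity or scheduling complexity for throughput and cost savings; the sections that follow explore these techniques in more detail.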
