Evaluation Metrics for GenAI Systems
Learn the evaluation metrics used to measure the performance of Generative AI models.
Evaluating generative AI models involves assessing their performance and effectiveness in creating new content, such as text, images, audio, or videos. Unlike traditional AI systems that analyze existing (training) data to classify information or predict outcomes, generative models produce novel outputs that should be coherent, creative, and relevant to the input or context.
Generative AI model evaluation is crucial today because:
Evaluation ensures the generated content is meaningful, accurate, and meets the required standards. For example, in text generation, the output must be fluent, contextually relevant, and error-free.
Evaluation metrics help compare the effectiveness of different GenAI systems, guiding developers in choosing or improving models for specific applications.
Proper evaluation can highlight biases or harmful content in generated outputs, ensuring the ethical deployment of generative systems.
Evaluation is critical in avoiding costly failures, such as misleading or low-quality outputs reaching end users.
Evaluation can be conducted in two ways:
Automatic metrics: Quantitative measures that do not require human intervention. Usually, these metrics produce scores ranging from 0 (worst) to 1 (best).
Human evaluation: Qualitative insights gained by collecting user feedback or expert judgments. Usually, users rate the outputs on a scale of 1 (worst) to 5 (best).
While these evaluation metrics are not exclusive to GenAI models, they are critical in assessing generated content quality. This discussion will explore some of the most commonly used metrics in GenAI model evaluation and their applications.
Automatic metrics
Automatic metrics rely on computational methods to assess generative AI outputs. They provide fast, consistent, and scalable evaluations without human involvement and are widely used during model training and testing phases. The most common automatic metrics include inception score (IS), Fréchet inception distance (FID), BLEU score, ROUGE score, perplexity, and CLIP score.
Inception score
The inception score uses a pretrained classifier (e.g., Inception v3) to judge generated images on two criteria: how confidently the classifier can label each image (quality) and how varied the predicted labels are across images (diversity). Higher scores indicate better image generation.
The inception score is calculated using the following steps:
Finding the label distribution: the probabilities a classifier assigns to the different image classes for a single generated image. It's like getting the classifier's "best guess" about the image's contents.
Finding the marginal distribution: an overall picture of the diversity of a set of generated images, obtained by averaging the individual label distributions. It reveals how well the generated images cover a variety of different objects or concepts.
Calculating the Kullback–Leibler (KL) divergence between each image's label distribution and the marginal distribution.
Taking the exponential of the average KL divergence to give us the final score.
Let’s look at all those terms in action:
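Here is a minimal NumPy sketch of the calculation, assuming we already have the classifier's softmax outputs for each generated image. The inception_score helper and the toy probabilities are made up for this illustration and are not taken from any particular library.

```python
import numpy as np

def inception_score(class_probs, eps=1e-12):
    """Compute the inception score from classifier outputs.

    class_probs: array of shape (num_images, num_classes), where each row is
    the label distribution p(y|x) the classifier assigns to one generated image.
    """
    class_probs = np.asarray(class_probs, dtype=np.float64)

    # Marginal distribution p(y): average of the per-image label distributions.
    marginal = class_probs.mean(axis=0, keepdims=True)

    # KL divergence between each label distribution and the marginal.
    kl_per_image = np.sum(
        class_probs * (np.log(class_probs + eps) - np.log(marginal + eps)),
        axis=1,
    )

    # Inception score: exponential of the average KL divergence.
    return float(np.exp(kl_per_image.mean()))


# Toy example: three "generated images" scored over four classes.
probs = [
    [0.90, 0.05, 0.03, 0.02],  # confidently class 0 -> high quality
    [0.02, 0.92, 0.04, 0.02],  # confidently class 1 -> adds diversity
    [0.03, 0.02, 0.90, 0.05],  # confidently class 2 -> adds diversity
]
print(f"Inception score: {inception_score(probs):.3f}")
```

Because each toy image is classified confidently (quality) and the three images cover three different classes (diversity), each label distribution is peaked while the marginal distribution is spread out, so the average KL divergence, and therefore the score, is high.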
Note: While the inception score is primarily used for evaluating image generation, there are a few ways to adapt its core ideas for text generation. Instead of using a pretrained image classifier, we can use a pretrained language model. This language model can analyze the generated text and provide probabilities for different categories or topics.
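One possible, purely illustrative way to do this (not a standardized metric) is to use an off-the-shelf zero-shot topic classifier to produce a probability distribution over candidate topics for each generated text and then apply the same KL-divergence-and-exponential recipe. The topics and sample sentences below are arbitrary.

```python
import numpy as np
from transformers import pipeline

# A zero-shot topic classifier stands in for the image classifier used by IS.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

topics = ["sports", "politics", "technology", "cooking"]
generated_texts = [
    "The striker scored twice in the final minutes of the match.",
    "The new chip is said to double battery life in budget smartphones.",
]

# Build a per-text "label distribution" over the candidate topics.
probs = []
for text in generated_texts:
    result = classifier(text, candidate_labels=topics)
    score_by_label = dict(zip(result["labels"], result["scores"]))
    probs.append([score_by_label[t] for t in topics])  # fixed topic order

# Same recipe as the inception score: exp of the mean KL(label || marginal).
probs = np.array(probs)
marginal = probs.mean(axis=0, keepdims=True)
kl = np.sum(probs * (np.log(probs + 1e-12) - np.log(marginal + 1e-12)), axis=1)
print(f"IS-style score over topics: {np.exp(kl.mean()):.3f}")
```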
Fréchet inception distance
FID compares the mean and covariance of feature embeddings from real and generated images. We get these embeddings using a pretrained embedding model (e.g., Inception v3). Unlike IS, it is more sensitive to subtle flaws, such as mode collapse, where a model produces outputs with low diversity. Lower FID scores indicate greater similarity between the two distributions, i.e., better images.
FID can be calculated by:
Generating embeddings of the real and the generated images using an embedding model.
Calculating the mean and the covariance matrix of each set of embedding vectors.
Using the Fréchet distance formula to compare the two Gaussian distributions defined by these statistics: FID = ||μ_real − μ_gen||² + Tr(Σ_real + Σ_gen − 2(Σ_real Σ_gen)^(1/2)) (see the code sketch after these steps).
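Below is a minimal sketch of these steps with NumPy and SciPy, assuming the embeddings have already been extracted (for example, with Inception v3). The frechet_inception_distance function and the random toy embeddings are illustrative only.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(real_embeddings, generated_embeddings):
    """FID between two sets of embedding vectors (one row per image)."""
    real = np.asarray(real_embeddings, dtype=np.float64)
    gen = np.asarray(generated_embeddings, dtype=np.float64)

    # Mean and covariance matrix of each set of embeddings.
    mu_r, sigma_r = real.mean(axis=0), np.cov(real, rowvar=False)
    mu_g, sigma_g = gen.mean(axis=0), np.cov(gen, rowvar=False)

    # Squared distance between the means.
    mean_term = np.sum((mu_r - mu_g) ** 2)

    # Matrix square root of the product of the covariances.
    cov_sqrt = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(cov_sqrt):
        cov_sqrt = cov_sqrt.real  # drop tiny imaginary parts from numerical error

    return float(mean_term + np.trace(sigma_r + sigma_g - 2.0 * cov_sqrt))


# Toy example: small random vectors stand in for real Inception v3 features.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 64))
fake = rng.normal(0.3, 1.1, size=(500, 64))
print(f"FID: {frechet_inception_distance(real, fake):.2f}")
```

Identical distributions give an FID near zero; the score grows as the means and covariances of the two embedding sets drift apart.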
...