
RAG evaluation with Ragas

15 min read
Apr 21, 2025
Contents
Retrieval-augmented generation process
Building a simple RAG app
Knowledge base for the RAG app
Vector store for document storage and retrieval
Retriever function: Fetching relevant context
Generative model: Answering the question
Testing the RAG app
Methods of evaluating LLM-based applications
Traditional methods
Non-traditional methods
Ragas RAG evaluation metrics
Context precision
Non-LLM-based context precision calculation: Using reference contexts aligned with the input query
LLM-based context precision calculation using reference answer to the input query
LLM-based context precision calculation without using reference answer to the input query
Context recall
Non-LLM-based context recall calculation
LLM-based context recall calculation
Context entities recall
Noise sensitivity
Response relevancy
Faithfulness
Conclusion

In today’s AI-driven landscape, creating personalized applications with Large Language Models (LLMs) is increasingly common. Techniques like RAG (retrieval-augmented generation) and fine-tuning language models on custom datasets are at the forefront, enabling developers to optimize models for specific tasks.

That said, a key challenge remains: how can we effectively measure the performance of these advanced AI systems? 

This is where LLM evaluation becomes crucial. Ragas, a Python library, provides a comprehensive suite of metrics for evaluating LLM-based applications, ensuring that models retrieve relevant information and generate high-quality responses. In this blog, we’ll explore retrieval-augmented generation for LLM applications (RAG for short) and demonstrate how Ragas helps developers refine RAG systems. We’ll build a simple question-answering application using RAG and evaluate its performance with Ragas.

Ragas

Before diving into RAG evaluation with Ragas, let’s first explore the RAG workflow and its components to determine what needs to be evaluated.

Retrieval-augmented generation process#

In a RAG application, when a user submits a question or prompt, the system retrieves relevant passages from a knowledge base, such as the internet, internal company documents, or other text repositories. These passages are combined with the user’s query to provide additional context, enabling the LLM to generate more accurate and context-aware responses.

Retrieval-augmented generation process

Therefore, in RAG, the following aspects need to be evaluated:

  1. Retrieval quality: How well does the system retrieve relevant and accurate information from the knowledge base? Poor retrieval can result in irrelevant or misleading context for the LLM.

  2. Response quality: How effectively does the LLM generate responses that align with the retrieved context, ensuring relevance and factual accuracy? How reliable is the LLM in producing accurate responses even when the retrieved context is incomplete or irrelevant?

With this understanding of the RAG workflow and its evaluation criteria, we’ll now build a simple question-answering RAG application and evaluate its performance using Ragas. We’ll explore RAG metrics provided by Ragas and use those metrics to evaluate the RAG application.

Building a simple RAG app#

Our RAG app relies on a vector database to store knowledge about a specific topic. It retrieves relevant chunks of information from the database, which the generative model then uses to answer queries about the topic.

Knowledge base for the RAG app#

A knowledge base is a collection of documents/sentences that contains useful information for the application. Our knowledge base consists of multiple statements about the RAG framework, its components, and its applications.

Here’s a Python list representing our knowledge base data, taken from Kaggle.

knowledge_base = [
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing.",
"It combines the power of retrieval-based models with generative models to improve response quality.",
"The retriever fetches relevant documents based on a query.",
"The generator uses the retrieved documents to generate a coherent and informative response.",
"This approach leverages large-scale pretrained models for both retrieval and generation tasks.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG can be used in various applications, including chatbots, question-answering systems, and more.",
"The framework was introduced by Facebook AI Research (FAIR) in 2020.",
"RAG aims to improve the informativeness and accuracy of generated responses.",
"The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"Generative models in RAG are typically based on architectures like BERT, GPT, or T5.",
"RAG can handle large-scale knowledge bases and provide specific answers to queries.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"Training RAG requires a large and diverse dataset to cover a wide range of possible queries.",
"RAG has shown significant improvements over traditional retrieval-based or generative models alone.",
"The architecture of RAG allows it to be fine-tuned for specific tasks or domains.",
"RAG integrates retrieval and generation in a seamless manner, improving overall system performance.",
"The use of retrieval-augmented generation helps in reducing hallucinations in generated text.",
"RAG's design allows it to leverage external knowledge sources effectively.",
"The application of RAG extends to areas like medical diagnosis, legal advice, and customer support.",
"By using retrieval-augmented techniques, RAG ensures that the responses are grounded in real data.",
"The flexibility of RAG makes it suitable for various languages and dialects.",
"RAG's performance can be enhanced by continuously updating the knowledge base with new information.",
"Researchers are exploring ways to make RAG more efficient and scalable for real-time applications.",
"The integration of retrieval and generation in RAG provides a powerful tool for AI developers.",
]
This is the data in the knowledge base to be used by the retriever in RAG (source: https://www.kaggle.com/code/arashnic/rag-with-sentence-and-hugging-face-transformers)

Each item in this list represents a piece of information, often referred to as a document in vector databases. These documents will be indexed and stored as vectors in the database, enabling efficient similarity-based retrieval.

Vector store for document storage and retrieval#

To implement this, we’ll use ChromaDB, a vector database that allows us to store and organize documents as collections and search them efficiently. The vector database allows the retrieval system (the retriever) to fetch the most relevant information based on the query input.

Here’s how we add the knowledge base data to ChromaDB:

import chromadb
# Initialize ChromaDB
client = chromadb.Client()
collection = client.get_or_create_collection(name="knowledge_base_collection")
# Add knowledge base documents to ChromaDB
for i, doc in enumerate(knowledge_base):
    collection.add(documents=[doc], ids=[str(i)])
Storing knowledge base data to Chroma vector database

This code initializes the ChromaDB client, creates a collection for the knowledge base, and then adds each document from the knowledge base into the collection.

Retriever function: Fetching relevant context#

The retriever’s job is to retrieve the most relevant context for answering a question. Given a query, the retriever searches through the knowledge base and returns the top k most relevant documents. The generator will then use this context to formulate an answer.

We define a function get_context() that queries the vector database and retrieves the top k results for a given question:

def get_context(question, top_k=5):
    results = collection.query(query_texts=[question], n_results=top_k)
    return " ".join(results["documents"][0])
Function to retrieve relevant context

In this function:

  • The question is the input query from the user.

  • The top_k parameter defines how many relevant documents to retrieve.

  • The results are then concatenated into a single string of context, which is passed to the next step to generate an answer.

Generative model: Answering the question#

Once we have the context, we need a generative model (in this case, OpenAI's GPT-4o mini) to generate an answer based on the retrieved information. The generate_answer function is used for this purpose:

To use OpenAI's model, you need an OpenAI API key.

import openai
# OpenAI API Key
openai.api_key = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Function to generate an answer
def generate_answer(question):
    context = get_context(question)
    print("Retrieved text chunks (Context): ", context)
    prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
Query augmentation and response generation

Here, we:

  • Retrieve the context for the question using the get_context function.

  • Augment the user’s query by constructing a prompt that combines the retrieved context with the question.

  • Call OpenAI’s API with that prompt to generate a response.

Testing the RAG app#

Now that we’ve set up the retriever and generator, let’s test the pipeline with a sample question:

question = "What is the role of the retriever in RAG?"
answer = generate_answer(question)
print(f"Q: {question}\nA: {answer}")
Testing RAG

This will:

  • Use the retriever to fetch relevant documents from the knowledge base.

  • Use the generator to produce an answer based on the retrieved documents.

Output:

Retrieved text chunks (Context): The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models. One of the key benefits of RAG is its ability to provide contextually rich and accurate responses. The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer. Fine-tuning RAG involves training both the retriever and generator components. RAG stands for Retrieval-Augmented Generation, a method in natural language processing.
Q: What is the role of the retriever in RAG?
A: The retriever in RAG selects relevant passages to provide contextually rich and accurate responses for the generator to produce an answer.
Retrieved context and the generated answer

We have finished setting up an LLM-based RAG app that we will evaluate using Ragas. Before using the RAG evaluation metrics Ragas provides, let’s look at traditional and LLM-based evaluation methods, since the Ragas metrics build on both.

Methods of evaluating LLM-based applications#

Evaluation methods for LLM applications can be broadly classified into two types: traditional and non-traditional.

Traditional methods#

Traditional methods rely on analyzing the exact arrangement and order of words and phrases in the text. They compare the generated output to a predefined reference text, often referred to as the ground truth. Metrics such as BLEU, ROUGE, and exact match are typical examples. They measure how closely the model’s predictions align with the expected response by evaluating aspects like word overlap, sequence similarity, and syntactic structure.

In Ragas, these methods are referred to as non-LLM-based methods.
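
To make this concrete, here is a small, hand-rolled sketch (not Ragas code) of two such reference-based checks: exact match and a simple unigram-overlap ratio in the spirit of BLEU. The example strings are taken from our knowledge base.

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 only if the generated text matches the reference exactly (case-insensitive)
    return float(prediction.strip().lower() == reference.strip().lower())

def unigram_overlap(prediction: str, reference: str) -> float:
    # Fraction of predicted words that also appear in the reference (BLEU-1-style precision)
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens:
        return 0.0
    return sum(token in ref_tokens for token in pred_tokens) / len(pred_tokens)

reference = "The retriever fetches relevant documents based on a query."
prediction = "The retriever fetches documents relevant to a query."
print(exact_match(prediction, reference))      # 0.0 -- the strings are not identical
print(unigram_overlap(prediction, reference))  # 0.875 -- most words overlap
A hand-rolled sketch of traditional, reference-based checks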

Non-traditional methods#

These methods leverage the semantic understanding and representational capabilities of LLMs to evaluate text. Instead of focusing solely on surface-level word arrangement, they assess the deeper meaning, coherence, and contextual relevance of the text. Non-traditional metrics can function both with a reference text (to gauge how well the meaning aligns) and without one (to assess intrinsic quality).

In Ragas, these methods are referred to as LLM-based methods.
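
As a rough illustration of the idea (this is not the prompt or scoring logic Ragas uses internally), we could ask an LLM to judge relevance directly. Ragas wraps this kind of judgment in well-defined metrics, which we cover next.

import openai

openai.api_key = "REPLACE_WITH_YOUR_OpenAI_API_KEY"

def judge_relevance(question, answer):
    # Ask the model for a 0-1 relevance verdict; the prompt here is purely illustrative
    prompt = (
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        "On a scale from 0 (irrelevant) to 1 (fully relevant), how relevant is the "
        "answer to the question? Reply with a single number."
    )
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

print(judge_relevance(
    "What is the role of the retriever in RAG?",
    "The retriever in RAG selects relevant passages for the generator."
))
A simplified illustration of LLM-as-judge evaluation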

Ragas RAG evaluation metrics#

Now that we understand how LLM-based applications can be evaluated using traditional and non-traditional methods, let’s see how Ragas simplifies this process with its built-in metrics. 

The image below highlights the RAG evaluation metrics offered by Ragas. We will understand what each metric measures and how to calculate the score for each one:

Ragas RAG evaluation metrics

Context precision#

Context precision evaluates how good the system is at finding the right information. It measures what fraction of the text chunks it retrieves are actually relevant to the query.

Context precision is represented by Context Precision@K and is calculated by averaging the precision scores at the ranks of all relevant chunks within the top K retrieved text chunks:

\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left(\text{Precision@k} \times v_k\right)}{\text{Total number of relevant items in the top } K \text{ results}}

where $v_k = 1$ if the chunk at rank $k$ is relevant; otherwise, $v_k = 0$.

Rank refers to the position of a chunk in the list of retrieved text chunks, sorted by the retriever’s estimated relevance to the query. This differs from the actual relevance used for RAG evaluation, where $v_k = 1$ if the chunk at rank $k$ is truly relevant and $v_k = 0$ otherwise.

Precision@k measures the proportion of relevant chunks among the top $k$ retrieved chunks. It is calculated as:

\text{Precision@k} = \frac{\text{true positives@k}}{\text{true positives@k} + \text{false positives@k}}

Where:

  • true positives@k represents the number of relevant chunks among the top $k$ retrieved chunks.

  • false positives@k represents the number of non-relevant chunks among the top $k$ retrieved chunks.
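
To see the formula in action, here is a small hand-rolled calculation (not Ragas code). It assumes K = 5 retrieved chunks where only the chunk at rank 3 is relevant, which is exactly the situation in the sample we evaluate below.

def context_precision_at_K(relevance):
    # relevance[k-1] is v_k: 1 if the chunk at rank k is relevant, 0 otherwise
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    for k in range(1, len(relevance) + 1):
        precision_at_k = sum(relevance[:k]) / k      # relevant chunks among the top k
        score += precision_at_k * relevance[k - 1]   # only ranks holding a relevant chunk contribute
    return score / total_relevant

print(context_precision_at_K([0, 0, 1, 0, 0]))  # 0.3333... -- Precision@3 = 1/3 is the only contribution
A worked example of the context precision formula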

Wondering how we can determine if a retrieved chunk is actually relevant?
Ragas enables the evaluation of context relevance both with and without ground truth. It provides three methods to calculate context precision: two that are reference-based and one that does not require a reference.

Non-LLM-based context precision calculation: Using reference contexts aligned with the input query#

This method uses traditional metrics, such as string similarity, BLEU score, and string presence, as distance measures to assess how closely a retrieved context matches the reference contexts.

from ragas import SingleTurnSample
sample = SingleTurnSample(
retrieved_contexts=["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."],
reference_contexts=["The retriever fetches relevant documents based on a query.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer."]
)
from ragas.metrics import NonLLMContextPrecisionWithReference
context_precision = NonLLMContextPrecisionWithReference()
score = await context_precision.single_turn_ascore(sample)
print(f"Non-LLM-based context precision with reference: {score}")
  • We import the SingleTurnSample class from the Ragas library, which represents a single interaction typically consisting of a query, the retrieved contexts in response to it, and the reference contexts used for evaluation.

  • We create an evaluation sample that includes the retrieved contexts and reference contexts needed to calculate the context precision score. Depending on the requirements of a specific metric, the sample can also include other inputs, such as the user input, generated response, or reference response. We will see such samples as we progress.

  • We import NonLLMContextPrecisionWithReference, the specific metric from the Ragas library that we want to use, and create an instance of it. This instance will be used to compute the context precision score.

  • We call the single_turn_ascore method on the metric instance, passing the evaluation sample (sample) as input. This method computes the context precision score for the retrieved contexts using the provided reference contexts. The computation uses non-LLM-based distance measures like string similarity, BLEU, or exact match.

  • We print the computed precision score to show how well the retrieved contexts match the reference contexts.

Output:

Non-LLM-based context precision with reference: 0.3333333333
The output of non-LLM-based context precision calculation using a reference answer

Only the retrieved context at rank 3 matches one of the two reference contexts, while the remaining four are irrelevant. So Precision@3 = 1/3, and the other ranks contribute nothing because $v_k = 0$ for k = 1, 2, 4, 5. With a single relevant item in the top 5, it follows that Context Precision@5 = 1/3, which is what we see in the output above.

LLM-based context precision calculation using reference answer to the input query#

In this method, an LLM is used to determine the relevance of a retrieved context by comparing it with a reference answer. For LLM-based evaluation methods, we need to set up the LLM that Ragas will use as the evaluator.

from ragas import SingleTurnSample
# A sample from evaluation dataset
sample = SingleTurnSample(
user_input = "What is the role of the retriever in RAG?",
reference = "The role of the retriever in RAG is to fetch relevant passages based on a query for the generator to produce an answer.",
retrieved_contexts = ["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."],
)
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
# Set OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Initialize the OpenAI LLM and wrap it
llm = ChatOpenAI(model="gpt-4")
wrapped_llm = LangchainLLMWrapper(llm)
from ragas.metrics import LLMContextPrecisionWithReference
context_precision = LLMContextPrecisionWithReference(llm=wrapped_llm)
score = await context_precision.single_turn_ascore(sample)
print(f"LLM-based context precision with reference: {score}")
  • We define an evaluation sample using the SingleTurnSample class, which includes the user input, reference response, and retrieved contexts. This sample provides the data needed for evaluation.

  • We import the necessary libraries to set up the LLM provided by OpenAI. We configure the API key to authenticate our application with OpenAI services, enabling access to their LLM. Once configured, we initialize the OpenAI LLM, specifying the gpt-4 model. Finally, we wrap this LLM with the LangchainLLMWrapper to make it compatible with the Ragas framework.

  • We import the LLMContextPrecisionWithReference metric from Ragas, instantiate it with the wrapped LLM, and calculate the context precision score for the given sample. The resulting score measures how well the retrieved contexts align with the reference using LLM-based evaluation.

Output:

LLM-based context precision with reference: 0.3333333333
The output of LLM-based context precision calculation using a reference answer

The retrieved context at rank 3 closely matches the reference answer.

LLM-based context precision calculation without using reference answer to the input query#

In this method, an LLM is used to determine the relevance of a retrieved context by comparing it with the LLM’s response to the input query.

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference
# A sample from evaluation dataset
sample = SingleTurnSample(
user_input = "What is the role of the retriever in RAG?",
response = "The retriever in RAG selects relevant passages to provide contextually rich and accurate responses for the generator to produce an answer.",
retrieved_contexts = ["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."],
)
# Set your OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Initialize the OpenAI LLM and wrap it
llm = ChatOpenAI(model="gpt-4", temperature=0)
wrapped_llm = LangchainLLMWrapper(llm)
context_precision = LLMContextPrecisionWithoutReference(llm=wrapped_llm)
score = await context_precision.single_turn_ascore(sample)
print(f"LLM-based context precision without reference: {score}")

Output:

LLM-based context precision without reference: 0.3333333333
The output of LLM-based context precision calculation without using any reference

The retrieved context at rank 3 closely matches the LLM’s response to the query.

Context recall#

Context recall measures how many relevant text chunks are successfully retrieved. The relevance of retrieved chunks is determined by predefined reference contexts, which serve as the ground truth. The score ranges from 0 to 1, where 1 means that every reference context has at least one matching retrieved chunk.

A higher recall score indicates that fewer relevant items (as defined by the reference contexts) were missed during retrieval.

Non-LLM-based context recall calculation#

This method uses simple string comparison measures to check whether a reference context matches any of the retrieved text chunks. The context recall in this method is calculated as:

\text{Context Recall} = \frac{\text{Number of reference contexts with at least one matching retrieved chunk}}{\text{Total number of reference contexts}}

from ragas import SingleTurnSample
from ragas.metrics import NonLLMContextRecall
sample = SingleTurnSample(
retrieved_contexts=["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."],
reference_contexts=["The retriever fetches relevant documents based on a query.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer."]
)
context_recall = NonLLMContextRecall()
score = await context_recall.single_turn_ascore(sample)
print(f"Non-LLM-based context recall with reference contexts: {score}")

Output:

Non-LLM-based context recall with reference contexts: 0.5
The output of non-LLM-based context recall calculation with reference contexts

Out of the two reference contexts, one has a matching retrieved context.
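
As a rough sketch of what happens under the hood (Ragas actually relies on string-distance measures rather than the exact matching used here), we can count how many reference contexts have a matching retrieved chunk:

retrieved = sample.retrieved_contexts
references = sample.reference_contexts

# A reference context counts as recalled if some retrieved chunk matches it exactly
matched = sum(
    any(ref.strip() == chunk.strip() for chunk in retrieved)
    for ref in references
)
print(matched / len(references))  # 0.5 -- one of the two reference contexts was retrieved
A simplified sketch of the non-LLM context recall arithmetic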

Challenge: Annotating reference contexts can be a time-consuming task.
Ragas simplifies this by using a reference answer as a proxy for multiple reference contexts! To determine the context recall, the reference answer is split into claims. Claims are individual assertions that are self-contained and convey a fact or proposition that can be verified or attributed to a particular source or context. For each claim, we determine whether it is supported by the retrieved contexts. Ideally, every claim in the reference answer should be supported by the retrieved context.

LLM-based context recall calculation#

In this method, the relevance between a claim and the retrieved context is determined using an LLM. The context recall in this method is calculated as:

\text{Context Recall} = \frac{\text{Number of claims in the reference answer supported by the retrieved context}}{\text{Total number of claims in the reference answer}}

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import LLMContextRecall
# Set your OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Initialize the OpenAI LLM and wrap it
llm = ChatOpenAI(model="gpt-4", temperature=0)
wrapped_llm = LangchainLLMWrapper(llm)
sample = SingleTurnSample(
user_input="What is the role of the retriever in RAG?",
reference="The role of the retriever in RAG is to fetch relevant passages based on a query for the generator to produce an answer.",
response="The retriever in RAG selects relevant passages to provide contextually rich and accurate responses for the generator to produce an answer.",
retrieved_contexts=["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."],
)
context_recall = LLMContextRecall(llm=wrapped_llm)
score = await context_recall.single_turn_ascore(sample)
print(f"LLM-based context recall with reference answer: {score}")

Output:

LLM-based context recall with reference answer: 1.0
The output of LLM-based context recall calculation with the reference answer

All possible claims in the reference answer can be attributed to the third retrieved context.
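
The arithmetic behind this score looks roughly like the sketch below. The claim split and the support verdicts shown here are hypothetical; in Ragas, both are produced by the evaluator LLM.

# Hypothetical decomposition of the reference answer into claims
claims = [
    "The retriever in RAG fetches relevant passages based on a query.",
    "The generator uses the retrieved passages to produce an answer.",
]
supported = [True, True]  # both claims are covered by the third retrieved context
print(sum(supported) / len(claims))  # 1.0
A sketch of the claim-based context recall arithmetic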

Context entities recall#

Context entities recall is a metric that evaluates the proportion of entities from the reference answer that are successfully retrieved in the context.

Entities are distinct, meaningful units of information in a text. They typically refer to people, places, organizations, concepts, or objects. For example, in the sentence “Marie Curie won the Nobel Prize for her work on radioactivity,” the extracted entities could be Marie Curie (Person), Nobel Prize (Award), and Radioactivity (Scientific Concept).

The context entity recall metric uses the wrapped LLM to extract entities from both the reference and the retrieved contexts during evaluation, enabling a structured comparison.

Following is the formula to compute the context entities recall metric:

\text{Context Entities Recall} = \frac{|\text{Entities in reference} \cap \text{Entities in retrieved contexts}|}{|\text{Entities in reference}|}
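
As a quick set-based illustration of the formula, reusing the Marie Curie example above with assumed, hand-picked entity sets rather than LLM-extracted ones:

reference_entities = {"Marie Curie", "Nobel Prize", "Radioactivity"}
context_entities = {"Marie Curie", "Nobel Prize", "Physics"}  # assumed extraction from a retrieved context

recall = len(reference_entities & context_entities) / len(reference_entities)
print(recall)  # 0.666... -- two of the three reference entities appear in the context

Now let’s compute the metric with Ragas for two samples: our running example, and a second sample where the retrieved contexts closely match the reference answer.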

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas import SingleTurnSample
from ragas.metrics import ContextEntityRecall
# Set your OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Initialize the OpenAI LLM and wrap it
llm = ChatOpenAI(model="gpt-4", temperature=0)
wrapped_llm = LangchainLLMWrapper(llm)
# Create a sample for evaluation
sample_1 = SingleTurnSample(
reference="The role of the retriever in RAG is to fetch relevant passages based on a query for the generator to produce an answer.",
retrieved_contexts=["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."],
)
# Create another sample where the retrieved contexts are the same as the reference contexts; expecting a higher context entities recall score
sample_2 = SingleTurnSample(
reference="The role of the retriever in RAG is to fetch relevant passages based on a query for the generator to produce an answer.",
retrieved_contexts=["The retriever fetches relevant documents based on a query.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer."],
)
# Initialize the ContextEntityRecall scorer with the wrapped LLM
scorer = ContextEntityRecall(llm=wrapped_llm)
# Calculate the score
score_1 = await scorer.single_turn_ascore(sample_1)
print(f"LLM-based context entities recall with reference answer: {score_1}")
score_2 = await scorer.single_turn_ascore(sample_2)
print(f"LLM-based context entities recall assuming rerieved contexts are same as the reference contexts: {score_2}")

Output:

LLM-based context entities recall with reference answer: 0.33333333277777777
LLM-based context entities recall assuming retrieved contexts are the same as the reference contexts: 0.9999999983333333
The output of LLM-based context entities recall calculation with reference answer

For sample 1, the result shows that one-third of the entities in the reference answer match those in the retrieved contexts. While it might seem that most of the entities from the reference answer appear in the retrieved contexts, the outcome depends entirely on the entity extraction performed behind the scenes.

For sample 2, the result shows that most of the entities in the reference answer match the entities in the retrieved contexts.

So far, we have examined the evaluation metrics for the context retriever. Now, we will examine the evaluation metrics for the response generator, which is essentially an LLM.

Noise sensitivity#

Noise sensitivity is a metric used to evaluate how often a system produces incorrect responses when utilizing either relevant or irrelevant retrieved contexts. It is computed based on the user input, reference answer, generated response, and retrieved contexts.

To calculate noise sensitivity, the generated response is broken down into individual claims, and each claim is evaluated for correctness. Claims (response statements) that are not supported by the reference answer are classified as incorrect claims.

Noise sensitivity is calculated under the assumption that the retrieved contexts are relevant. In such cases, any incorrect claims in the response are attributed to the generator rather than the retriever. The formula for calculating noise sensitivity when the retrieved contexts are relevant is as follows:

\text{Noise Sensitivity (relevant)} = \frac{\text{Number of incorrect claims in the response}}{\text{Total number of claims in the response}}

Noise sensitivity scores range from 0 to 1. A lower noise sensitivity value indicates better performance.
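
The underlying arithmetic is a simple ratio, sketched below with a hypothetical claim split and correctness verdicts (in Ragas, the evaluator LLM produces both). After that, we compute the metric with Ragas.

# Hypothetical claims extracted from the generated response
response_claims = [
    "The retriever in RAG selects relevant passages.",
    "The passages help the generator produce an answer.",
]
incorrect = [False, False]  # neither claim contradicts the reference answer
print(sum(incorrect) / len(response_claims))  # 0.0 -- no incorrect claims, so no noise sensitivity
A sketch of the noise sensitivity arithmetic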

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas import SingleTurnSample
from ragas.metrics import NoiseSensitivity
# Set your OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Initialize the OpenAI LLM and wrap it
llm = ChatOpenAI(model="gpt-4", temperature=0)
wrapped_llm = LangchainLLMWrapper(llm)
# Creating the evaluation sample from the original example
sample = SingleTurnSample(
user_input = "What is the role of the retriever in RAG?",
reference ="The role of the retriever in RAG is to fetch relevant passages based on a query for the generator to produce an answer.",
response = "The retriever in RAG selects relevant passages to provide contextually rich and accurate responses for the generator to produce an answer.",
retrieved_contexts=["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."],
)
scorer = NoiseSensitivity(llm=wrapped_llm)
score= await scorer.single_turn_ascore(sample)
print(f"Noise sensitivity: {score}")

Output:

Noise sensitivity: 0.0
The output of LLM-based noise sensitivity score calculation with the reference answer

A noise sensitivity value of zero indicates that the system has generated a correct response.

Response relevancy#

Response relevancy measures how well the generated response aligns with the given prompt. Responses that are incomplete or include unnecessary/redundant details receive lower scores, while more relevant and precise answers achieve higher scores.

This metric is determined by generating a set of artificial questions from the response (essentially reverse-engineering it) and calculating the average cosine similarity between these questions and the original user input. A higher average similarity indicates better relevancy. Following is the formula to calculate answer relevancy:

\text{Answer Relevancy} = \frac{1}{N}\sum_{i=1}^{N} \cos\left(E_{g_i}, E_o\right)

Where:

  • $E_{g_i}$ is the embedding of the generated question $i$.

  • $E_o$ is the embedding of the original question.

  • $N$ is the number of generated questions, which is 3 by default.
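
Here is a toy numeric sketch of that average (not Ragas code), using made-up 3-dimensional vectors in place of the real embeddings, which have on the order of a thousand dimensions. We then compute the metric with Ragas.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

E_o = np.array([0.9, 0.1, 0.3])      # embedding of the original question
E_g = [np.array([0.8, 0.2, 0.3]),    # embeddings of the artificial questions
       np.array([0.9, 0.0, 0.4]),
       np.array([0.7, 0.3, 0.2])]

relevancy = sum(cosine_similarity(E_o, e) for e in E_g) / len(E_g)
print(relevancy)  # close to 1.0 because the toy "generated questions" resemble the original
A toy sketch of the response relevancy calculation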

from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas import SingleTurnSample
from ragas.metrics import ResponseRelevancy
# Set your OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Initialize the OpenAI LLM and wrap it
llm = ChatOpenAI(model="gpt-4", temperature=0)
wrapped_llm = LangchainLLMWrapper(llm)
# Initialize the embeddings model
embeddings = OpenAIEmbeddings() # Uses OpenAI embeddings by default
sample = SingleTurnSample(
user_input="What is the role of the retriever in RAG?",
response="The retriever in RAG selects relevant passages to provide contextually rich and accurate responses for the generator to produce an answer.",
retrieved_contexts=["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."]
)
scorer = ResponseRelevancy(llm=wrapped_llm, embeddings=embeddings)
score = await scorer.single_turn_ascore(sample)
print(f"Respone relevancy: {score}")
  • We import OpenAI’s embedding model alongside the chat model.

  • We initialize the embedding model, which generates vector representations of text.

  • We pass the embedding model to the response relevancy scorer, which uses it for the cosine similarity calculation.

Output:

Response relevancy: 0.9792333695213831
The output of response relevancy calculation

A response relevancy score of 0.979 indicates that the response is relevant to the given prompt.

Faithfulness#

Faithfulness is a metric that evaluates how consistent the generated answer is with the given context, in terms of factual accuracy. The score ranges from 0 to 1, where higher values indicate better faithfulness.

A generated answer is considered faithful if every claim it makes can be reliably inferred from the provided context. To compute this, the claims in the generated answer are first identified. Then, each claim is cross-referenced with the context to check whether it can be supported or inferred from it. The faithfulness score is calculated as follows:

\text{Faithfulness} = \frac{\text{Number of claims in the response supported by the retrieved context}}{\text{Total number of claims in the response}}
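
The score is again a simple ratio over claims. Below is a minimal sketch of the arithmetic with placeholder counts that mirror the result we get from Ragas further down; the actual claim extraction and verification are done by the evaluator LLM.

supported_claims = 1   # claims in the response the evaluator judged inferable from the context (placeholder)
total_claims = 3       # total claims the evaluator extracted from the response (placeholder)
print(supported_claims / total_claims)  # 0.333...
A sketch of the faithfulness arithmetic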

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas import SingleTurnSample
from ragas.metrics import Faithfulness
# Set your OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Initialize the OpenAI LLM and wrap it
llm = ChatOpenAI(model="gpt-4", temperature=0)
wrapped_llm = LangchainLLMWrapper(llm)
sample = SingleTurnSample(
user_input="What is the role of the retriever in RAG?",
response="The retriever in RAG selects relevant passages to provide contextually rich and accurate responses for the generator to produce an answer.",
retrieved_contexts=["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."]
)
scorer = Faithfulness(llm=wrapped_llm)
score = await scorer.single_turn_ascore(sample)
print(f"Faithfulness: {score}")

Output:

Faithfulness: 0.3333333333333333
The output of the faithfulness score calculation

The result indicates that one-third of the claims can be inferred from the retrieved contexts. Since the Ragas code is open-source, we can debug by printing each step of the calculation. This allows us to understand how the final score is derived.

Conclusion#

Evaluating a retrieval-augmented generation (RAG) app with Ragas’ metrics ensures that we optimize each component for better performance. By leveraging ChromaDB for efficient document retrieval and OpenAI’s GPT-4o mini for answer generation, we built a system capable of answering questions with contextual accuracy. Through this process, we explored key RAG evaluation metrics—such as context precision, recall, noise sensitivity, response relevancy, and faithfulness—that help assess retrieval accuracy, minimize errors, and enhance response quality.

With Ragas, developers can iteratively fine-tune their RAG applications to reduce errors and improve reliability. Consider integrating it into your CI/CD pipeline to monitor performance changes over time and continuously refine your system for better user experience.


Frequently Asked Questions

How does Ragas fit into a CI/CD pipeline?

Ragas metrics can act as automated checks in your CI/CD pipeline. For example:

  • Baseline creation: Before introducing changes, you run Ragas to establish baseline scores (e.g., precision: 0.75, recall: 0.80, faithfulness: 0.90).
  • Testing changes: When you modify your system (e.g., fine-tune the retriever, switch vector databases, or adjust prompt engineering), you run Ragas again to compare the new performance with the baseline. If scores drop (e.g., precision drops from 0.75 to 0.60), the pipeline flags it as a potential regression.
  • Automated alerts: Ragas can help ensure you don’t deploy updates that inadvertently degrade performance.

Example workflow in a CI/CD pipeline

  • Add Ragas tests to your pipeline: Run Ragas evaluation scripts as part of the “Test” phase in your pipeline. Compare current metrics to the previous baseline.
  • Pass/fail criteria: If metrics like context precision or faithfulness fall below a defined threshold (e.g., precision < 0.70), fail the pipeline to prevent deployment (see the sketch after this list).
  • Output results: Use logs to monitor metric trends (e.g., “Precision increased by 5% after retriever fine-tuning”). Visualize trends in dashboards using tools like Grafana or Datadog.
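
A minimal sketch of such a gate is shown below. It assumes the evaluation step has already written the metric scores to a ragas_results.json file (a hypothetical artifact name), and the thresholds are examples.

import json
import sys

THRESHOLDS = {"context_precision": 0.70, "faithfulness": 0.85}  # example thresholds

with open("ragas_results.json") as f:  # hypothetical file produced by the evaluation step
    scores = json.load(f)

failures = {m: s for m, s in scores.items() if m in THRESHOLDS and s < THRESHOLDS[m]}
if failures:
    print(f"Ragas regression detected: {failures}")
    sys.exit(1)  # non-zero exit fails the CI job and blocks deployment
print("All Ragas metrics are above their thresholds.")
A sketch of a Ragas-based CI gate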



Written By:
Asmat Batool