
RAG evaluation with Ragas

15 min read
Apr 21, 2025
Contents
Retrieval-augmented generation process
Building a simple RAG app
Knowledge base for the RAG app
Vector store for document storage and retrieval
Retriever function: Fetching relevant context
Generative model: Answering the question
Testing the RAG app
Methods of evaluating LLM-based applications
Traditional methods
Non-traditional methods
Ragas RAG evaluation metrics
Context precision
Non-LLM-based context precision calculation: Using reference contexts aligned with the input query
LLM-based context precision calculation using reference answer to the input query
LLM-based context precision calculation without using reference answer to the input query
Context recall
Non-LLM-based context recall calculation
LLM-based context recall calculation
Context entities recall
Noise sensitivity
Response relevancy
Faithfulness
Conclusion

In today’s AI-driven landscape, creating personalized applications with Large Language Models (LLMs) is increasingly common. Techniques like RAG (retrieval-augmented generation) and fine-tuning language models on custom datasets are at the forefront, enabling developers to optimize models for specific tasks.

That said, a key challenge remains: how can we effectively measure the performance of these advanced AI systems? 

This is where LLM evaluation becomes crucial. Ragas, a Python library, provides a comprehensive suite of metrics for evaluating LLM-based applications, ensuring that models retrieve relevant information and generate high-quality responses. In this blog, we’ll explore retrieval-augmented generation for LLM applications (RAG for short) and demonstrate how Ragas helps developers refine RAG systems. We’ll build a simple question-answering application using RAG and evaluate its performance with Ragas.

Ragas

Before diving into RAG evaluation with Ragas, let’s first explore the RAG workflow and its components to determine what needs to be evaluated.

Retrieval-augmented generation process#

In a RAG application, when a user submits a question or prompt, the system retrieves relevant passages from a knowledge base, such as the internet, internal company documents, or other text repositories. These passages are combined with the user’s query to provide additional context, enabling the LLM to generate more accurate and context-aware responses.

Retrieval-augmented generation process

Therefore, in RAG, the following aspects need to be evaluated:

  1. Retrieval quality: How well does the system retrieve relevant and accurate information from the knowledge base? Poor retrieval can result in irrelevant or misleading context for the LLM.

  2. Response quality: How effectively does the LLM generate responses that align with the retrieved context, ensuring relevance and factual accuracy? How reliable is the LLM in producing accurate responses even when the retrieved context is incomplete or irrelevant?

With this understanding of the RAG workflow and its evaluation criteria, we’ll now build a simple question-answering RAG application and evaluate its performance using Ragas. We’ll explore RAG metrics provided by Ragas and use those metrics to evaluate the RAG application.

Building a simple RAG app#

Our RAG app relies on a vector database to store knowledge about a specific topic. It retrieves relevant chunks of information from the database, which the generative model then uses to answer queries about the topic.

Knowledge base for the RAG app#

A knowledge base is a collection of documents/sentences that contains useful information for the application. Our knowledge base consists of multiple statements about the RAG framework, its components, and its applications.

Here’s a Python list representing our knowledge base data, taken from Kaggle.

knowledge_base = [
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing.",
"It combines the power of retrieval-based models with generative models to improve response quality.",
"The retriever fetches relevant documents based on a query.",
"The generator uses the retrieved documents to generate a coherent and informative response.",
"This approach leverages large-scale pretrained models for both retrieval and generation tasks.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG can be used in various applications, including chatbots, question-answering systems, and more.",
"The framework was introduced by Facebook AI Research (FAIR) in 2020.",
"RAG aims to improve the informativeness and accuracy of generated responses.",
"The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"Generative models in RAG are typically based on architectures like BERT, GPT, or T5.",
"RAG can handle large-scale knowledge bases and provide specific answers to queries.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"Training RAG requires a large and diverse dataset to cover a wide range of possible queries.",
"RAG has shown significant improvements over traditional retrieval-based or generative models alone.",
"The architecture of RAG allows it to be fine-tuned for specific tasks or domains.",
"RAG integrates retrieval and generation in a seamless manner, improving overall system performance.",
"The use of retrieval-augmented generation helps in reducing hallucinations in generated text.",
"RAG's design allows it to leverage external knowledge sources effectively.",
"The application of RAG extends to areas like medical diagnosis, legal advice, and customer support.",
"By using retrieval-augmented techniques, RAG ensures that the responses are grounded in real data.",
"The flexibility of RAG makes it suitable for various languages and dialects.",
"RAG's performance can be enhanced by continuously updating the knowledge base with new information.",
"Researchers are exploring ways to make RAG more efficient and scalable for real-time applications.",
"The integration of retrieval and generation in RAG provides a powerful tool for AI developers.",
]
This is the data in the knowledge base to be used by the retriever in RAG (source: https://www.kaggle.com/code/arashnic/rag-with-sentence-and-hugging-face-transformers)

Each item in this list represents a piece of information, often referred to as a document in vector databases. These documents will be indexed and stored as vectors in the database, enabling efficient similarity-based retrieval.

Vector store for document storage and retrieval#

To implement this, we’ll use ChromaDB, a vector database that allows us to store and organize documents as collections and search them efficiently. The vector database allows the retrieval system (the retriever) to fetch the most relevant information based on the query input.

Here’s how we add the knowledge base data to ChromaDB:

import chromadb
# Initialize ChromaDB
client = chromadb.Client()
collection = client.get_or_create_collection(name="knowledge_base_collection")
# Add knowledge base documents to ChromaDB
for i, doc in enumerate(knowledge_base):
    collection.add(documents=[doc], ids=[str(i)])
Storing knowledge base data to Chroma vector database

This code initializes the ChromaDB client, creates a collection for the knowledge base, and then adds each document from the knowledge base into the collection.

Retriever function: Fetching relevant context#

The retriever’s job is to retrieve the most relevant context for answering a question. Given a query, the retriever searches through the knowledge base and returns the top k most relevant documents. The generator will then use this context to formulate an answer.

We define a function get_context() that queries the vector database and retrieves the top k results for a given question:

def get_context(question, top_k=5):
    results = collection.query(query_texts=[question], n_results=top_k)
    return " ".join(results["documents"][0])
Function to retrieve relevant context

In this function:

  • The question is the input query from the user.

  • The top_k parameter defines how many relevant documents to retrieve.

  • The results are then concatenated into a single string of context, which is passed to the next step to generate an answer.

Generative model: Answering the question#

Once we have the context, we need a generative model (in this case, OpenAI's GPT-4o mini) to generate an answer based on the retrieved information. The generate_answer function is used for this purpose:

To use OpenAI's model, you need an OpenAI API key.

import openai
# OpenAI API Key
openai.api_key = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Function to generate an answer
def generate_answer(question):
    context = get_context(question)
    print("Retrieved text chunks (Context): ", context)
    prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
Query augmentation and response generation

Here, we:

  • Retrieve the context for the question using the get_context function.

  • Augment the user’s query by constructing a prompt that combines the retrieved context with the question.

  • Call OpenAI’s API with that prompt to generate a response.

Testing the RAG app#

Now that we’ve set up the retriever and generator, let’s test the pipeline with a sample question:

question = "What is the role of the retriever in RAG?"
answer = generate_answer(question)
print(f"Q: {question}\nA: {answer}")
Testing RAG

This will:

  • Use the retriever to fetch relevant documents from the knowledge base.

  • Use the generator to produce an answer based on the retrieved documents.

Output:

Retrieved text chunks (Context): The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models. One of the key benefits of RAG is its ability to provide contextually rich and accurate responses. The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer. Fine-tuning RAG involves training both the retriever and generator components. RAG stands for Retrieval-Augmented Generation, a method in natural language processing.
Q: What is the role of the retriever in RAG?
A: The retriever in RAG selects relevant passages to provide contextually rich and accurate responses for the generator to produce an answer.
Retrieved context and the generated answer

We have finished setting up an LLM-based RAG app that we will evaluate using Ragas. Before using the RAG evaluation metrics Ragas provides, let’s look at traditional and LLM-based evaluation methods, since the Ragas metrics build on both.

Methods of evaluating LLM-based applications#

Evaluation methods for LLM applications can be broadly classified into two types: traditional and non-traditional.

Traditional methods#

Traditional methods rely on analyzing the exact arrangement and order of words and phrases in the text. They compare the generated output to a predefined reference text, often referred to as the ground truth. Metrics such as BLEU, ROUGE, and exact match are typical examples. They measure how closely the model’s predictions align with the expected response by evaluating aspects like word overlap, sequence similarity, and syntactic structure.

In Ragas, these methods are referred to as non-LLM-based methods.
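
To make this concrete, here is a small, hand-rolled sketch (not Ragas code) of two such reference-based checks: exact match and a simple unigram-overlap ratio in the spirit of BLEU. The example strings are taken from our knowledge base.

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 only if the generated text matches the reference exactly (case-insensitive)
    return float(prediction.strip().lower() == reference.strip().lower())

def unigram_overlap(prediction: str, reference: str) -> float:
    # Fraction of predicted words that also appear in the reference (BLEU-1-style precision)
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens:
        return 0.0
    return sum(token in ref_tokens for token in pred_tokens) / len(pred_tokens)

reference = "The retriever fetches relevant documents based on a query."
prediction = "The retriever fetches documents relevant to a query."
print(exact_match(prediction, reference))      # 0.0 -- the strings are not identical
print(unigram_overlap(prediction, reference))  # 0.875 -- most words overlap
A hand-rolled sketch of traditional, reference-based checks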

Non-traditional methods#

These methods leverage the semantic understanding and representational capabilities of LLMs to evaluate text. Instead of focusing solely on surface-level word arrangement, they assess the deeper meaning, coherence, and contextual relevance of the text. Non-traditional metrics can function both with a reference text (to gauge how well the meaning aligns) and without one (to assess intrinsic quality).

In Ragas, these methods are referred to as LLM-based methods.
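
As a rough illustration of the idea (this is not the prompt or scoring logic Ragas uses internally), we could ask an LLM to judge relevance directly. Ragas wraps this kind of judgment in well-defined metrics, which we cover next.

import openai

openai.api_key = "REPLACE_WITH_YOUR_OpenAI_API_KEY"

def judge_relevance(question, answer):
    # Ask the model for a 0-1 relevance verdict; the prompt here is purely illustrative
    prompt = (
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        "On a scale from 0 (irrelevant) to 1 (fully relevant), how relevant is the "
        "answer to the question? Reply with a single number."
    )
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

print(judge_relevance(
    "What is the role of the retriever in RAG?",
    "The retriever in RAG selects relevant passages for the generator."
))
A simplified illustration of LLM-as-judge evaluation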

Ragas RAG evaluation metrics#

Now that we understand how LLM-based applications can be evaluated using traditional and non-traditional methods, let’s see how Ragas simplifies this process with its built-in metrics. 

The image below highlights the RAG evaluation metrics offered by Ragas. We will understand what each metric measures and how to calculate the score for each one:

Ragas RAG evaluation metrics

Context precision#

Context precision evaluates how good the system is at finding the right information. It measures what fraction of the text chunks it retrieves are actually relevant to the query.

Context precision is represented by Context Precision@K and is calculated by averaging the precision scores at the ranks of all relevant chunks within the top K retrieved text chunks:

\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left(\text{Precision@k} \times v_k\right)}{\text{Total number of relevant items in the top } K \text{ results}}

where $v_k = 1$ if the chunk at rank $k$ is relevant; otherwise, $v_k = 0$.

Rank refers to the position of a chunk in the list of retrieved text chunks, sorted by the retriever’s estimated relevance to the query. This differs from the actual relevance used for RAG evaluation, where $v_k = 1$ if the chunk at rank $k$ is truly relevant and $v_k = 0$ otherwise.

Precision@k measures the proportion of relevant chunks among the top $k$ retrieved chunks. It is calculated as:

\text{Precision@k} = \frac{\text{true positives@k}}{\text{true positives@k} + \text{false positives@k}}

Where:

  • true positives@k represents the number of relevant chunks among the top $k$ retrieved chunks.

  • false positives@k represents the number of non-relevant chunks among the top $k$ retrieved chunks.
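
To see the formula in action, here is a small hand-rolled calculation (not Ragas code). It assumes K = 5 retrieved chunks where only the chunk at rank 3 is relevant, which is exactly the situation in the sample we evaluate below.

def context_precision_at_K(relevance):
    # relevance[k-1] is v_k: 1 if the chunk at rank k is relevant, 0 otherwise
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    for k in range(1, len(relevance) + 1):
        precision_at_k = sum(relevance[:k]) / k      # relevant chunks among the top k
        score += precision_at_k * relevance[k - 1]   # only ranks holding a relevant chunk contribute
    return score / total_relevant

print(context_precision_at_K([0, 0, 1, 0, 0]))  # 0.3333... -- Precision@3 = 1/3 is the only contribution
A worked example of the context precision formula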

Wondering how we can determine if a retrieved chunk is actually relevant?
Ragas enables the evaluation of context relevance both with and without ground truth. It provides three methods to calculate context precision: two that are reference-based and one that does not require a reference.

Non-LLM-based context precision calculation: Using reference contexts aligned with the input query#

This method uses traditional metrics, such as string similarity, BLEU score, and string presence, as distance measures to assess how closely a retrieved context matches the reference contexts.

from ragas import SingleTurnSample
sample = SingleTurnSample(
retrieved_contexts=["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."],
reference_contexts=["The retriever fetches relevant documents based on a query.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer."]
)
from ragas.metrics import NonLLMContextPrecisionWithReference
context_precision = NonLLMContextPrecisionWithReference()
score = await context_precision.single_turn_ascore(sample)
print(f"Non-LLM-based context precision with reference: {score}")
  • We import the SingleTurnSample class from the Ragas library, which represents a single interaction typically consisting of a query, the retrieved contexts in response to it, and the reference contexts used for evaluation.

  • We create an evaluation sample that includes the retrieved contexts and reference contexts needed to calculate the context precision score. Depending on the requirements of a specific metric, the sample can also include other inputs, such as the user input, generated response, or reference response. We will see such samples as we progress.

  • We import NonLLMContextPrecisionWithReference, the specific metric from the Ragas library that we want to use, and create an instance of it. This instance will be used to compute the context precision score.

  • We call the single_turn_ascore method on the metric instance, passing the evaluation sample (sample) as input. This method computes the context precision score for the retrieved contexts using the provided reference contexts. The computation uses non-LLM-based distance measures like string similarity, BLEU, or exact match.

  • We print the computed precision score to show how well the retrieved contexts match the reference contexts.

Output:

Non-LLM-based context precision with reference: 0.3333333333
The output of non-LLM-based context precision calculation using a reference answer

Only the retrieved context at rank 3 matches one of the two reference contexts, while the remaining four are irrelevant. So Precision@3 = 1/3, and the other ranks contribute nothing because $v_k = 0$ for k = 1, 2, 4, 5. With a single relevant item in the top 5, it follows that Context Precision@5 = 1/3, which is what we see in the output above.

LLM-based context precision calculation using reference answer to the input query#

In this method, an LLM is used to determine the relevance of a retrieved context by comparing it with a reference answer. For LLM-based evaluation methods, we need to set up the LLM that Ragas will use as the evaluator.

from ragas import SingleTurnSample
# A sample from evaluation dataset
sample = SingleTurnSample(
user_input = "What is the role of the retriever in RAG?",
reference = "The role of the retriever in RAG is to fetch relevant passages based on a query for the generator to produce an answer.",
retrieved_contexts = ["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."],
)
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
# Set OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Initialize the OpenAI LLM and wrap it
llm = ChatOpenAI(model="gpt-4")
wrapped_llm = LangchainLLMWrapper(llm)
from ragas.metrics import LLMContextPrecisionWithReference
context_precision = LLMContextPrecisionWithReference(llm=wrapped_llm)
score = await context_precision.single_turn_ascore(sample)
print(f"LLM-based context precision with reference: {score}")
  • We define an evaluation sample using the SingleTurnSample class, which includes the user input, reference response, and retrieved contexts. This sample provides the data needed for evaluation.

  • We import the necessary libraries to set up the LLM provided by OpenAI. We configure the API key to authenticate our application with OpenAI services, enabling access to their LLM. Once configured, we initialize the OpenAI LLM, specifying the gpt-4 model. Finally, we wrap this LLM with the LangchainLLMWrapper to make it compatible with the Ragas framework.

  • We import the LLMContextPrecisionWithReference metric from Ragas, instantiate it with the wrapped LLM, and calculate the context precision score for the given sample. The resulting score measures how well the retrieved contexts align with the reference using LLM-based evaluation.

Output:

LLM-based context precision with reference: 0.3333333333
The output of LLM-based context precision calculation using a reference answer

The retrieved context at rank 3 closely matches the reference answer.

LLM-based context precision calculation without using reference answer to the input query#

In this method, an LLM is used to determine the relevance of a retrieved context by comparing it with the LLM’s response to the input query.

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference
# A sample from evaluation dataset
sample = SingleTurnSample(
user_input = "What is the role of the retriever in RAG?",
response = "The retriever in RAG selects relevant passages to provide contextually rich and accurate responses for the generator to produce an answer.",
retrieved_contexts = ["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."],
)
# Set your OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Initialize the OpenAI LLM and wrap it
llm = ChatOpenAI(model="gpt-4", temperature=0)
wrapped_llm = LangchainLLMWrapper(llm)
context_precision = LLMContextPrecisionWithoutReference(llm=wrapped_llm)
score = await context_precision.single_turn_ascore(sample)
print(f"LLM-based context precision without reference: {score}")

Output:

LLM-based context precision without reference: 0.3333333333
The output of LLM-based context precision calculation without using any reference

The retrieved context at rank 3 closely matches the LLM’s response to the query.

Context recall#

Context recall measures how many relevant text chunks are successfully retrieved. The relevance of retrieved chunks is determined by predefined reference contexts, which serve as the ground truth. The score ranges from 0 to 1, where 1 means that every reference context has at least one matching retrieved chunk.

A higher recall score indicates that fewer relevant items (as defined by the reference contexts) were missed during retrieval.

Non-LLM-based context recall calculation#

This method uses simple string comparison measures to check whether a reference context matches any of the retrieved text chunks. The context recall in this method is calculated as:

\text{Context Recall} = \frac{\text{Number of reference contexts with at least one matching retrieved chunk}}{\text{Total number of reference contexts}}

from ragas import SingleTurnSample
from ragas.metrics import NonLLMContextRecall
sample = SingleTurnSample(
retrieved_contexts=["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."],
reference_contexts=["The retriever fetches relevant documents based on a query.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer."]
)
context_recall = NonLLMContextRecall()
score = await context_recall.single_turn_ascore(sample)
print(f"Non-LLM-based context recall with reference contexts: {score}")

Output:

Non-LLM-based context recall with reference contexts: 0.5
The output of non-LLM-based context recall calculation with reference contexts

Out of the two reference contexts, one has a matching retrieved context.
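
As a rough sketch of what happens under the hood (Ragas actually relies on string-distance measures rather than the exact matching used here), we can count how many reference contexts have a matching retrieved chunk:

retrieved = sample.retrieved_contexts
references = sample.reference_contexts

# A reference context counts as recalled if some retrieved chunk matches it exactly
matched = sum(
    any(ref.strip() == chunk.strip() for chunk in retrieved)
    for ref in references
)
print(matched / len(references))  # 0.5 -- one of the two reference contexts was retrieved
A simplified sketch of the non-LLM context recall arithmetic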

Challenge: Annotating reference contexts can be a time-consuming task.
Ragas simplifies this by using a reference answer as a proxy for multiple reference contexts! To determine the context recall, the reference answer is split into claims. Claims are individual assertions that are self-contained and convey a fact or proposition that can be verified or attributed to a particular source or context. For each claim, we determine whether it is supported by the retrieved contexts. Ideally, every claim in the reference answer should be supported by the retrieved context.

LLM-based context recall calculation#

In this method, the relevance between a claim and the retrieved context is determined using an LLM. The context recall in this method is calculated as:

\text{Context Recall} = \frac{\text{Number of claims in the reference answer supported by the retrieved context}}{\text{Total number of claims in the reference answer}}

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import LLMContextRecall
# Set your OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Initialize the OpenAI LLM and wrap it
llm = ChatOpenAI(model="gpt-4", temperature=0)
wrapped_llm = LangchainLLMWrapper(llm)
sample = SingleTurnSample(
user_input="What is the role of the retriever in RAG?",
reference="The role of the retriever in RAG is to fetch relevant passages based on a query for the generator to produce an answer.",
response="The retriever in RAG selects relevant passages to provide contextually rich and accurate responses for the generator to produce an answer.",
retrieved_contexts=["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."],
)
context_recall = LLMContextRecall(llm=wrapped_llm)
score = await context_recall.single_turn_ascore(sample)
print(f"LLM-based context recall with reference answer: {score}")

Output:

LLM-based context recall with reference answer: 1.0
The output of LLM-based context recall calculation with the reference answer

All possible claims in the reference answer can be attributed to the third retrieved context.
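
The arithmetic behind this score looks roughly like the sketch below. The claim split and the support verdicts shown here are hypothetical; in Ragas, both are produced by the evaluator LLM.

# Hypothetical decomposition of the reference answer into claims
claims = [
    "The retriever in RAG fetches relevant passages based on a query.",
    "The generator uses the retrieved passages to produce an answer.",
]
supported = [True, True]  # both claims are covered by the third retrieved context
print(sum(supported) / len(claims))  # 1.0
A sketch of the claim-based context recall arithmetic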

Context entities recall#

Context entities recall is a metric that evaluates the proportion of entities from the reference answer that are successfully retrieved in the context.

Entities are distinct, meaningful units of information in a text. They typically refer to people, places, organizations, concepts, or objects. For example, in the sentence “Marie Curie won the Nobel Prize for her work on radioactivity,” the extracted entities could be Marie Curie (Person), Nobel Prize (Award), and Radioactivity (Scientific Concept).

The context entity recall metric uses the wrapped LLM to extract entities from both the reference and the retrieved contexts during evaluation, enabling a structured comparison.

Following is the formula to compute the context entities recall metric:

\text{Context Entities Recall} = \frac{|\text{Entities in reference} \cap \text{Entities in retrieved contexts}|}{|\text{Entities in reference}|}
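
As a quick set-based illustration of the formula, reusing the Marie Curie example above with assumed, hand-picked entity sets rather than LLM-extracted ones:

reference_entities = {"Marie Curie", "Nobel Prize", "Radioactivity"}
context_entities = {"Marie Curie", "Nobel Prize", "Physics"}  # assumed extraction from a retrieved context

recall = len(reference_entities & context_entities) / len(reference_entities)
print(recall)  # 0.666... -- two of the three reference entities appear in the context

Now let’s compute the metric with Ragas for two samples: our running example, and a second sample where the retrieved contexts closely match the reference answer.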

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas import SingleTurnSample
from ragas.metrics import ContextEntityRecall
# Set your OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Initialize the OpenAI LLM and wrap it
llm = ChatOpenAI(model="gpt-4", temperature=0)
wrapped_llm = LangchainLLMWrapper(llm)
# Create a sample for evaluation
sample_1 = SingleTurnSample(
reference="The role of the retriever in RAG is to fetch relevant passages based on a query for the generator to produce an answer.",
retrieved_contexts=["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."],
)
# Create another sample where the retrieved contexts are the same as the reference contexts; expecting a higher context entities recall score
sample_2 = SingleTurnSample(
reference="The role of the retriever in RAG is to fetch relevant passages based on a query for the generator to produce an answer.",
retrieved_contexts=["The retriever fetches relevant documents based on a query.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer."],
)
# Initialize the ContextEntityRecall scorer with the wrapped LLM
scorer = ContextEntityRecall(llm=wrapped_llm)
# Calculate the score
score_1 = await scorer.single_turn_ascore(sample_1)
print(f"LLM-based context entities recall with reference answer: {score_1}")
score_2 = await scorer.single_turn_ascore(sample_2)
print(f"LLM-based context entities recall assuming rerieved contexts are same as the reference contexts: {score_2}")

Output:

LLM-based context entities recall with reference answer: 0.33333333277777777
LLM-based context entities recall assuming retrieved contexts are the same as the reference contexts: 0.9999999983333333
The output of LLM-based context entities recall calculation with reference answer

For sample 1, the result shows that one-third of the entities in the reference answer match those in the retrieved contexts. While it might seem that most of the entities from the reference answer appear in the retrieved contexts, the outcome depends entirely on the entity extraction performed behind the scenes.

For sample 2, the result shows that most of the entities in the reference answer match the entities in the retrieved contexts.

So far, we have examined the evaluation metrics for the context retriever. Now, we will examine the evaluation metrics for the response generator, which is essentially an LLM.

Noise sensitivity#

Noise sensitivity is a metric used to evaluate how often a system produces incorrect responses when utilizing either relevant or irrelevant retrieved contexts. It is computed based on the user input, reference answer, generated response, and retrieved contexts.

To calculate noise sensitivity, the generated response is broken down into individual claims, and each claim is evaluated for correctness. Claims (response statements) that are not supported by the reference answer are classified as incorrect claims.

Noise sensitivity is calculated under the assumption that the retrieved contexts are relevant. In such cases, any incorrect claims in the response are attributed to the generator rather than the retriever. The formula for calculating noise sensitivity when the retrieved contexts are relevant is as follows:

\text{Noise Sensitivity (relevant)} = \frac{\text{Number of incorrect claims in the response}}{\text{Total number of claims in the response}}

Noise sensitivity scores range from 0 to 1. A lower noise sensitivity value indicates better performance.
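
The underlying arithmetic is a simple ratio, sketched below with a hypothetical claim split and correctness verdicts (in Ragas, the evaluator LLM produces both). After that, we compute the metric with Ragas.

# Hypothetical claims extracted from the generated response
response_claims = [
    "The retriever in RAG selects relevant passages.",
    "The passages help the generator produce an answer.",
]
incorrect = [False, False]  # neither claim contradicts the reference answer
print(sum(incorrect) / len(response_claims))  # 0.0 -- no incorrect claims, so no noise sensitivity
A sketch of the noise sensitivity arithmetic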

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas import SingleTurnSample
from ragas.metrics import NoiseSensitivity
# Set your OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Initialize the OpenAI LLM and wrap it
llm = ChatOpenAI(model="gpt-4", temperature=0)
wrapped_llm = LangchainLLMWrapper(llm)
# Creating the evaluation sample from the original example
sample = SingleTurnSample(
user_input = "What is the role of the retriever in RAG?",
reference ="The role of the retriever in RAG is to fetch relevant passages based on a query for the generator to produce an answer.",
response = "The retriever in RAG selects relevant passages to provide contextually rich and accurate responses for the generator to produce an answer.",
retrieved_contexts=["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."],
)
scorer = NoiseSensitivity(llm=wrapped_llm)
score= await scorer.single_turn_ascore(sample)
print(f"Noise sensitivity: {score}")

Output:

Noise sensitivity: 0.0
The output of LLM-based noise sensitivity score calculation with the reference answer

A noise sensitivity value of zero indicates that the system has generated a correct response.

Response relevancy#

Response relevancy measures how well the generated response aligns with the given prompt. Responses that are incomplete or include unnecessary/redundant details receive lower scores, while more relevant and precise answers achieve higher scores.

This metric is determined by generating a set of artificial questions from the response (essentially reverse-engineering it) and calculating the average cosine similarity between these questions and the original user input. A higher average similarity indicates better relevancy. Following is the formula to calculate answer relevancy:

\text{Answer Relevancy} = \frac{1}{N}\sum_{i=1}^{N} \cos\left(E_{g_i}, E_o\right)

Where:

  • $E_{g_i}$ is the embedding of the generated question $i$.

  • $E_o$ is the embedding of the original question.

  • $N$ is the number of generated questions, which is 3 by default.
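
Here is a toy numeric sketch of that average (not Ragas code), using made-up 3-dimensional vectors in place of the real embeddings, which have on the order of a thousand dimensions. We then compute the metric with Ragas.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

E_o = np.array([0.9, 0.1, 0.3])      # embedding of the original question
E_g = [np.array([0.8, 0.2, 0.3]),    # embeddings of the artificial questions
       np.array([0.9, 0.0, 0.4]),
       np.array([0.7, 0.3, 0.2])]

relevancy = sum(cosine_similarity(E_o, e) for e in E_g) / len(E_g)
print(relevancy)  # close to 1.0 because the toy "generated questions" resemble the original
A toy sketch of the response relevancy calculation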

from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas import SingleTurnSample
from ragas.metrics import ResponseRelevancy
# Set your OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Initialize the OpenAI LLM and wrap it
llm = ChatOpenAI(model="gpt-4", temperature=0)
wrapped_llm = LangchainLLMWrapper(llm)
# Initialize the embeddings model
embeddings = OpenAIEmbeddings() # Uses OpenAI embeddings by default
sample = SingleTurnSample(
user_input="What is the role of the retriever in RAG?",
response="The retriever in RAG selects relevant passages to provide contextually rich and accurate responses for the generator to produce an answer.",
retrieved_contexts=["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."]
)
scorer = ResponseRelevancy(llm=wrapped_llm, embeddings=embeddings)
score = await scorer.single_turn_ascore(sample)
print(f"Respone relevancy: {score}")
  • We import OpenAI’s embedding model alongside the chat model.

  • We initialize the embedding model, which generates vector representations of text.

  • We pass the embedding model to the response relevancy scorer, which uses it for the cosine similarity calculation.

Output:

Response relevancy: 0.9792333695213831
The output of response relevancy calculation

A response relevancy score of 0.979 indicates that the response is relevant to the given prompt.

Faithfulness#

Faithfulness is a metric that evaluates how consistent the generated answer is with the given context, in terms of factual accuracy. The score ranges from 0 to 1, where higher values indicate better faithfulness.

A generated answer is considered faithful if every claim it makes can be reliably inferred from the provided context. To compute this, the claims in the generated answer are first identified. Then, each claim is cross-referenced with the context to check whether it can be supported or inferred from it. The faithfulness score is calculated as follows:

\text{Faithfulness} = \frac{\text{Number of claims in the response supported by the retrieved context}}{\text{Total number of claims in the response}}
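
The score is again a simple ratio over claims. Below is a minimal sketch of the arithmetic with placeholder counts that mirror the result we get from Ragas further down; the actual claim extraction and verification are done by the evaluator LLM.

supported_claims = 1   # claims in the response the evaluator judged inferable from the context (placeholder)
total_claims = 3       # total claims the evaluator extracted from the response (placeholder)
print(supported_claims / total_claims)  # 0.333...
A sketch of the faithfulness arithmetic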

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas import SingleTurnSample
from ragas.metrics import Faithfulness
# Set your OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "REPLACE_WITH_YOUR_OpenAI_API_KEY"
# Initialize the OpenAI LLM and wrap it
llm = ChatOpenAI(model="gpt-4", temperature=0)
wrapped_llm = LangchainLLMWrapper(llm)
sample = SingleTurnSample(
user_input="What is the role of the retriever in RAG?",
response="The retriever in RAG selects relevant passages to provide contextually rich and accurate responses for the generator to produce an answer.",
retrieved_contexts=["The retriever component of RAG can be based on various architectures like BM25 or dense retrieval models.",
"One of the key benefits of RAG is its ability to provide contextually rich and accurate responses.",
"The retriever in RAG selects relevant passages, which are then used by the generator to produce an answer.",
"Fine-tuning RAG involves training both the retriever and generator components.",
"RAG stands for Retrieval-Augmented Generation, a method in natural language processing."]
)
scorer = Faithfulness(llm=wrapped_llm)
score = await scorer.single_turn_ascore(sample)
print(f"Faithfulness: {score}")

Output:

Faithfulness: 0.3333333333333333
The output of the faithfulness score calculation

The result indicates that one-third of the claims can be inferred from the retrieved contexts. Since the Ragas code is open-source, we can debug by printing each step of the calculation. This allows us to understand how the final score is derived.

Conclusion#

Evaluating a retrieval-augmented generation (RAG) app with Ragas’ metrics ensures that we optimize each component for better performance. By leveraging ChromaDB for efficient document retrieval and OpenAI’s GPT-4o mini for answer generation, we built a system capable of answering questions with contextual accuracy. Through this process, we explored key RAG evaluation metrics—such as context precision, recall, noise sensitivity, response relevancy, and faithfulness—that help assess retrieval accuracy, minimize errors, and enhance response quality.

With Ragas, developers can iteratively fine-tune their RAG applications to reduce errors and improve reliability. Consider integrating it into your CI/CD pipeline to monitor performance changes over time and continuously refine your system for better user experience.


Frequently Asked Questions

How does Ragas fit into a CI/CD pipeline?

Ragas metrics can act as automated checks in your CI/CD pipeline. For example:

  • Baseline creation: Before introducing changes, you run Ragas to establish baseline scores (e.g., precision: 0.75, recall: 0.80, faithfulness: 0.90).
  • Testing changes: When you modify your system (e.g., fine-tune the retriever, switch vector databases, or adjust prompt engineering), you run Ragas again to compare the new performance with the baseline. If scores drop (e.g., precision drops from 0.75 to 0.60), the pipeline flags it as a potential regression.
  • Automated alerts: Ragas can help ensure you don’t deploy updates that inadvertently degrade performance.

Example workflow in a CI/CD pipeline

  • Add Ragas tests to your pipeline: Run Ragas evaluation scripts as part of the “Test” phase in your pipeline. Compare current metrics to the previous baseline.
  • Pass/fail criteria: If metrics like context precision or faithfulness fall below a defined threshold (e.g., precision < 0.70), fail the pipeline to prevent deployment (see the sketch after this list).
  • Output results: Use logs to monitor metric trends (e.g., “Precision increased by 5% after retriever fine-tuning”). Visualize trends in dashboards using tools like Grafana or Datadog.
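
A minimal sketch of such a gate is shown below. It assumes the evaluation step has already written the metric scores to a ragas_results.json file (a hypothetical artifact name), and the thresholds are examples.

import json
import sys

THRESHOLDS = {"context_precision": 0.70, "faithfulness": 0.85}  # example thresholds

with open("ragas_results.json") as f:  # hypothetical file produced by the evaluation step
    scores = json.load(f)

failures = {m: s for m, s in scores.items() if m in THRESHOLDS and s < THRESHOLDS[m]}
if failures:
    print(f"Ragas regression detected: {failures}")
    sys.exit(1)  # non-zero exit fails the CI job and blocks deployment
print("All Ragas metrics are above their thresholds.")
A sketch of a Ragas-based CI gate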



Written By:
Asmat Batool