RAG Evaluation
What You'll Build Today
You have spent the last few days building Retrieval Augmented Generation (RAG) systems. You know how to chop up documents, store them in vector databases, and make an LLM answer questions based on that data.
But here is the hard question: Is your RAG system actually good?
When you ask it a question, does it find the right document chunk? Does the answer actually come from that chunk, or is the AI hallucinating? Today, we are moving from "it feels right" to "here is the data."
You will build an Automated RAG Evaluation Suite.
Here is what you will learn:
* The Ragas Library: A specialized toolkit designed specifically to grade RAG applications.
* Retrieval Metrics: How to measure if your system found the correct document (Context Precision and Recall).
* Generation Metrics: How to measure if the AI's answer is factually supported by the text (Faithfulness) and if it actually answered the user's question (Answer Relevance).
* Synthetic Test Data: How to use an LLM to generate test questions for you, so you don't have to write them manually.
Let's turn your AI development into a science.
The Problem
Imagine you have built a RAG chatbot for a company's HR policy. You launch it, and your manager asks, "How accurate is it?"
You say, "Well, I asked it five questions and it looked pretty good."
Your manager says, "We have 500 pages of policy. What happens if we change the chunk size from 1000 to 500? Does it get better or worse?"
To answer this, you would have to manually read the documents, ask a question, read the AI's answer, and grade it yourself. Then you would have to do that 50 or 100 times to get a statistically significant result.
Here is what that manual workflow looks like in code. It is painful.
# The "Painful" Manual Evaluation
questions = [
"What is the vacation policy?",
"How do I claim expenses?",
"What are the core working hours?"
]
# Imagine this is your RAG system
def simple_rag(question):
return "Here is a generated answer..."
results = []
print("Starting Manual Evaluation... get some coffee, this will take forever.")
for q in questions:
# 1. Run the RAG
answer = simple_rag(q)
# 2. Show it to the human (YOU)
print(f"\nQuestion: {q}")
print(f"AI Answer: {answer}")
# 3. Ask human to grade it
is_correct = input("Is this accurate? (y/n): ")
results.append(1 if is_correct.lower() == 'y' else 0)
accuracy = sum(results) / len(results)
print(f"\nFinal Accuracy: {accuracy * 100}%")
Why this fails:
* It is slow: a human has to read every answer against the source documents.
* It is subjective: "looks pretty good" is not a reproducible score.
* It does not scale: you cannot re-run it by hand every time you change a chunk size or a prompt.
There has to be a way to automate this. We need an "LLM-as-a-Judge": using a smart AI (like GPT-4) to grade the output of our RAG system against specific, well-defined criteria.
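Before reaching for a dedicated library, here is a minimal sketch of that idea, just to make it concrete. The prompt wording and the 0-to-10 scale are placeholders (Ragas does this far more rigorously), and it assumes your OPENAI_API_KEY is already set in the environment:

# A toy "LLM-as-a-Judge": ask GPT-4 to score how well an answer is supported by a context.
# The prompt and scale here are illustrative placeholders, not a Ragas API.
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

judge = ChatOpenAI(model="gpt-4", temperature=0)

judge_prompt = ChatPromptTemplate.from_template(
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer: {answer}\n\n"
    "On a scale of 0 to 10, how well is the answer supported by the context? "
    "Reply with a single number."
)

judge_chain = judge_prompt | judge | StrOutputParser()

score = judge_chain.invoke({
    "context": "The Solar System formed 4.6 billion years ago.",
    "question": "How old is the solar system?",
    "answer": "It is about 4.6 billion years old.",
})
print(f"Judge score: {score}")

Ragas packages this pattern into standardized, well-tested metrics, which is what we will use for the rest of the lesson.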
Let's Build It
We will use a library called Ragas (Retrieval Augmented Generation Assessment). It provides standard metrics to score your system.
Step 1: Setup and Installation
We need ragas, langchain, and datasets (a library by HuggingFace to manage data).
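If you have not installed these yet, a pip command along these lines should do it (exact version pins may matter, since the Ragas API changes between releases; the code in this lesson follows the 0.1-era API):

pip install ragas langchain langchain-openai langchain-community chromadb datasets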
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from datasets import Dataset
# Set your API key here
os.environ["OPENAI_API_KEY"] = "sk-..."
# We will use GPT-4 for evaluation (the judge needs to be smart)
# and GPT-3.5 or GPT-4 for the RAG system itself.
llm = ChatOpenAI(model="gpt-3.5-turbo")
evaluator_llm = ChatOpenAI(model="gpt-4")
Step 2: Create a Mini-RAG System
To test a RAG system, we first need to build one. We will use a simple text about the Solar System as our knowledge base.
# 1. The Knowledge Base
text = """
The Solar System is the gravitationally bound system of the Sun and the objects that orbit it.
It formed 4.6 billion years ago from the gravitational collapse of a giant interstellar molecular cloud.
The vast majority of the system's mass is in the Sun, with the majority of the remaining mass contained in Jupiter.
The four inner system planets—Mercury, Venus, Earth, and Mars—are terrestrial planets, being composed primarily of rock and metal.
The four giant planets of the outer system are substantially more massive than the terrestrials.
The two largest, Jupiter and Saturn, are gas giants, being composed mainly of hydrogen and helium;
the two outermost planets, Uranus and Neptune, are ice giants, being composed mostly of substances with relatively high melting points compared with hydrogen and helium, called volatiles, such as water, ammonia, and methane.
"""
# 2. Chunking and Indexing
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
chunks = splitter.create_documents([text])
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings()
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# 3. The RAG Chain
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print("RAG System built successfully.")
Step 3: Preparing the Evaluation Data
Ragas requires four columns of data to perform a full evaluation:
* question: the user's query.
* answer: what your RAG system generated.
* contexts: the chunks the retriever pulled back.
* ground_truth: the "correct" reference answer.
Let's manually create a small test set.
# Our test questions and the "correct" answers we expect
test_questions = [
    "How old is the solar system?",
    "What are Uranus and Neptune composed of?",
    "What is the largest planet?"  # Note: the text only says Jupiter holds most of the remaining mass, not that it is the largest. Let's see how the model handles it.
]

# Plain strings: the ground_truth column in recent Ragas versions expects one string per row
ground_truths = [
    "It formed 4.6 billion years ago.",
    "They are ice giants composed mostly of water, ammonia, and methane.",
    "Jupiter."
]

data = {
    "question": [],
    "answer": [],
    "contexts": [],
    "ground_truth": []
}

print("Running inference on test set...")

for q, gt in zip(test_questions, ground_truths):
    # 1. Run the RAG to get the answer
    generated_answer = rag_chain.invoke(q)

    # 2. Retrieve the contexts manually so we can log them
    # (The chain does this internally, but we need to capture them for the evaluator)
    docs = retriever.get_relevant_documents(q)
    retrieved_contexts = [doc.page_content for doc in docs]

    # 3. Store data
    data["question"].append(q)
    data["answer"].append(generated_answer)
    data["contexts"].append(retrieved_contexts)
    data["ground_truth"].append(gt)
# Convert to a HuggingFace Dataset object (required by Ragas)
dataset = Dataset.from_dict(data)
print("Data preparation complete. Here is a sample:")
print(dataset[0])
Step 4: Measuring "Faithfulness" and "Answer Relevance"
Now we import the metrics.
* Faithfulness: Does the answer stick to the context? (Score 0 to 1). If the context says "The sky is blue" and the AI says "The sky is green," faithfulness is 0.
* Answer Relevance: Did the AI answer the actual question? If I ask "What is your name?" and the AI says "I like pizza," relevance is 0. (In the library, this metric is called answer_relevancy.)
* Context Precision: Did the retriever find the relevant chunks and rank them near the top of the results?
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
)

# We list the metrics we want to test
metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
]

print("Starting evaluation (this calls the OpenAI API multiple times)...")

# By default Ragas uses its own OpenAI defaults as the judge; recent versions
# also let you pass llm=evaluator_llm here if you want to control which model
# does the grading.
results = evaluate(
    dataset=dataset,
    metrics=metrics,
)
print("\n=== Evaluation Results ===")
print(results)
Step 5: Analyzing the Output
The results object gives you an average score, but we often want to see the breakdown per question to debug specific failures. We can convert the results to a Pandas DataFrame.
import pandas as pd

# Convert results to a readable table
df = results.to_pandas()

# Display specific columns
pd.set_option('display.max_colwidth', 50)  # Don't truncate text too much
display_cols = ['question', 'answer', 'faithfulness', 'answer_relevancy', 'context_precision']
print(df[display_cols])
What to look for:
* Low Faithfulness: Your LLM is hallucinating info not in the text.
* Low Context Precision: Your retriever (vector database) is failing to find the right chunk. You might need to change your chunk size or k value.
* Low Answer Relevance: Your prompt template might be confusing the LLM.
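To zero in on the failures quickly, you can filter the DataFrame for low-scoring rows. A small sketch, using an arbitrary 0.7 threshold:

# Flag rows where any metric falls below an (arbitrary) 0.7 threshold
low_scores = df[
    (df["faithfulness"] < 0.7)
    | (df["answer_relevancy"] < 0.7)
    | (df["context_precision"] < 0.7)
]
print(f"{len(low_scores)} of {len(df)} questions need a closer look")
print(low_scores[display_cols])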
Step 6: Generating Synthetic Test Data
The hardest part of evaluation is coming up with the ground_truth questions and answers. Ragas can actually generate these for you by reading your documents!
This uses the "TestsetGenerator".
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# 1. Initialize the generator with default OpenAI generator and critic models
generator = TestsetGenerator.with_openai()

# 2. Generate questions based on our documents
# We will generate 3 simple questions for this demo
# (swap reasoning or multi_context into the distributions for harder questions)
testset = generator.generate_with_langchain_docs(
    chunks,
    test_size=3,
    distributions={simple: 1.0}  # 100% simple questions
)

synthetic_df = testset.to_pandas()
print("\n=== Synthetic Test Data Generated ===")
print(synthetic_df[['question', 'ground_truth']].head())
Now you have a dataset automatically created from your own documents that you can use to test your system every time you make a change.
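You can plug the synthetic set straight back into the evaluation loop from Steps 3 and 4. A sketch, assuming the generated DataFrame exposes question and ground_truth columns as in this version of Ragas:

# Run our RAG system over the synthetic questions and evaluate the results
synthetic_data = {"question": [], "answer": [], "contexts": [], "ground_truth": []}

for _, row in synthetic_df.iterrows():
    q = row["question"]
    docs = retriever.get_relevant_documents(q)
    synthetic_data["question"].append(q)
    synthetic_data["answer"].append(rag_chain.invoke(q))
    synthetic_data["contexts"].append([doc.page_content for doc in docs])
    synthetic_data["ground_truth"].append(row["ground_truth"])

synthetic_results = evaluate(
    dataset=Dataset.from_dict(synthetic_data),
    metrics=metrics,
)
print(synthetic_results)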
Now You Try
You have a working evaluation pipeline. Now, let's use it to run some experiments.
1. Go back to Step 2 and change the retriever's search kwargs to k=1, so the system retrieves only ONE chunk. Run the evaluation again. Does context_precision or faithfulness drop? (It should, because the system may miss necessary context.)
2. Add a manual entry to your data dictionary in Step 3:
* Question: "What is the capital of Mars?"
* Answer: "The capital of Mars is Elon City." (Intentionally fake.)
* Contexts: [The solar system text...]
* Ground Truth: "Mars has no capital."
Run the evaluation. Look at the faithfulness score for that specific row. It should be very low.
3. Take the pandas DataFrame from Step 5 and save it to a CSV file named rag_report_card.csv. This is the file you would hypothetically email to your manager to prove the system works.
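For exercise 3, the save itself is a one-liner (using the df from Step 5):

# Write the per-question report card to disk
df.to_csv("rag_report_card.csv", index=False)
print("Saved rag_report_card.csv")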
Challenge Project
The "Before & After" ExperimentYour challenge is to prove that changing a setting improves your RAG system.
Requirements:k=1.k=3.TestsetGenerator.evaluate() on Configuration A.evaluate() on Configuration B.Running Config A (Bad)...
Average Faithfulness: 0.65
Average Context Precision: 0.40
Running Config B (Good)...
Average Faithfulness: 0.92
Average Context Precision: 0.88
Conclusion: Configuration B improved Context Precision by 120%.
Hint: You will need to create two different vectorstore objects (one for each chunk size) to run this comparison properly.
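Here is one possible skeleton for the experiment. It is a sketch, not a reference solution: the helpers build_retriever and evaluate_config are made up for illustration, and it assumes the objects from earlier steps (text, prompt, llm, metrics, test_questions, ground_truths) are still in scope.

# Sketch: compare two RAG configurations on the same test set
def build_retriever(chunk_size, k):
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=50)
    docs = splitter.create_documents([text])
    store = Chroma.from_documents(
        documents=docs,
        embedding=OpenAIEmbeddings(),
        collection_name=f"config_{chunk_size}_{k}",  # keep the configs in separate collections
    )
    return store.as_retriever(search_kwargs={"k": k})

def evaluate_config(retriever):
    chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    data = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for q, gt in zip(test_questions, ground_truths):
        docs = retriever.get_relevant_documents(q)
        data["question"].append(q)
        data["answer"].append(chain.invoke(q))
        data["contexts"].append([d.page_content for d in docs])
        data["ground_truth"].append(gt)
    return evaluate(dataset=Dataset.from_dict(data), metrics=metrics)

# Config A: tiny chunks, only one retrieved -> likely to miss context
config_a = build_retriever(chunk_size=100, k=1)
# Config B: larger chunks, three retrieved -> more context for the LLM
config_b = build_retriever(chunk_size=300, k=3)

print("Running Config A (Bad)...")
print(evaluate_config(config_a))
print("\nRunning Config B (Good)...")
print(evaluate_config(config_b))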
What You Learned
Today you moved from "vibes-based" development to "metrics-based" engineering.
* Faithfulness tells you if your AI is lying.
* Answer Relevance tells you if your AI is helpful.
* Context Precision tells you if your database search is working.
* Synthetic Data saves you from writing hundreds of test questions manually.
Why This Matters: In a production environment, you cannot verify every answer an AI gives. You need a suite of automated tests that runs every night. If a score drops, an alarm should go off. This is how you build trust in AI systems.
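As a taste of what that looks like, here is a minimal sketch of such a check. The 0.75 threshold and the exception are placeholders; results is the Ragas result object from Step 4:

# Minimal regression gate: fail loudly if any average metric drops below a threshold
THRESHOLD = 0.75  # placeholder; calibrate against your own baseline runs

avg_scores = results.to_pandas()[["faithfulness", "answer_relevancy", "context_precision"]].mean()

for metric_name, score in avg_scores.items():
    if score < THRESHOLD:
        # In production this might page someone or fail a nightly CI job
        raise ValueError(f"RAG regression: {metric_name} dropped to {score:.2f}")

print("All RAG metrics above threshold.")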
Tomorrow: We tackle one of the most requested features: Conversational Memory. We will teach your RAG system to remember what you said five minutes ago so you can ask follow-up questions.