Day 46 of 80

RAG Evaluation

Phase 5: RAG Systems


What You'll Build Today

You have spent the last few days building Retrieval Augmented Generation (RAG) systems. You know how to chop up documents, store them in vector databases, and make an LLM answer questions based on that data.

But here is the hard question: Is your RAG system actually good?

When you ask it a question, does it find the right document chunk? Does the answer actually come from that chunk, or is the AI hallucinating? Today, we are moving from "it feels right" to "here is the data."

You will build an Automated RAG Evaluation Suite.

Here is what you will learn:

* The Ragas Library: A specialized toolkit designed specifically to grade RAG applications.

* Retrieval Metrics: How to measure if your system found the correct document (Context Precision and Recall).

* Generation Metrics: How to measure if the AI's answer is factually supported by the text (Faithfulness) and if it actually answered the user's question (Answer Relevance).

* Synthetic Test Data: How to use an LLM to generate test questions for you, so you don't have to write them manually.

Let's turn your AI development into a science.

The Problem

Imagine you have built a RAG chatbot for a company's HR policy. You launch it, and your manager asks, "How accurate is it?"

You say, "Well, I asked it five questions and it looked pretty good."

Your manager says, "We have 500 pages of policy. What happens if we change the chunk size from 1000 to 500? Does it get better or worse?"

To answer this, you would have to manually read the documents, ask a question, read the AI's answer, and grade it yourself. Then you would have to repeat that 50 or 100 times to get a result you could actually trust.

Here is what that manual workflow looks like in code. It is painful.

# The "Painful" Manual Evaluation

questions = [

"What is the vacation policy?",

"How do I claim expenses?",

"What are the core working hours?"

]

# Imagine this is your RAG system

def simple_rag(question):

return "Here is a generated answer..."

results = []

print("Starting Manual Evaluation... get some coffee, this will take forever.")

for q in questions:

# 1. Run the RAG

answer = simple_rag(q)

# 2. Show it to the human (YOU)

print(f"\nQuestion: {q}")

print(f"AI Answer: {answer}")

# 3. Ask human to grade it

is_correct = input("Is this accurate? (y/n): ")

results.append(1 if is_correct.lower() == 'y' else 0)

accuracy = sum(results) / len(results)

print(f"\nFinal Accuracy: {accuracy * 100}%")

Why this fails:
  • It doesn't scale. You cannot do this for 100 questions every time you update your code.
  • It's subjective. You might grade leniently on Monday and strictly on Friday.
  • It's incomplete. "Accurate" is vague. Did it find the right document but give a bad summary? Or did it find the wrong document entirely? A simple "y/n" doesn't tell you what broke.

There has to be a better way. The fix is to automate the grading with an "LLM-as-a-Judge": use a strong model (like GPT-4) to grade the output of your RAG system against specific, well-defined criteria.
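
To make the idea concrete before we bring in a library, here is a minimal sketch of the judge pattern. The prompt wording and the judge_faithfulness helper are illustrative choices of mine, not Ragas code; Ragas does the same job far more rigorously, claim by claim, in the steps below.

# A minimal "LLM-as-a-Judge" sketch (illustrative only)
from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model="gpt-4", temperature=0)

def judge_faithfulness(question: str, context: str, answer: str) -> str:
    """Ask a strong model whether the answer is supported by the context."""
    prompt = (
        "You are grading a RAG system.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        "Is every claim in the answer supported by the context? "
        "Reply with a single word: YES or NO."
    )
    return judge.invoke(prompt).content.strip()

# Example usage (requires OPENAI_API_KEY to be set):
# print(judge_faithfulness("What is the vacation policy?",
#                          "Employees receive 25 days of paid vacation per year.",
#                          "You get 25 days of paid vacation."))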

Let's Build It

We will use a library called Ragas (Retrieval Augmented Generation Assessment). It provides standard metrics to score your system.

Step 1: Setup and Installation

We need ragas, langchain (plus langchain-openai, langchain-community, and chromadb for our mini-RAG system), and datasets (a Hugging Face library for managing evaluation data).

Note: You will need your OpenAI API key set in your environment variables.
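
If you have not installed the libraries yet, an install command along these lines should work (exact package set assumed for recent versions; the Ragas API used in this lesson matches the 0.1.x releases):

pip install ragas langchain langchain-openai langchain-community chromadb datasets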

import os

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from datasets import Dataset

# Set your API key here
os.environ["OPENAI_API_KEY"] = "sk-..."

# We will use GPT-3.5 (or GPT-4) for the RAG system itself,
# and GPT-4 as the evaluation judge (the judge needs to be smart).
# evaluator_llm is handed to Ragas in Step 4.
llm = ChatOpenAI(model="gpt-3.5-turbo")
evaluator_llm = ChatOpenAI(model="gpt-4")

Step 2: Create a Mini-RAG System

To test a RAG system, we first need to build one. We will use a simple text about the Solar System as our knowledge base.

# 1. The Knowledge Base
text = """
The Solar System is the gravitationally bound system of the Sun and the objects that orbit it.
It formed 4.6 billion years ago from the gravitational collapse of a giant interstellar molecular cloud.
The vast majority of the system's mass is in the Sun, with the majority of the remaining mass contained in Jupiter.
The four inner system planets—Mercury, Venus, Earth, and Mars—are terrestrial planets, being composed primarily of rock and metal.
The four giant planets of the outer system are substantially more massive than the terrestrials.
The two largest, Jupiter and Saturn, are gas giants, being composed mainly of hydrogen and helium;
the two outermost planets, Uranus and Neptune, are ice giants, being composed mostly of substances with relatively high melting points compared with hydrogen and helium, called volatiles, such as water, ammonia, and methane.
"""

# 2. Chunking and Indexing
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
chunks = splitter.create_documents([text])

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings()
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# 3. The RAG Chain
template = """Answer the question based only on the following context:

{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print("RAG System built successfully.")

Step 3: Preparing the Evaluation Data

Ragas requires four columns of data to perform a full evaluation:

  • question: What the user asked.
  • answer: What your RAG system generated.
  • contexts: The actual text chunks your retriever found.
  • ground_truth: The "correct" answer (the teacher's answer key).

Let's manually create a small test set.
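
For reference, one finished row should look roughly like this. The values are illustrative; note that contexts is a list of strings while ground_truth is a single string:

# Illustrative shape of a single evaluation row (values are made up)
example_row = {
    "question": "How old is the solar system?",
    "answer": "The Solar System formed about 4.6 billion years ago.",
    "contexts": ["It formed 4.6 billion years ago from the gravitational collapse..."],
    "ground_truth": "It formed 4.6 billion years ago.",
}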

# Our test questions and the "correct" answers we expect.
# Note: recent Ragas versions expect ground_truth as a plain string per question.
test_questions = [
    "How old is the solar system?",
    "What are Uranus and Neptune composed of?",
    # The text says Jupiter holds most of the remaining mass, but never calls it
    # the largest planet. Let's see how the system handles this.
    "What is the largest planet?",
]

ground_truths = [
    "It formed 4.6 billion years ago.",
    "They are ice giants composed mostly of water, ammonia, and methane.",
    "Jupiter.",
]

data = {
    "question": [],
    "answer": [],
    "contexts": [],
    "ground_truth": []
}

print("Running inference on test set...")

for q, gt in zip(test_questions, ground_truths):
    # 1. Run the RAG to get the answer
    generated_answer = rag_chain.invoke(q)

    # 2. Retrieve the contexts manually so we can log them
    #    (the chain does this internally, but we need to capture them for the evaluator)
    docs = retriever.get_relevant_documents(q)
    retrieved_contexts = [doc.page_content for doc in docs]

    # 3. Store the row
    data["question"].append(q)
    data["answer"].append(generated_answer)
    data["contexts"].append(retrieved_contexts)
    data["ground_truth"].append(gt)

# Convert to a HuggingFace Dataset object (required by Ragas)
dataset = Dataset.from_dict(data)

print("Data preparation complete. Here is a sample:")
print(dataset[0])

Step 4: Measuring "Faithfulness" and "Answer Relevance"

Now we import the metrics.

* Faithfulness: Does the answer stick to the context? (Score 0 to 1). If the context says "The sky is blue" and the AI says "The sky is green," faithfulness is 0.

* Answer Relevance: Did the AI answer the actual question? If I ask "What is your name?" and the AI says "I like pizza," relevance is 0. (In Ragas the metric is named answer_relevancy.)

* Context Precision: Did the retriever rank the relevant chunks at the top of its results, or bury them under irrelevant ones?
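
To make faithfulness concrete: Ragas breaks the answer into individual claims, checks each claim against the retrieved context, and scores the fraction that is supported. Here is a toy version of that arithmetic with hand-written claims (Ragas uses an LLM to extract and verify them):

# Toy illustration of the faithfulness score: supported claims / total claims.
claims = [
    ("The Solar System formed 4.6 billion years ago.", True),   # supported by the context
    ("It contains exactly 200 moons.", False),                  # nowhere in the context
]
supported = sum(1 for _, is_supported in claims if is_supported)
faithfulness_score = supported / len(claims)
print(f"Faithfulness: {faithfulness_score:.2f}")  # 0.50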

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
)

# We list the metrics we want to test
metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
]

print("Starting evaluation (this calls the OpenAI API multiple times)...")

results = evaluate(
    dataset=dataset,
    metrics=metrics,
    llm=evaluator_llm,  # GPT-4 acts as the judge
)

print("\n=== Evaluation Results ===")
print(results)

Step 5: Analyzing the Output

The results object gives you an average score, but we often want to see the breakdown per question to debug specific failures. We can convert the results to a Pandas DataFrame.

import pandas as pd

# Convert results to a readable table
df = results.to_pandas()

# Display specific columns
pd.set_option('display.max_colwidth', 50)  # Don't truncate text too much
display_cols = ['question', 'answer', 'faithfulness', 'answer_relevancy', 'context_precision']
print(df[display_cols])

What to look for:

* Low Faithfulness: Your LLM is hallucinating info not in the text.

* Low Context Precision: Your retriever (vector database) is failing to find the right chunk. You might need to change your chunk size or k value.

* Low Answer Relevance: Your prompt template might be confusing the LLM.
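
A handy follow-up is to flag the failing rows automatically instead of eyeballing the table. A small sketch (the 0.7 threshold is an arbitrary choice):

# Flag rows whose faithfulness falls below a chosen threshold (0.7 is arbitrary)
THRESHOLD = 0.7
problem_rows = df[df['faithfulness'] < THRESHOLD]
print(f"{len(problem_rows)} of {len(df)} answers need review:")
print(problem_rows[['question', 'faithfulness']])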

Step 6: Generating Synthetic Test Data

The hardest part of evaluation is coming up with the ground_truth questions and answers. Ragas can actually generate these for you by reading your documents!

This uses the TestsetGenerator.

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# 1. Initialize the generator (uses default OpenAI models for generation and critique)
generator = TestsetGenerator.with_openai()

# 2. Generate questions based on our documents.
#    We will generate 3 simple questions for this demo.
testset = generator.generate_with_langchain_docs(
    chunks,
    test_size=3,
    distributions={simple: 1.0}  # 100% simple questions
)

synthetic_df = testset.to_pandas()

print("\n=== Synthetic Test Data Generated ===")
print(synthetic_df[['question', 'ground_truth']].head())

Now you have a dataset, automatically created from your own documents, that you can use to test your system every time you make a change.
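
To close the loop, you can push the synthetic questions through the same inference loop from Step 3 and evaluate against them. A sketch, assuming the synthetic_df columns printed above:

# Re-use the Step 3 pattern with the synthetic questions and ground truths
synthetic_data = {"question": [], "answer": [], "contexts": [], "ground_truth": []}

for q, gt in zip(synthetic_df["question"], synthetic_df["ground_truth"]):
    synthetic_data["question"].append(q)
    synthetic_data["answer"].append(rag_chain.invoke(q))
    synthetic_data["contexts"].append(
        [doc.page_content for doc in retriever.get_relevant_documents(q)]
    )
    synthetic_data["ground_truth"].append(gt)

synthetic_results = evaluate(
    dataset=Dataset.from_dict(synthetic_data),
    metrics=metrics,
    llm=evaluator_llm,
)
print(synthetic_results)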

Now You Try

You have a working evaluation pipeline. Now, let's use it to run some experiments.

  • Break the Retrieval: Go back to Step 2 and change the retriever search kwargs to k=1, so the system only retrieves ONE chunk. Run the evaluation again. Does context_precision or faithfulness drop? (It should, because the system may now be missing necessary context.)

  • Test Hallucination: Add a manual entry to your data dictionary in Step 3.

    * Question: "What is the capital of Mars?"

    * Answer: "The capital of Mars is Elon City." (Intentionally fake.)

    * Contexts: [The solar system text...]

    * Ground Truth: "Mars has no capital."

    Run the evaluation and look at the faithfulness score for that specific row. It should be very low.

  • Export Your Report: Take the pandas DataFrame from Step 5 and save it to a CSV file named rag_report_card.csv. This is the file you would hypothetically email to your manager to prove the system works.

Challenge Project

The "Before & After" Experiment

Your challenge is to prove that changing a setting improves your RAG system.

Requirements:
  • Use a new text source (e.g., copy-paste a few paragraphs from a Wikipedia article about "Photosynthesis" or "Machine Learning").
  • Configuration A (Bad): Set up a RAG system with a very small chunk size (e.g., 50 characters) and k=1.
  • Configuration B (Good): Set up a RAG system with a reasonable chunk size (e.g., 500 characters) and k=3.
  • Generate a synthetic test set of 5 questions using the Ragas TestsetGenerator.
  • Run evaluate() on Configuration A.
  • Run evaluate() on Configuration B.
  • Print a comparison of the average scores.

Example Output:

Running Config A (Bad)...
Average Faithfulness: 0.65
Average Context Precision: 0.40

Running Config B (Good)...
Average Faithfulness: 0.92
Average Context Precision: 0.88

Conclusion: Configuration B improved Context Precision by 120%.

Hint: You will need to create two different vectorstore objects (one for each chunk size) to run this comparison properly.
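
One way to keep the comparison clean is to wrap the setup in a helper so each configuration gets its own vectorstore and retriever. A sketch under my own naming (build_rag_chain and the collection names are hypothetical; it reuses prompt and llm from Steps 1 and 2):

# Hypothetical helper: build a fresh chain for a given chunk size and k
def build_rag_chain(source_text, chunk_size, k, collection_name):
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=20)
    docs = splitter.create_documents([source_text])
    store = Chroma.from_documents(
        documents=docs,
        embedding=OpenAIEmbeddings(),
        collection_name=collection_name,  # keep the two configs in separate collections
    )
    retriever = store.as_retriever(search_kwargs={"k": k})
    chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return chain, retriever, docs

# chain_a, retriever_a, docs_a = build_rag_chain(new_text, chunk_size=50, k=1, collection_name="config_a")
# chain_b, retriever_b, docs_b = build_rag_chain(new_text, chunk_size=500, k=3, collection_name="config_b")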

What You Learned

Today you moved from "vibes-based" development to "metrics-based" engineering.

* Faithfulness tells you if your AI is lying.

* Answer Relevance tells you if your AI is helpful.

* Context Precision tells you if your database search is working.

* Synthetic Data saves you from writing hundreds of test questions manually.

Why This Matters:

In a production environment, you cannot verify every answer an AI gives. You need a suite of automated tests that runs every night. If a score drops, an alarm should go off. This is how you build trust in AI systems.
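
As a concrete example of that kind of guardrail, a nightly job could simply assert that the averaged scores stay above a baseline. A minimal sketch (the thresholds are arbitrary, and it assumes results behaves like a dict of averaged metric scores, as printed in Step 4):

# Nightly guardrail sketch: fail loudly if average scores dip below a threshold
THRESHOLDS = {"faithfulness": 0.8, "answer_relevancy": 0.8, "context_precision": 0.7}

for metric_name, minimum in THRESHOLDS.items():
    score = results[metric_name]
    status = "OK" if score >= minimum else "ALERT"
    print(f"[{status}] {metric_name}: {score:.2f} (minimum {minimum})")
    assert score >= minimum, f"{metric_name} dropped below {minimum}!"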

Tomorrow: We tackle one of the most requested features: Conversational Memory. We will teach your RAG system to remember what you said five minutes ago so you can ask follow-up questions.