Day 48 of 80

Advanced RAG Patterns

Phase 5: RAG Systems

What You'll Build Today

Welcome to Day 48! We are deep in Phase 5, and you have already built a functional RAG (Retrieval Augmented Generation) system. You know how to chop text into chunks, store them in a vector database, and retrieve them to answer questions.

But here is the truth: Basic RAG works great for simple demos, but it often fails in the real world.

Today, we are moving from "it works" to "it works intelligently." We are going to build an Advanced RAG pipeline using Contextual Compression.

Here is what you will learn and why it matters:

* Contextual Compression:

Why: Standard RAG retrieves entire chunks of text, even if only one sentence is relevant. This confuses the AI and wastes money on tokens. Solution: We will build a system that reads the retrieved documents and "compresses" them, extracting only the exact sentences needed to answer the query before passing them to the final answer generator.

* Parent Document Retrieval:

Why: Small chunks are great for searching (high precision) but bad for context (low understanding). Large chunks are great for context but bad for searching. Solution: We will explore the concept of searching for small snippets but retrieving the larger "parent" document they belong to (see the sketch after this list).

* Multi-modal RAG:

Why: The world isn't just text. Sometimes you need to search for images based on their content. Solution: You will learn how to use vision models to describe images so they can be retrieved by text queries.
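
We will not build Parent Document Retrieval end to end today, but to make the idea concrete, here is a minimal sketch of how it can look in LangChain. It assumes an OpenAI API key is configured and uses Chroma (pip install chromadb) as the index for the small child chunks; class names may vary slightly between LangChain versions.

    from langchain.retrievers import ParentDocumentRetriever
    from langchain.storage import InMemoryStore
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.vectorstores import Chroma
    from langchain_core.documents import Document
    from langchain_openai import OpenAIEmbeddings

    # One long "parent" document (imagine a 50-page contract)
    parent_docs = [Document(page_content="... a long contract mentioning a termination fee of $5,000 ...")]

    # Small child chunks are what get embedded and searched (high precision)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)
    vectorstore = Chroma(collection_name="children", embedding_function=OpenAIEmbeddings())

    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,    # searches the small chunks
        docstore=InMemoryStore(),   # holds the full parent documents
        child_splitter=child_splitter,
    )
    retriever.add_documents(parent_docs)

    # The search matches a small chunk, but you get the whole parent back
    results = retriever.invoke("What is the termination fee?")
    print(results[0].page_content[:100])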

Let's upgrade your RAG system.

The Problem

Imagine you are building a legal assistant. You have a massive contract that is 50 pages long. You ask: "What is the termination fee?"

A basic RAG system splits that contract into 1000-character chunks. It searches for "termination fee." It might find three chunks.

  • Chunk A: "...discussed the termination of the project in the context of..." (Irrelevant discussion)
  • Chunk B: "...fee schedules are attached in Appendix B..." (Vague reference)
  • Chunk C: "...the termination fee shall be $5,000." (The answer)

The basic RAG system takes all three chunks (3,000 characters of text) and stuffs them into the prompt for the LLM.

The Pain:

  • Cost: You are paying to process 3,000 characters when you only needed about 50.
  • Distraction: The LLM reads Chunk A, may get confused by the irrelevant "discussion," and give a vague answer.
  • Context Limit: If you retrieve 10 documents, you might run out of space in the model's context window.

Here is code that demonstrates this inefficiency. We will create a document with a lot of noise and one hidden fact, then see how much "junk" a standard retriever pulls in.

    import os

    from langchain_community.vectorstores import FAISS
    from langchain_openai import OpenAIEmbeddings
    from langchain.text_splitter import CharacterTextSplitter
    from langchain_core.documents import Document

    # Setup API Key (Replace with your actual key)
    os.environ["OPENAI_API_KEY"] = "sk-..."

    # 1. Create a "Noisy" Document
    # This represents a long document where only one sentence matters.
    noise_text = "The weather in 2023 was unusually mild. " * 50
    important_text = "The secret code to the vault is 998877. "
    more_noise = "Banana bread recipes vary by region. " * 50
    full_text = noise_text + important_text + more_noise

    # 2. Split into chunks (Standard RAG approach)
    splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0, separator=". ")
    docs = [Document(page_content=x) for x in splitter.split_text(full_text)]

    # 3. Create Vector Store
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(docs, embeddings)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 3})  # Retrieve top 3 chunks

    # 4. The Painful Retrieval
    query = "What is the secret code?"
    retrieved_docs = retriever.invoke(query)

    print(f"--- Query: {query} ---")
    print(f"Number of docs retrieved: {len(retrieved_docs)}")
    print("\n--- Content Passed to LLM (The Problem) ---")
    for i, doc in enumerate(retrieved_docs):
        print(f"Chunk {i+1} Length: {len(doc.page_content)} characters")
        print(f"Chunk {i+1} Content Start: {doc.page_content[:50]}...")

    The Output Analysis:

    When you run this, you will see that the retriever pulls in the chunk with the secret code, but it also likely pulls in neighbors filled with "Banana bread" or "Weather" noise. If you were paying per token, you just paid for a lot of banana bread.
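
    To put a number on the waste, total up the characters the retriever just handed you. This small addition assumes the retrieved_docs list from the script above is still in memory.

    # How much text are we about to pay for, versus how much we actually needed?
    total_chars = sum(len(doc.page_content) for doc in retrieved_docs)
    needed_chars = len("The secret code to the vault is 998877.")

    print(f"Context sent to the LLM: {total_chars} characters")
    print(f"Characters actually needed: {needed_chars}")
    print(f"Noise: {total_chars - needed_chars} characters of weather and banana bread")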

    There has to be a way to filter the trash after retrieval but before generation.

    Let's Build It

    We are going to implement Contextual Compression.

    Think of this as hiring a smart intern. Instead of dumping 50 files on your desk (the LLM's desk), the intern (the Compressor) goes to the file cabinet (Vector Store), pulls the files, reads them, highlights only the relevant sentences, and puts a single sheet of paper on your desk.

    Step 1: Setup the Dependencies

    We need the langchain libraries. Ensure you have them installed: pip install langchain langchain-openai faiss-cpu.

    We will use the same data setup as the "Problem" section, but we will wrap the retriever in a compression pipeline.

    from langchain_openai import OpenAI
    from langchain.retrievers import ContextualCompressionRetriever
    from langchain.retrievers.document_compressors import LLMChainExtractor

    # Reuse the vectorstore and simple retriever from the 'Problem' section.
    # If you skipped running that code, ensure 'vectorstore' is defined as above.

    # Define the base retriever
    base_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

    Step 2: Initialize the Compressor

    The "Compressor" is actually a smaller LLM call. It takes a document and a query, and its prompt instructions are essentially: "Extract the parts of this document that are relevant to the query. If nothing is relevant, return nothing."

    We use LLMChainExtractor for this.

    # We use a standard OpenAI model for the extraction work
    llm = OpenAI(temperature=0)

    # Create the compressor
    compressor = LLMChainExtractor.from_llm(llm)

    # Why temperature=0?
    # We want the model to be precise and factual, not creative.
    # We are extracting data, not writing a poem.

    Step 3: Build the Pipeline

    Now we combine the base_retriever (which finds the rough area) with the compressor (which refines the results).

    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever
    )

    print("Pipeline created successfully.")

    Step 4: Execute the Compression

    Let's run the exact same query about the secret code. Watch how the output changes.

    query = "What is the secret code?"
    
    # Use the compression retriever instead of the base one
    

    compressed_docs = compression_retriever.invoke(query)

    print(f"--- Query: {query} ---")

    print(f"Number of docs retrieved: {len(compressed_docs)}")

    print("\n--- Content Passed to LLM (The Solution) ---")

    for i, doc in enumerate(compressed_docs):

    print(f"Chunk {i+1} Length: {len(doc.page_content)} characters")

    print(f"Chunk {i+1} Content: {doc.page_content}")

    What just happened?

  • The base_retriever grabbed 4 chunks (mostly banana bread and weather).
  • The compressor looked at the banana bread chunk, realized it had nothing to do with "secret code," and discarded it (or returned an empty string).
  • The compressor looked at the chunk containing "The secret code to the vault is 998877," extracted just that sentence, and discarded the surrounding text.
  • Your final output is likely just one short document containing the exact answer.

    Step 5: Connecting to a Generation Chain

    Now, let's see how this improves the final answer generation. We will create a standard Q&A chain using our refined documents.

    from langchain.chains import RetrievalQA

    # Create the chain using the COMPRESSION retriever
    qa_chain = RetrievalQA.from_chain_type(
        llm=OpenAI(temperature=0),
        retriever=compression_retriever
    )

    response = qa_chain.invoke({"query": query})

    print("\n--- Final AI Answer ---")
    print(response['result'])

    Step 6: Understanding the Trade-off

    It is important to understand the cost.

    * Basic RAG: 1 Embedding call + 1 Retrieval + 1 LLM Generation call (with lots of tokens).

    * Compression RAG: 1 Embedding call + 1 Retrieval + N LLM calls (one for each retrieved chunk to compress it) + 1 LLM Generation call (with very few tokens).

    Contextual compression trades compute time (making more calls) for token reduction and accuracy. It is slower, but smarter.

    Now You Try

    You have the basic compression pattern running. Now, let's extend it.

  • Try a Different Filter:

    Instead of LLMChainExtractor (which rewrites text), try using EmbeddingsFilter. This doesn't rewrite the text; it just drops chunks entirely if their similarity score isn't high enough (see the sketch after this list).

    Hint: Import EmbeddingsFilter from langchain.retrievers.document_compressors. It is faster and cheaper because it doesn't use an LLM for filtering.

  • The "Parent Document" Simulator:

    Create a dataset where the chunks are very small (e.g., 50 characters).

    Modify your code to store the source of the chunk (metadata).

    When you retrieve a chunk, write a Python function that prints: "Found chunk [text]... coming from Document ID [id]."

    Goal: Understand that the chunk is just a pointer to a larger reality.

  • Token Savings Calculator:

    Write a script that runs the base_retriever and the compression_retriever side-by-side.

    Calculate the total length of the strings returned by each.

    Print: "Original Context Size: X chars. Compressed Context Size: Y chars. Savings: Z%."

    Challenge Project: Image RAG

    Advanced RAG isn't just about text. A major pain point in technical fields is searching diagrams or screenshots. Since we cannot easily "embed" an image directly without complex multi-modal vector stores, we use a clever workaround: Describe, Embed, Retrieve.

    The Challenge:

    Build a system that allows you to "search" a folder of images using text.

    Requirements:
  • The Setup: Assume you have a local image file (e.g., a picture of a chart, a cat, or a landscape).
  • The Describer: Use GPT-4o (or GPT-4 Vision) to generate a detailed text description of the image.
  • The Storage: Store that text description in your Vector Store, but keep the file path to the image in the metadata.
  • The Retrieval: Query the system (e.g., "Find the chart showing Q3 growth").
  • The Result: The system should find the text description, look up the file path in the metadata, and print: "Found image at: ./images/chart_q3.png".

    Example Input/Output:

    * Input Image: red_ferrari.jpg
    * Generated Description: "A shiny red sports car driving on a coastal highway..."
    * User Query: "Show me pictures of fast cars."
    * System Output: "Result found: red_ferrari.jpg (Based on description: 'A shiny red sports car...')"

    Hints:

    * You are not embedding the image pixels. You are embedding the story of the image.
    * Use langchain.schema.Document to store the description as page_content and the filename as metadata={"source": "red_ferrari.jpg"}.
    * A starter sketch of the storage-and-retrieval half follows below.
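
    Here is a minimal starter sketch of the storage-and-retrieval half. The image paths and captions are made up for illustration, and describe_image() is a stub you would replace with a real GPT-4o (vision) call.

    from langchain_community.vectorstores import FAISS
    from langchain_core.documents import Document
    from langchain_openai import OpenAIEmbeddings

    # Stub: in the real challenge this would call a vision model (e.g., GPT-4o)
    # and return a detailed caption for the image file.
    def describe_image(path: str) -> str:
        fake_captions = {
            "./images/red_ferrari.jpg": "A shiny red sports car driving on a coastal highway.",
            "./images/chart_q3.png": "A bar chart showing revenue growth in Q3.",
        }
        return fake_captions[path]

    image_paths = ["./images/red_ferrari.jpg", "./images/chart_q3.png"]

    # Embed the *description* of each image; keep the file path in metadata
    image_docs = [
        Document(page_content=describe_image(path), metadata={"source": path})
        for path in image_paths
    ]
    image_store = FAISS.from_documents(image_docs, OpenAIEmbeddings())

    # "Search" the images with plain text
    results = image_store.similarity_search("Show me pictures of fast cars.", k=1)
    best = results[0]
    print(f"Result found: {best.metadata['source']} "
          f"(Based on description: '{best.page_content[:40]}...')")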

    What You Learned

    Today you moved beyond the "Hello World" of RAG systems. You tackled the messy reality of retrieving information.

    * Precision vs. Recall: You learned that getting everything (High Recall) is often bad for the LLM. You need specific things (High Precision).

    * Contextual Compression: You built a pipeline that reads and filters documents before the final answer is generated, saving tokens and reducing hallucinations.

    * Multi-modal Concepts: You explored how to make non-text data (images) searchable by converting them into text descriptions first.

    * GraphRAG (Concept): While we didn't code it, you learned that the next frontier is mapping relationships (nodes and edges) rather than just text similarity.

    Why This Matters:

    In a production application, users ask vague questions. If your RAG system dumps 10 pages of manual text into the chat window, the user will leave. If it gives a concise, accurate answer derived from those 10 pages, the user will trust it.

    Tomorrow:

    We put it all together. You have the components—loaders, splitters, vector stores, compressors. Tomorrow, we build a Production-Grade RAG Application with a proper user interface. Get ready to ship.