Retrieval Strategies
What You'll Build Today
Welcome to Day 42! We are deep in Phase 5, building Retrieval Augmented Generation (RAG) systems.
Yesterday, you built your first RAG pipeline. You successfully retrieved documents and fed them to an LLM. But you might have noticed something annoying: the system isn't very discerning. If you ask a question, it often grabs the first few things it finds, even if they are repetitive or barely relevant.
Today, we are going to upgrade your retrieval engine from "dumb closest-match grabbing" to "smart strategic selection."
Here is what you will master today:
* The "Echo Chamber" Problem: Why basic similarity search often fails by returning five versions of the exact same sentence.
* Maximal Marginal Relevance (MMR): A strategy to force your AI to look for *diverse* information, not just repetitive information.
* Score Thresholds: How to teach your AI to say "I don't know" instead of hallucinating an answer when the retrieved data is irrelevant.
* Top-k Selection: How to balance how many documents you scan versus how many you actually send to the LLM.
You are going to build a retrieval lab that visualizes the difference between "dumb" search and MMR search side-by-side.
The Problem
Let's start with the pain. When you use standard Similarity Search (which is what we did yesterday), the vector database calculates the distance between your query's embedding and every chunk of text in your database. It then returns the closest results.
This sounds perfect, but it has a fatal flaw: Redundancy.
Imagine you have a document about a new smartphone, the "Phone X." You chopped it into chunks. Several chunks might discuss the battery life.
If a user asks: "How is the battery on the Phone X?"
A standard similarity search might return three chunks that all restate the same 24-hour battery fact.
The database did its job perfectly. These are technically the most similar chunks. But they are useless to the LLM. You just wasted your context window telling the LLM the same fact three times, and you missed the chunk about wireless charging because it had a score of 0.89.
Let's look at code that creates this frustration.
import os
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document
# Set your API key
os.environ["OPENAI_API_KEY"] = "sk-..." # Replace with your key
# 1. Create a dataset with intentional redundancy
# Notice how documents 1, 2, and 3 are basically saying the same thing.
texts = [
"The Galaxy Quest 9 has a battery that lasts 24 hours.",
"Battery life on the Galaxy Quest 9 is rated for a full day (24 hours).",
"You can expect 24 hours of usage from the Galaxy Quest 9 battery.",
"The Galaxy Quest 9 supports 50W fast wireless charging.",
"The camera on the Galaxy Quest 9 is 200 megapixels.",
"The screen is a 6.8 inch OLED display."
]
# Convert to Document objects
docs = [Document(page_content=t) for t in texts]
# 2. Create the Vector Store
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(docs, embeddings)
# 3. The Painful Query
query = "Tell me about the battery features of Galaxy Quest 9"
# 4. Standard Similarity Search
print(f"--- Query: {query} ---")
print("--- Standard Similarity Search Results ---")
results = db.similarity_search(query, k=3)
for i, doc in enumerate(results):
    print(f"Result {i+1}: {doc.page_content}")
The Frustrating Output:
You will likely see the first three sentences, all about the 24-hour battery life. The search completely missed the "50W fast wireless charging" fact because the text "24 hours" was mathematically closer to the query than "wireless charging."
Your LLM will now answer the user: "The battery lasts 24 hours." It will say nothing about charging speed. This is a bad user experience.
There has to be a way to tell the database: "Get me the most relevant chunk, but for the next chunks, find me something new."
Let's Build It
We are going to solve this using MMR (Maximal Marginal Relevance).
MMR works in two steps:
1. It first casts a wide net, fetching a larger pool of candidate chunks that are similar to the query (the fetch_k you will see below).
2. It then picks the final k documents one at a time, each time choosing the candidate that is most relevant to the query but least similar to the documents already chosen.
In other words, it balances Relevance vs. Diversity.
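To make that concrete, here is a minimal sketch of the MMR selection loop, assuming you already have the query and document embeddings as NumPy arrays. The helper names (cosine_sim, mmr_select) are illustrative only; LangChain runs this logic for you internally when you call its MMR search.
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    # Returns the indices of k documents, balancing relevance and diversity
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine_sim(query_vec, doc_vecs[i])
            # Penalty: similarity to the closest already-selected document
            redundancy = max(
                (cosine_sim(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected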
Step 1: Setup and Data Creation
We will use the same setup as the problem section, but we will wrap it in a clean script so we can iterate.
import os
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document
# 1. Setup Data
# We create a mix of redundant info and distinct info
texts = [
"The Galaxy Quest 9 has a battery that lasts 24 hours.",
"Battery life on the Galaxy Quest 9 is rated for a full day (24 hours).",
"You can expect 24 hours of usage from the Galaxy Quest 9 battery.",
"The Galaxy Quest 9 supports 50W fast wireless charging.",
"The Galaxy Quest 9 takes 30 minutes to charge from 0 to 100%.",
"The camera on the Galaxy Quest 9 is 200 megapixels.",
"The screen is a 6.8 inch OLED display."
]
docs = [Document(page_content=t) for t in texts]
embeddings = OpenAIEmbeddings()
# We use a temporary collection name to ensure we start fresh
db = Chroma.from_documents(docs, embeddings, collection_name="mmr_test_lab")
print("Database created with redundant data.")
Step 2: The Baseline (Similarity Search)
Let's run the standard search again to establish our baseline. We want to retrieve 3 documents (k=3).
query = "Tell me about the battery performance and charging"
print(f"\nQUERY: '{query}'")
print("-" * 40)
print("STRATEGY: Basic Similarity Search (k=3)")
print("-" * 40)
# Basic search
basic_results = db.similarity_search(query, k=3)
for i, doc in enumerate(basic_results):
    print(f"[{i+1}] {doc.page_content}")
Expected Result: You will likely see the three sentences about "24 hours." The user asked about "charging," but the "24 hours" chunks overpowered the results.
Step 3: Implementing MMR
Now we switch to MMR. In LangChain, this is exposed via max_marginal_relevance_search.
It takes a few extra arguments:
* k: How many final documents you want (e.g., 3).
* fetch_k: How many documents to initially analyze (e.g., 10). We cast a wide net first.
* lambda_mult: The diversity slider.
* 1.0 = Pure similarity (same as basic search).
* 0.0 = Maximum diversity (might pick random irrelevant things).
* 0.5 = Balanced (default).
print("\n" + "-" * 40)
print("STRATEGY: MMR Search (k=3, fetch_k=10)")
print("-" * 40)
# MMR Search
# We fetch 10 candidates, but only return the top 3 diverse ones
mmr_results = db.max_marginal_relevance_search(
    query,
    k=3,
    fetch_k=10,
    lambda_mult=0.5
)
for i, doc in enumerate(mmr_results):
    print(f"[{i+1}] {doc.page_content}")
Expected Result: You should now see only one of the "24 hours" sentences, most likely alongside the charging-related chunks.
Notice how the other two "24 hour" sentences were skipped? That is MMR in action.
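A quick note on how you would use this inside a real RAG chain: rather than calling max_marginal_relevance_search by hand, you normally wrap the vector store as a retriever and let the chain call it for you. A minimal sketch, assuming a recent LangChain version where retrievers support .invoke():
# Wrap the vector store as a retriever that uses MMR under the hood
mmr_retriever = db.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "fetch_k": 10, "lambda_mult": 0.5},
)

# The retriever plugs straight into a RAG chain; invoke() returns Documents
for doc in mmr_retriever.invoke(query):
    print(doc.page_content)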
Step 4: Similarity Score Thresholds
There is another problem. What if the user asks: "How do I bake a cake?"
Your database only knows about phones. But vector search always returns results. It will return the phone battery info because, mathematically, that is the "least distant" text to the cake question (even if the distance is huge).
This causes hallucinations. We need a Threshold.
irrelevant_query = "How do I bake a chocolate cake?"
print(f"\nQUERY: '{irrelevant_query}'")
print("-" * 40)
print("STRATEGY: Similarity with Score Threshold (0.7)")
print("-" * 40)
# similarity_search_with_relevance_scores returns (Document, score) tuples
# Scores range from 0 to 1 (1 is perfect match)
results_with_scores = db.similarity_search_with_relevance_scores(
    irrelevant_query,
    k=3,
    score_threshold=0.7  # If similarity is below 0.7, ignore it
)
if not results_with_scores:
    print("No relevant documents found. (This is good!)")
else:
    for doc, score in results_with_scores:
        print(f"Score: {score:.2f} - {doc.page_content}")
Note on Scores: Different vector stores use different distance metrics (Cosine, Euclidean). In LangChain/Chroma/OpenAI, a score of roughly 0.7 to 0.8 is usually a good cutoff for "relevant."
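The same guardrail can also be baked into a retriever, so downstream chain code simply receives an empty list when nothing clears the bar. A minimal sketch, reusing the 0.7 cutoff from above:
# Retriever that silently drops anything below the relevance cutoff
threshold_retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.7},
)

docs_found = threshold_retriever.invoke(irrelevant_query)
if not docs_found:
    print("Retriever returned nothing, so the chain can safely say 'I don't know'.")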
Step 5: Clean Up
Vector stores persist data. It is good practice to clean up your test collection.
db.delete_collection()
print("\nTest collection deleted.")
Now You Try
You have the code. Now, push the boundaries.
Modify the lambda_mult parameter in the MMR step (a small sweep sketch follows this exercise).
* Set it to 0.9. Does it start behaving like basic search again?
* Set it to 0.1. Does it start pulling in the camera info even though we asked about battery? (Too much diversity can be distracting).
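Here is the small sweep mentioned above. It simply reruns the MMR search with a few lambda_mult values so you can compare the outputs side by side (the values chosen are just examples):
# Sweep lambda_mult to watch the relevance/diversity trade-off shift
for lam in [0.9, 0.5, 0.1]:
    print(f"\nlambda_mult = {lam}")
    for doc in db.max_marginal_relevance_search(query, k=3, fetch_k=10, lambda_mult=lam):
        print(f"  - {doc.page_content}")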
Create a new query: "Tell me everything about the phone."
Set k=2. Compare Basic Search vs MMR.
Basic Search might give you two battery facts. MMR should give you one battery fact and one camera fact. This is crucial for "summary" type queries.
Add a sentence to your texts list: "I love eating chocolate cake."
Re-run the "How do I bake a cake?" query.
Find the exact score_threshold number where the system stops returning "Phone" info but does return the "I love eating chocolate cake" sentence (the score-printing sketch after this exercise can help).
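One way to hunt for that exact cutoff is to print the raw relevance scores first, with no threshold, and read the break point off the output. A minimal sketch (cake_query is just a local name for the query above, and k is set to cover all eight chunks in the modified lab):
# Inspect raw scores to see where the phone chunks end and the cake chunk begins
cake_query = "How do I bake a chocolate cake?"
for doc, score in db.similarity_search_with_relevance_scores(cake_query, k=8):
    print(f"{score:.3f}  {doc.page_content}")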
Challenge Project: The Semantic Failure
We often praise Semantic Search (vectors) as magical, but it has a weakness: Exact Keyword Matching.
If you search for a specific ID number, vectors often fail. Vectors understand that "dog" is close to "cat," but they struggle to understand that "ID-998" is totally different from "ID-999."
Your Goal: Create a script that demonstrates a case where Semantic Search fails to find the right document, but a simple keyword search would have succeeded (a starter scaffold follows the requirements).
Requirements:
* Doc A: "Project Alpha ID: 5558"
* Doc B: "Project Alpha ID: 5559"
* Query: "Status of 5559"
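If you want a starting point, here is a hedged starter scaffold (the collection name and variable names are placeholders, and it assumes the imports and embeddings object from the lab above). Your job is to observe what the semantic search actually returns for the ID query and contrast it with the plain keyword check:
# Starter scaffold for the "Semantic Failure" challenge
id_docs = [
    Document(page_content="Project Alpha ID: 5558"),
    Document(page_content="Project Alpha ID: 5559"),
]
id_db = Chroma.from_documents(id_docs, embeddings, collection_name="id_challenge")

semantic_hit = id_db.similarity_search("Status of 5559", k=1)[0]
print("Semantic search returned:", semantic_hit.page_content)

# A plain keyword check for comparison
keyword_hits = [d for d in id_docs if "5559" in d.page_content]
print("Keyword search returned:", keyword_hits[0].page_content)

id_db.delete_collection()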
What You Learned
Today you moved beyond "Hello World" retrieval. You learned that retrieving more isn't always better; retrieving smarter is key.
* Similarity Search: Good for finding general matches, bad at handling redundancy.
* MMR (Maximal Marginal Relevance): The fix for redundancy. It forces the system to fetch diverse perspectives on the topic.
* Thresholds: The guardrails that prevent your AI from answering questions it has no data for.
Why This Matters: In production, your users won't ask perfectly phrased questions. They will ask vague things. If you use basic search, your RAG system will likely fill the context window with 5 variations of the same paragraph, leaving no room for the actual answer. MMR ensures your LLM gets a complete picture of the topic.
Tomorrow: We have seen where Semantic Search wins, and we have hinted at where it fails (the Challenge Project). Tomorrow, we combine them. We will build Hybrid Search, combining the precision of keyword search (BM25) with the understanding of semantic search. Best of both worlds.