Re-Ranking for Quality
What You'll Build Today
Welcome to Day 44! You have already built a functional RAG (Retrieval-Augmented Generation) system. You know how to chop up documents, store them as vectors, and retrieve them. But you have probably noticed something annoying: sometimes, your system retrieves "technically correct" documents that don't actually answer the user's question.
Today, we are going to fix that by building a Re-Ranking Pipeline.
Think of your current vector search as a fast librarian who sprints through the aisles and grabs 20 books that might be relevant based on the cover. Today, we are hiring a specialized researcher to sit down, read those 20 books carefully, and hand you the best 3.
Here is what you will master today:
* Two-Stage Retrieval: Why relying solely on vector search (bi-encoders) is often too inaccurate for production apps.
* The Cross-Encoder: A different type of AI model that is slower but significantly smarter at judging relevance.
* Cohere Rerank API: How to use an industry-standard tool to re-order your search results instantly.
* Precision vs. Recall: The strategy of casting a wide net first (high recall), then filtering for quality (high precision).
Let's turn your "okay" search engine into an "excellent" answer engine.
---
The Problem
Let's look at why vector search (what we have used so far) sometimes fails.
Vector search uses Bi-Encoders. It turns the query into a list of numbers (a vector) and the document into a list of numbers. It compares the numbers. This is incredibly fast, allowing you to search millions of documents in milliseconds.
However, in this compression process, a lot of nuance is lost. Vector search is great at finding general topics, but bad at understanding specific relationships or negation.
Imagine you have a document about "Python" (the snake) and "Python" (the code). If you search "How to feed a Python?", vector search might return coding tutorials because the word overlap is high, or it might rank the snake document at #10 because the vectors for "feed" and "input data" are mathematically close in some contexts.
If your LLM only reads the top 3 results, and the actual answer is at #7, your RAG system fails.
Here is a simulation of this frustration. We will use a tiny dataset of sentences. We want to find out what to do if it rains, but vector search gets distracted by other weather terms.
Prerequisites: You will need the `sentence-transformers` library for the "bad" search, and `cohere` for the fix.
```bash
pip install sentence-transformers cohere numpy
```
Here is the broken experience:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

# 1. The Setup: A simple knowledge base
documents = [
    "The sun is a star found at the center of the solar system.",
    "Rain boots are recommended when walking outside in a storm.",  # The answer we want
    "Solar panels generate energy from the sun.",
    "The weather forecast predicts clear skies today.",
    "Umbrellas are useful for shielding against rain.",  # Also a good answer
    "Wind turbines create electricity from air currents.",
    "Storm chasers drive vehicles to track tornadoes.",
    "Water is essential for all known forms of life."
]

# 2. The Query
query = "What should I wear for a storm?"

# 3. The "Old Way" - Vector Search (Bi-Encoder)
# We load a small model to simulate vector search
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode docs and query
doc_embeddings = model.encode(documents)
query_embedding = model.encode(query)

# Calculate similarity (Dot Product)
scores = np.dot(doc_embeddings, query_embedding)

# Sort results (highest score first)
# We use argsort, then reverse it to get descending order
ranked_indices = np.argsort(scores)[::-1]

print(f"Query: {query}\n")
print("--- Vector Search Results (Top 3 passed to LLM) ---")
for i in range(3):
    idx = ranked_indices[i]
    print(f"Rank #{i+1}: {documents[idx]} (Score: {scores[idx]:.4f})")

print("\n--- The Missed Opportunity ---")
# Let's look further down the list
for i in range(3, 6):
    idx = ranked_indices[i]
    print(f"Rank #{i+1}: {documents[idx]} (Score: {scores[idx]:.4f})")
```
The Pain:
Run this code. Depending on the model's nuances, you will often see "Storm chasers drive vehicles..." or "The weather forecast..." ranked very high, because they contain the word "storm" or sit close to the query in general "weather" space.
Meanwhile, "Rain boots are recommended..." might slip to rank #4 or #5.
If your RAG pipeline only sends the top 3 chunks to the LLM, the LLM will say: "I don't know what you should wear, but I know storm chasers drive vehicles."
That is a bad user experience. We need to fix the ranking.
---
Let's Build It
We will implement a Two-Stage Retrieval system.
Stage 1 (Retrieval): Use vector search to get the top 20 results. This is our "wide net." We accept that some will be irrelevant.
Stage 2 (Reranking): Use a Cross-Encoder (via Cohere) to analyze those 20 pairs deeply and score them accurately. We take the top 5 from this list.
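For reference, Stage 1 is essentially what we already did in the Problem section. Here is a quick sketch, assuming the `scores` and `documents` variables from that code are still in scope (in a real app, your vector database handles this step for you):

```python
# Stage 1 sketch: reuse the bi-encoder scores from the Problem section.
TOP_K = 20  # cast a wide net; our toy dataset is smaller, so we guard with min()
candidate_indices = np.argsort(scores)[::-1][:min(TOP_K, len(documents))]
candidate_docs = [documents[i] for i in candidate_indices]
print(f"Stage 1: kept {len(candidate_docs)} candidates for reranking.")
```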
Step 1: Get Your API Key
We will use Cohere because they have a specialized endpoint just for this.
1. Go to [dashboard.cohere.com](https://dashboard.cohere.com).
2. Sign up or log in.
3. Create a Trial API Key.
Step 2: Define the Data and "Wide Net" Retrieval
We will reuse the data from the problem section, but let's assume we have already run the vector search and retrieved a large candidate list (e.g., top 10).
In a real app, this list comes from your Vector Database (Chroma, Pinecone, etc). Here, we will simulate that we retrieved 10 mixed-quality results.
```python
import cohere

# Initialize the client
# REPLACE WITH YOUR ACTUAL API KEY
co = cohere.Client('YOUR_COHERE_API_KEY')

# The user query
query = "What should I wear for a storm?"

# The "Wide Net" - Imagine we retrieved these 10 from our Vector DB
# Note: I've mixed relevant answers with irrelevant ones that share keywords
retrieved_docs = [
    "Storm chasers drive vehicles to track tornadoes.",             # Keyword match "storm"
    "The weather forecast predicts clear skies today.",             # Semantic match "weather"
    "Solar panels generate energy from the sun.",                   # Irrelevant
    "Rain boots are recommended when walking outside in a storm.",  # THE ANSWER
    "Wind turbines create electricity from air currents.",          # Irrelevant
    "Umbrellas are useful for shielding against rain.",             # Relevant
    "Thunderstorms can cause power outages.",                       # Context, but not the answer
    "Cotton clothes are breathable in summer.",                     # Clothing, but wrong context
    "Winter coats are heavy and warm.",                             # Clothing, wrong context
    "Lightning strikes are dangerous during a storm."               # Context
]

print(f"Stage 1: Retrieved {len(retrieved_docs)} documents from Vector DB.")
```
Step 3: The Rerank
Now we call the `rerank` endpoint. This uses a Cross-Encoder.
Unlike a Bi-Encoder (which compares two independently computed vectors), a Cross-Encoder takes the query and the document together as a single input and outputs a relevance score. It "reads" the query in the context of the document. It is much slower, which is why we only run it on 10 or 20 items, not the whole database.
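If you want to see the mechanics without an API call, here is a minimal local sketch using the `sentence-transformers` library we already installed. The model `cross-encoder/ms-marco-MiniLM-L-6-v2` is just a small open reranker chosen for illustration (it is not what Cohere runs), and the code assumes `query` and `retrieved_docs` from Step 2:

```python
from sentence_transformers import CrossEncoder

# A small open-source cross-encoder (downloads on first use)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Each candidate is scored together with the query as a (query, document) pair
pairs = [(query, doc) for doc in retrieved_docs]
ce_scores = cross_encoder.predict(pairs)

# Higher score = more relevant (these raw scores are not 0-to-1 probabilities)
ranked = sorted(zip(retrieved_docs, ce_scores), key=lambda x: x[1], reverse=True)
for doc, score in ranked[:3]:
    print(f"{score:.2f}  {doc}")
```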
```python
# Call Cohere Rerank
# model="rerank-english-v3.0" is optimized for English queries
results = co.rerank(
    query=query,
    documents=retrieved_docs,
    top_n=3,  # We only want the absolute best 3 for our LLM
    model="rerank-english-v3.0"
)

print("\nStage 2: Reranking complete.")
```
print("\nStage 2: Reranking complete.")
Step 4: Display the Results
The `results` object contains the new order and relevance scores. Let's see if it found the needle in the haystack.
print(f"\nQuery: {query}")
print("-" * 50)
for idx, result in enumerate(results.results):
# result.index is the index in the original list
# result.relevance_score is how confident the model is (0 to 1)
original_doc = retrieved_docs[result.index]
score = result.relevance_score
print(f"Rank #{idx + 1} (Score: {score:.4f})")
print(f"Content: {original_doc}")
print("-" * 50)
Run this code.
You should see a dramatic improvement. "Rain boots..." and "Umbrellas..." should jump to the top, even if they weren't at the top of the original list. Notice the scores: they range from 0 to 1, and higher means the model is more confident the document answers the question. A score of 0.99 means "I am almost certain this answers the question."
Step 5: Filtering by Score
Sometimes, even the best result in your list is bad. If you search for "How to bake a cake" in our weather dataset, the "best" match might be "Solar panels..." simply because it is the least irrelevant option available.
We should add a threshold. If the best score is below 0.5, we shouldn't send it to the LLM.
print("\n--- Filtering Bad Matches ---")
bad_query = "How do I bake a cake?"
bad_results = co.rerank(
query=bad_query,
documents=retrieved_docs,
top_n=3,
model="rerank-english-v3.0"
)
for result in bad_results.results:
if result.relevance_score < 0.5:
print(f"Skipping document (Score {result.relevance_score:.4f} is too low)")
else:
print(f"Keeping document: {retrieved_docs[result.index]}")
This prevents hallucinations. If the retrieval is bad, it's better to say "I don't know" than to answer using irrelevant data.
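To tie Steps 2-5 together, here is one way you could wrap the second stage in a reusable helper for your RAG pipeline. The function name, the 0.5 cutoff, and the empty-list fallback are choices made for this sketch rather than requirements of the Cohere API, and the example call assumes the `co`, `query`, and `retrieved_docs` variables defined above.

```python
def rerank_with_guardrail(co_client, query, candidates, top_n=3, threshold=0.5):
    """Rerank candidates and drop anything the model is not confident about."""
    response = co_client.rerank(
        query=query,
        documents=candidates,
        top_n=top_n,
        model="rerank-english-v3.0"
    )
    # Keep only results above the threshold, preserving the reranked order
    return [
        candidates[r.index]
        for r in response.results
        if r.relevance_score >= threshold
    ]

# If this comes back empty, answer "I don't know" instead of guessing
best_chunks = rerank_with_guardrail(co, query, retrieved_docs)
print(best_chunks)
```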
---
Now You Try
You have the basic logic. Now let's push it further.
The "Top K" Experiment:
Modify the code to retrieve the top 5 results instead of 3 by changing `top_n=5`. Observe how the relevance scores drop off as you go down the list. Is the difference between Rank 1 and Rank 5 large or small?
JSON Data Handling:
In the real world, your documents aren't just strings; they are objects with titles, dates, and URLs.
Create a list of dictionaries:
```python
docs = [
    {"text": "Rain boots...", "source": "manual.pdf"},
    {"text": "Solar panels...", "source": "science_book.txt"}
]
```
Modify the rerank call to pass `[d['text'] for d in docs]` as the documents list, but when printing the results, use `result.index` to look up and print the source from your original dictionary, as in the sketch below.
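Here is one minimal way that could look, assuming the two-item `docs` list above (the truncated texts are just placeholders):

```python
# Send only the text field to the reranker
texts = [d["text"] for d in docs]

json_results = co.rerank(
    query="What should I wear for a storm?",
    documents=texts,
    top_n=2,
    model="rerank-english-v3.0"
)

# result.index points back into `docs`, so we can recover the metadata
for result in json_results.results:
    original = docs[result.index]
    print(f"{result.relevance_score:.4f}  {original['text']}  (source: {original['source']})")
```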
Latency Check:
Import the `time` library and wrap the `co.rerank` call in a timer:
```python
import time

start = time.time()
# ... rerank call ...
end = time.time()
print(f"Reranking took {end - start} seconds")
```
Compare this speed to the simple vector search from the "Problem" section. Notice the difference? This is why we don't rerank the whole database!
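For a like-for-like comparison, you can time the bi-encoder search from the Problem section the same way; this sketch assumes `model`, `doc_embeddings`, `np`, and `query` are still in scope. Keep in mind that in production the document embeddings are computed once at indexing time, so only the query encoding and the dot product count per request:

```python
start = time.time()
query_embedding = model.encode(query)             # per-request work
scores = np.dot(doc_embeddings, query_embedding)  # similarity against the whole index
end = time.time()
print(f"Vector search took {end - start} seconds")
```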
---
Challenge Project: The Accuracy Benchmark
Your manager wants to know if paying for Cohere Rerank is worth it. You need to prove it with data.
The Task:
Create a script that compares "First Result Accuracy" between standard Vector Search and Reranked Search.
Requirements:
Create a dataset of 10 diverse sentences (mix of topics: food, tech, nature).
Define 3 queries where the answer is tricky (requires understanding, not just keyword matching).
For each query:
* Run standard Vector Search (using `sentence-transformers`). Get the Top 1 result.
* Run Reranking (using `cohere` on the top 10 vector results). Get the Top 1 result.
Print a comparison table showing which method picked the correct sentence.
Example Input/Output:
Query: "What powers a computer?"
Correct Answer: "Electricity flows through circuits."
Vector Search picked: "Electric eels swim in water." (Incorrect)
Rerank Search picked: "Electricity flows through circuits." (Correct)
Score: Vector 0/1, Rerank 1/1
Hints:
* You will need to manually define what the "Correct Answer" string is for your script to check against.
* Use `if picked_sentence == correct_answer_sentence` to score it automatically.
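If you get stuck, here is one possible skeleton for the benchmark. It assumes you have already built `documents`, `doc_embeddings`, `model` (the bi-encoder), and `co` for your new 10-sentence dataset; the benchmark entries are placeholders to replace with your own queries and answer keys.

```python
# Placeholder benchmark -- replace with your own tricky queries and answer keys
benchmark = [
    {"query": "What powers a computer?",
     "answer": "Electricity flows through circuits."},
    # ... two more queries ...
]

vector_score = 0
rerank_score = 0

for case in benchmark:
    q = case["query"]

    # Method 1: plain vector search, take the single best hit
    sims = np.dot(doc_embeddings, model.encode(q))
    vector_pick = documents[int(np.argmax(sims))]

    # Method 2: rerank the candidate list, take the single best hit
    reranked = co.rerank(query=q, documents=documents,
                         top_n=1, model="rerank-english-v3.0")
    rerank_pick = documents[reranked.results[0].index]

    vector_score += (vector_pick == case["answer"])
    rerank_score += (rerank_pick == case["answer"])

    print(f"Query: {q}")
    print(f"  Vector picked: {vector_pick}")
    print(f"  Rerank picked: {rerank_pick}")

print(f"Score: Vector {vector_score}/{len(benchmark)}, Rerank {rerank_score}/{len(benchmark)}")
```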
---
What You Learned
Today you moved from "Basic RAG" to "Production RAG."
* Bi-Encoders (Vector Search): Fast, cheap, but sometimes shallow. Good for finding the "neighborhood" of the answer.
* Cross-Encoders (Reranking): Slower, more expensive, but deeply intelligent. Good for finding the exact address.
* The Pipeline: `Query -> Vector Search (Top 20) -> Rerank (Top 5) -> LLM`.
Why This Matters:
In a corporate setting, giving the wrong policy document to an employee is worse than giving no document. Reranking is the most effective way to increase the accuracy of your system without training your own custom AI models. It is the "easy button" for higher quality.
Tomorrow: Even the best reranker fails if the user asks a garbage question. Tomorrow, we will cover Query Enhancement: how to use an LLM to rewrite the user's bad query into a good one before we even start searching.