Day 43 of 80

Hybrid Search

Phase 5: RAG Systems

What You'll Build Today

Welcome to Day 43! Today, we are going to solve one of the most frustrating problems in AI search systems: the trade-off between "exactness" and "meaning."

By now, you know that vector search (semantic search) is amazing at understanding context. But have you noticed it can be a bit... fuzzy? Sometimes you want to find a specific error code, a part number, or a proper noun, and vector search tries too hard to find the "vibe" rather than the exact word.

Today, we will build a Hybrid Search Engine.

Here is what you will master:

* BM25 (Best Matching 25): You will learn the industry-standard algorithm for keyword scoring (why "term frequency" matters).

* Score Normalization: You will learn why you cannot simply add a keyword score (like 15.5) to a vector score (like 0.88) without math to balance them.

* Alpha Weighting: You will build a system that lets you slide a dial between "Keyword Heavy" and "Semantic Heavy" to find the perfect balance.

* Reciprocal Rank Fusion (Concept): You will understand how modern databases combine these lists.

Let's make your search engine smarter and more precise.

The Problem

Let's look at a scenario that drives developers crazy.

Imagine you are building a search tool for an IT support team. You have a database of technical issues. You have two specific documents:

"The internet connection is unstable and drops frequently."

"Error-404: The requested resource could not be found on the server."

You have two different users with two different queries.

User A searches for: "wifi keeps disconnecting"

User B searches for: "Error-404"

If you use only Keyword Search:

* User A fails. The document says "internet connection," not "wifi." Keyword search sees zero overlap.

* User B succeeds. "Error-404" is an exact match.

If you use only Semantic (Vector) Search:

* User A succeeds. The AI understands that "wifi disconnecting" means "unstable connection."

* User B might fail (or get a low ranking). To an embedding model, "Error-404" looks like a generic number or computer term. It might return the first document, because "unstable connection" is semantically close to "server issues," and miss the specificity of the exact code.

Here is the code that demonstrates this frustration. We will use a simple list of sentences to simulate our database.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# We need a way to turn text into vectors. 
# For this "Problem" section, we will simulate the vector scores 
# to avoid needing an API key just to see the failure.

# Our Database
documents = [
    "The internet connection is unstable and drops frequently.", # Doc 0
    "Error-404: The requested resource could not be found."      # Doc 1
]

# Scenario 1: Keyword Search for "wifi"
# It looks for the exact string "wifi"
query_keyword = "wifi"
results_keyword = [doc for doc in documents if query_keyword in doc.lower()]
print(f"Keyword Search for '{query_keyword}': {results_keyword}") 
# Result: Empty list. Frustrating!

# Scenario 2: Semantic Search for "Error-404"
# Let's pretend we ran embeddings. 
# Semantic models often treat numbers as generic concepts.
# It might think Doc 0 (connection issues) is 'semantically' similar to a server error.
# Let's say the model gives these scores:
score_doc_0 = 0.85  # High, because it's tech related
score_doc_1 = 0.82  # Lower, because numbers confuse it

print(f"Semantic Search for 'Error-404' prefers: Doc 0 (Score {score_doc_0})")
# Result: It returns the WRONG document because it ignored the exact ID match.

This is the pain point. Keyword search is dumb but precise. Semantic search is smart but fuzzy. We need both.

Let's Build It

We are going to implement a Hybrid Search system from scratch. We will use rank_bm25 for the keyword part and OpenAI for the semantic part.

Prerequisites

You will need to install the BM25 library and OpenAI.

pip install rank_bm25 openai numpy scikit-learn

Step 1: Setup Data and Embeddings

First, let's create a small dataset that contains both generic concepts and specific identifiers (like "SKU-999"). We will also set up a helper function to get embeddings.

import os
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

# Initialize OpenAI Client
# Make sure your API key is set in your environment variables
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Our Knowledge Base: A mix of generic descriptions and specific codes
documents = [
    "The glowing orb allows for night vision. SKU-999",    # Doc 0
    "A standard wooden chair with four legs.",             # Doc 1
    "Wireless headphones with noise cancellation.",        # Doc 2
    "SKU-999 is currently out of stock.",                  # Doc 3
    "The battery life of the orb is 24 hours."             # Doc 4
]

def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

# Pre-calculate embeddings for our database
# In a real app, this would be stored in a Vector DB
doc_embeddings = [get_embedding(doc) for doc in documents]
doc_embeddings = np.array(doc_embeddings)

print("Database indexed and embedded.")


Step 2: Implement Keyword Search (BM25)

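BM25 is an algorithm that ranks documents by how often the query words appear in them. It gives less weight to words that show up in many documents (like "the" or "and"), and it adjusts for document length so that long documents do not win just by containing more words.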
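To make those moving parts concrete, here is a minimal from-scratch sketch of BM25 scoring. It is not the code inside rank_bm25: the function name bm25_scores, the toy documents, and the "+1" form of the IDF (chosen so the weight never goes negative) are assumptions made for illustration, and k1=1.5 and b=0.75 are simply common default values.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a simplified BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_counts = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_counts[term]
            if tf == 0:
                continue  # A term that is absent contributes nothing
            # Rare terms get a larger weight (assumed "+1" IDF variant)
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # Longer-than-average documents need more hits for the same score
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus, already tokenized
toy_docs = [
    "the orb glows sku-999".split(),
    "the chair is wooden".split(),
    "the sku-999 item is out of stock today".split(),
]

sku_scores = bm25_scores(["sku-999"], toy_docs)
the_scores = bm25_scores(["the"], toy_docs)
print("Scores for 'sku-999':", sku_scores)
print("Scores for 'the':    ", the_scores)
```

In this toy corpus the chair document scores zero for "sku-999", the short orb document outscores the longer stock document, and the word "the" (present in every document) earns a much smaller score than the rarer "sku-999".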

We need to "tokenize" our text (break sentences into lists of words) before feeding it to BM25.

from rank_bm25 import BM25Okapi

# 1. Tokenize the corpus (split into words)
# We simply split by space and lowercase for this demo
tokenized_corpus = [doc.lower().split(" ") for doc in documents]

# 2. Initialize BM25 with our corpus
bm25 = BM25Okapi(tokenized_corpus)

def get_bm25_scores(query, bm25_obj):
    # Tokenize the query the same way we did the documents
    query_tokens = query.lower().split(" ")
    # Get scores
    scores = bm25_obj.get_scores(query_tokens)
    return scores

# Test it
test_query = "SKU-999"
keyword_scores = get_bm25_scores(test_query, bm25)

print(f"Query: {test_query}")
print("BM25 Scores:", keyword_scores)
# Notice: Docs containing 'SKU-999' have high scores. Others have 0.
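One caveat about the demo tokenizer: splitting on spaces leaves punctuation attached, so "stock." and "stock" count as different words, and "Error-404:" would not match the query "Error-404". The sketch below shows one possible alternative using a regular expression. The helper name simple_tokenize and the pattern itself are assumptions for illustration, not part of rank_bm25.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and keep runs of letters/digits, allowing inner hyphens."""
    # "sku-999" stays one token; trailing "." or ":" is dropped
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

sentence = "SKU-999 is currently out of stock."

# Naive split keeps the period glued to the last word
naive_tokens = sentence.lower().split(" ")

# Regex tokenizer strips it while preserving the hyphenated code
clean_tokens = simple_tokenize(sentence)

print("Naive:", naive_tokens)
print("Regex:", clean_tokens)
print("Code with colon:", simple_tokenize("Error-404: The requested resource could not be found."))
```

If you swap a tokenizer like this in, use the same function for both the corpus and the query so the tokens line up.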


Step 3: Implement Semantic Search

This is the vector search we have done before. We embed the query and calculate cosine similarity against our document embeddings.

def get_semantic_scores(query, doc_embeddings):
    # 1. Embed the query
    query_vec = np.array(get_embedding(query)).reshape(1, -1)
    
    # 2. Calculate Cosine Similarity
    # The result is a 2D array of shape (1, number_of_docs), so we flatten it to a simple 1D array
    scores = cosine_similarity(query_vec, doc_embeddings).flatten()
    return scores

# Test it
semantic_scores = get_semantic_scores(test_query, doc_embeddings)

print(f"Query: {test_query}")
print("Semantic Scores:", semantic_scores)
# Notice: All docs have SOME score, because the cosine similarity between
# two real embeddings is almost never exactly zero.
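If you want to see what cosine similarity measures without calling an API, here is a small sketch that computes it by hand with NumPy. The two-dimensional vectors are made-up stand-ins for real embeddings (which have hundreds or thousands of dimensions), and the helper name cosine is an assumption for this example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "embeddings"
query_vec = [1.0, 0.0]
same_direction = [3.0, 0.0]   # Points the same way, just longer
unrelated = [0.0, 2.0]        # Perpendicular to the query
opposite = [-1.0, 0.0]        # Points the opposite way

sim_same = cosine(query_vec, same_direction)
sim_unrelated = cosine(query_vec, unrelated)
sim_opposite = cosine(query_vec, opposite)

print("Same direction:", sim_same)
print("Unrelated:     ", sim_unrelated)
print("Opposite:      ", sim_opposite)
```

Only the direction of the vectors matters, not their length, which is why the score always lands between -1 and 1.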


Step 4: The Normalization Problem

Here is the catch.

Look at your previous outputs.

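* BM25 scores have no fixed upper bound. They might look like [0, 0, 2.5, 0] on one corpus and be much larger or much smaller on another, because the scale depends on corpus size, term rarity, and document length.

* Semantic scores (cosine similarity) always fall between -1 and 1, and in practice they tend to cluster in a narrow band whose position depends on the embedding model.

If you add a BM25 score of 2.5 to a semantic score of 0.8, the BM25 score completely dominates. The semantic score becomes irrelevant. We need to Normalize both sets of scores so they are both on a scale of 0.0 to 1.0.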



We will use "Min-Max Scaling."

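def normalize_scores(scores):
    scores = np.asarray(scores, dtype=float)
    # If every score is identical (e.g., BM25 found no match and returned all zeros),
    # there is no ranking signal, so return zeros and avoid division by zero
    if np.max(scores) == np.min(scores):
        return np.zeros_like(scores)
        
    # Formula: (x - min) / (max - min)
    return (scores - np.min(scores)) / (np.max(scores) - np.min(scores))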

# Let's see the difference
norm_bm25 = normalize_scores(keyword_scores)
norm_semantic = normalize_scores(semantic_scores)

print("Normalized BM25:", norm_bm25)
print("Normalized Semantic:", norm_semantic)
# Now both arrays range from 0.0 to 1.0. We can combine them safely!
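To see why this step matters, the sketch below mixes the same two score lists twice: once raw, once after min-max scaling. All numbers are invented for illustration (they are not output from the code above), and the helper min_max is a standalone stand-in for the normalization function so the snippet runs on its own.

```python
import numpy as np

def min_max(scores):
    """Scale scores to the 0-1 range; return zeros if they are all identical."""
    scores = np.asarray(scores, dtype=float)
    spread = scores.max() - scores.min()
    if spread == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / spread

# Made-up scores for five documents
bm25_raw = np.array([2.5, 0.0, 0.0, 1.8, 0.0])          # Keyword side prefers Doc 0
semantic_raw = np.array([0.30, 0.62, 0.20, 0.28, 0.35])  # Semantic side prefers Doc 1

alpha = 0.8  # We ASK for a semantic-heavy blend

# Mixing raw scores: BM25's larger scale still decides the winner
raw_mix = (1 - alpha) * bm25_raw + alpha * semantic_raw

# Mixing normalized scores: alpha now behaves as intended
norm_mix = (1 - alpha) * min_max(bm25_raw) + alpha * min_max(semantic_raw)

raw_winner = int(np.argmax(raw_mix))
norm_winner = int(np.argmax(norm_mix))
print("Winner with raw scores:       Doc", raw_winner)
print("Winner with normalized scores: Doc", norm_winner)
```

With these made-up numbers, the raw blend still ranks Doc 0 first even at alpha = 0.8, while the normalized blend ranks Doc 1 first, which is what a semantic-heavy setting should do.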


Step 5: Hybrid Fusion

Now we combine them. We use a variable often called alpha (or weight).

* alpha = 1.0 means 100% Semantic Search.

* alpha = 0.0 means 100% Keyword Search.

* alpha = 0.5 means equal weight.



def hybrid_search(query, alpha=0.5):
    print(f"\n--- Hybrid Search: '{query}' (Alpha: {alpha}) ---")
    
    # 1. Get raw scores
    raw_bm25 = get_bm25_scores(query, bm25)
    raw_semantic = get_semantic_scores(query, doc_embeddings)
    
    # 2. Normalize
    norm_bm25 = normalize_scores(raw_bm25)
    norm_semantic = normalize_scores(raw_semantic)
    
    # 3. Weighted Sum
    # Formula: (1 - alpha) * Keyword + alpha * Semantic
    final_scores = (1 - alpha) * norm_bm25 + alpha * norm_semantic
    
    # 4. Print Results sorted by score
    # Create a list of (index, score) tuples
    results = list(enumerate(final_scores))
    
    # Sort descending by score
    results.sort(key=lambda x: x[1], reverse=True)
    
    for idx, score in results:
        print(f"Score: {score:.4f} | Doc: {documents[idx]}")

# Run the magic
# Case 1: Searching for a specific code (Needs Keywords)
hybrid_search("SKU-999", alpha=0.3) 

# Case 2: Searching for a concept (Needs Semantic)
hybrid_search("something to sit on", alpha=0.8) 


Run this code. Notice how in Case 1, we lower the alpha (0.3) to favor the exact match of the SKU. In Case 2, we raise the alpha (0.8) because "something to sit on" requires understanding that a "chair" is for sitting, even though the word "sit" isn't in the chair document.
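The introduction mentioned Reciprocal Rank Fusion (RRF) as another way to combine the two lists. Instead of blending scores, RRF looks only at each document's position in each ranking, so no normalization is needed. Here is a minimal sketch of the idea. The function name, the toy rankings, and the choice of k = 60 (a commonly used constant) are assumptions for illustration, not the implementation of any particular database.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Combine several ranked lists of document IDs into one fused ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rankings (best first) from the two retrievers
keyword_ranking = [3, 0, 4, 1, 2]
semantic_ranking = [0, 4, 3, 2, 1]

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
fused_order = [doc_id for doc_id, _ in fused]

for doc_id, score in fused:
    print(f"Doc {doc_id}: {score:.5f}")
```

Doc 0 comes out on top here because it sits near the top of both lists, even though it is first in only one of them. The trade-off compared with alpha weighting is that you lose the dial: RRF ignores how large the gap between scores was and only uses the order.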

Now You Try

You have a working hybrid engine. Now, let's experiment with it.

The "Out of Vocabulary" Test:

* Add a new document: "The device creates a luminous glow." Remember to rebuild tokenized_corpus, bm25, and doc_embeddings afterwards, because all three were computed from the old list.

* Search for "glowing orb".

* Run the search with alpha=0.0 (pure keyword) vs alpha=1.0 (pure semantic).

* Observation: Keyword search might miss the new document because it lacks the word "orb" or "glowing" exactly, but semantic search should catch "luminous glow" as similar.

The Acronym Extension:

* Add a document: "NASA explores space."

* Add another: "National Aeronautics and Space Administration headquarters."

* Search for "NASA".

* Adjust alpha to see how the exact match (NASA) compares to the spelled-out version.

The Alpha Loop:

* Write a loop that runs the same query ("SKU-999") five times, stepping alpha through 0.0, 0.25, 0.5, 0.75, and 1.0. Print the top result for each. Watch how the winner changes as you slide from keyword to semantic.



Challenge Project: The Tunable Medical Search

In medical data, precision is life-or-death. You cannot hallucinate a drug name, but you also need to match vague symptoms.

Your Task:
Create a search system for a mini medical database.

Requirements:
Database: Create a list of 5-6 strings containing drug names (e.g., "Aspirin", "Ibuprofen") and symptom descriptions (e.g., "relieves headaches", "reduces inflammation").

The Slider Function: Create a function find_best_match(query) that automatically runs the search with three different alphas: 0.2 (Strict), 0.5 (Balanced), and 0.8 (Loose).

Output: For a single query, print the top result for each of the three alpha settings side-by-side.

Test Queries:

* Query 1: "Ibuprofen" (Should require low alpha/strict match).

* Query 2: "My head hurts" (Should require high alpha/semantic match).

Example Output:

Query: "Ibuprofen"
Strict (0.2): Ibuprofen (Score 0.98)
Balanced (0.5): Ibuprofen (Score 0.90)
Loose (0.8): Aspirin (Score 0.85) <-- Note: Semantic might confuse similar drugs!

Query: "My head hurts"
Strict (0.2): [No good match]
Balanced (0.5): Aspirin (relieves headaches)
Loose (0.8): Aspirin (relieves headaches)

Hint: Pay attention to your normalization function. If BM25 returns all zeros (no keyword match), make sure your code handles that gracefully without crashing.

What You Learned

Today you tackled the "Vocabulary Mismatch Problem." You learned that vector search isn't a silver bullet—it struggles with exact terms, IDs, and jargon.

* BM25: Scores exact keyword matches using term frequency, gives rare terms more weight than common ones, and adjusts for document length.

* Normalization: You cannot compare apples (keyword scores) to oranges (cosine similarity) without scaling them to the same range (0-1).

* Hybrid Search: Combining the precision of keywords with the understanding of vectors gives you the best of both worlds.

Why This Matters:

In a real RAG system (Retrieval Augmented Generation), retrieval is the most important step. If you don't find the right document, the LLM cannot answer the question. Hybrid search is the industry standard for ensuring you catch both the "vibe" and the "facts."

Tomorrow: We will look at Re-ranking. Sometimes even hybrid search returns a top 10 list that isn't quite in the right order. We will learn how to use a specialized AI model to act as a "judge" and re-sort the final list for perfect accuracy.
