Day 39 of 80

Vector Databases: ChromaDB

Phase 5: RAG Systems

What You'll Build Today

Today, we are building the "long-term memory" for your AI applications.

Up until now, every time you ran a script, your data lived in variables. When the script finished, the data vanished. Furthermore, if you wanted to find a specific piece of text based on meaning (semantic search), you would have to manually compare your query against every single document you have.

Today, you will build a Document Search Engine using ChromaDB.

Here is what you will master:

* Vector Databases: You will learn why we need a specialized database for AI, rather than using a standard spreadsheet or SQL database.

* ChromaDB Setup: You will learn how to initialize a local vector database that lives on your computer.

* Collections: You will learn how to organize data into logical groups (like folders), so you don't mix up recipes with financial reports.

* Metadata Filtering: You will learn how to tag your data (e.g., "Source: Email", "Date: 2023") so you can search for meaning *and* filter by specific criteria simultaneously.

* Persistence: You will learn how to save this database to your hard drive so your AI remembers facts even after you turn off your computer.

Let's give your AI a brain that lasts.

---

The Problem

Imagine you have a list of 10,000 sentences (chunks) from a massive training manual. You want to find the one sentence that answers the question: "How do I reset the safety valve?"

In the previous lessons, we learned that we can convert text into numbers (vectors/embeddings) and compare them using math (cosine similarity).

Here is how you would have to do that with standard Python code.

The "Painful" Approach:
  • Take your question and convert it to numbers.
  • Create a loop that goes through all 10,000 sentences.
  • For each sentence, calculate the math similarity score.
  • Sort the list of 10,000 scores.
  • Pick the top one.
  • Here is what that code looks like. You do not need to run this, just look at the complexity and imagine the speed issues.

    import numpy as np
    import time

    # 1. Simulate 10,000 documents (just random numbers for this example).
    # In reality, these would be embeddings from OpenAI.
    database_size = 10000
    vector_dim = 1536  # Size of an OpenAI embedding
    document_vectors = np.random.rand(database_size, vector_dim)

    # 2. Simulate a user query vector
    query_vector = np.random.rand(vector_dim)

    def find_closest_match(query, all_docs):
        scores = []
        # PAIN POINT: We must loop through EVERY single document
        for idx, doc_vec in enumerate(all_docs):
            # Calculate dot product (simplified similarity)
            score = np.dot(query, doc_vec)
            scores.append((score, idx))
        # Sort them to find the highest score
        scores.sort(key=lambda x: x[0], reverse=True)
        return scores[0]

    # Measure how long this takes
    start_time = time.time()
    best_match = find_closest_match(query_vector, document_vectors)
    end_time = time.time()

    print(f"Found match at index {best_match[1]} with score {best_match[0]:.4f}")
    print(f"Search took: {end_time - start_time:.4f} seconds")

Why this is a problem:
  • It's Slow: As your database grows to 1 million documents, this loop becomes incredibly slow.
  • It's Rigid: What if you only want to search documents from the year 2023? You would need to write complex if statements inside that loop.
  • It's Volatile: All these vectors are in memory. If your computer crashes, you have to re-calculate (and re-pay for) all 10,000 embeddings.

The Solution:

A Vector Database (like ChromaDB) does for vectors what Excel does for numbers. It indexes them. Instead of checking every single item, it uses smart algorithms (approximate nearest-neighbor indexes) to jump straight to the relevant neighborhood. It handles the storage, the searching, and the filtering for you.

---

Let's Build It

We will use ChromaDB. It is open-source, runs entirely on your machine (no API costs for the database itself), and is the industry standard for starting out.

Step 1: Installation and Setup

First, we need to install the library. Open your terminal/command prompt:

    pip install chromadb

Now, let's create a Python script. We will start by importing Chroma and setting up a "Client". The Client is the interface we use to talk to the database.

    import chromadb

    # Initialize the client
    # This sets up an in-memory database for now (good for testing)
    chroma_client = chromadb.Client()

    print("ChromaDB client initialized successfully!")

Step 2: Creating a Collection

In a standard SQL database, you have "tables." In a vector database, we have Collections. A collection is a bucket where you store related vectors. You might have one collection for "HR_Policy" and another for "IT_Support".

    # Create a collection named "my_knowledge_base"
    collection = chroma_client.create_collection(name="my_knowledge_base")

    print(f"Collection created: {collection.name}")

Step 3: Adding Documents

This is where the magic happens. We will add text documents to our collection.

Important: Usually, you need to turn text into numbers (embeddings) before saving it. However, ChromaDB has a default embedding model built in (a small, free open-source model). If you give it text, it will automatically handle the conversion to numbers for you.

We will add:

  • Documents: The actual text content.
  • Metadatas: Extra info (tags) about the text. This is crucial for filtering later.
  • IDs: A unique name for each chunk (like "doc1", "doc2").

    # Adding data to the collection
    collection.add(
        documents=[
            "The capital of France is Paris.",
            "To reset the router, hold the button for 10 seconds.",
            "The delicious pizza recipe calls for mozzarella and basil.",
            "Paris is known for the Eiffel Tower and great art museums."
        ],
        metadatas=[
            {"source": "geography_book", "page": 10},
            {"source": "tech_manual", "page": 55},
            {"source": "cookbook", "page": 2},
            {"source": "geography_book", "page": 12}
        ],
        ids=["id1", "id2", "id3", "id4"]
    )

    print("Documents added to the database!")

Note: The first time you run this, it might take a moment to download the default embedding model. This only happens once.

Step 4: Querying (Semantic Search)

Now, let's ask a question. We don't need to match keywords exactly. We are searching for meaning.

We will ask: "How do I fix my internet?" Notice that none of our documents contains the words "internet" or "fix", but one contains "reset the router".

    results = collection.query(
        query_texts=["How do I fix my internet?"],
        n_results=1  # We only want the top 1 match
    )

    print("\n--- Query Result ---")
    print("Question: How do I fix my internet?")
    print(f"Best Answer: {results['documents'][0][0]}")
    print(f"Metadata: {results['metadatas'][0][0]}")

Output:

    --- Query Result ---
    Question: How do I fix my internet?
    Best Answer: To reset the router, hold the button for 10 seconds.
    Metadata: {'source': 'tech_manual', 'page': 55}

The database understood that "fix internet" is semantically close to "reset router".
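If you are curious how strong that match actually is, you can ask Chroma to return similarity scores alongside the documents. This is an optional sketch; with the default settings, a lower distance means a closer semantic match (depending on your ChromaDB version, distances may already be included by default).

    # Optional: inspect how close each match is
    results = collection.query(
        query_texts=["How do I fix my internet?"],
        n_results=2,
        include=["documents", "distances"]
    )

    for doc, dist in zip(results["documents"][0], results["distances"][0]):
        print(f"Distance {dist:.4f}: {doc}")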

Step 5: Filtering with Metadata

Imagine you have millions of documents. Searching all of them is wasteful if you know you only care about "geography". This is where Metadata Filtering comes in.

We can use the where parameter to tell Chroma: "Only look at vectors that have this specific tag."

    # Search for "Paris" but ONLY within the "geography_book" source
    results_filtered = collection.query(
        query_texts=["Tell me about Paris"],
        n_results=1,
        where={"source": "geography_book"}  # This is the filter
    )

    print("\n--- Filtered Result ---")
    print(f"Found: {results_filtered['documents'][0][0]}")

If we tried to search for "pizza" with the filter where={"source": "geography_book"}, we would get zero results or a very poor match, because the pizza recipe is in the "cookbook" source.
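You can verify this yourself by flipping the filter. A quick sketch, assuming the same collection from Step 3:

    # "pizza" scoped to geography_book: only a poor match is possible
    poor = collection.query(
        query_texts=["pizza"],
        n_results=1,
        where={"source": "geography_book"}
    )
    print(f"Geography filter: {poor['documents'][0][0]}")

    # The same query scoped to the cookbook finds the recipe
    good = collection.query(
        query_texts=["pizza"],
        n_results=1,
        where={"source": "cookbook"}
    )
    print(f"Cookbook filter: {good['documents'][0][0]}")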

Step 6: Persistence (Saving to Disk)

So far, if you close your Python script, the data disappears. Let's fix that. We need to use a PersistentClient.

Change your initialization code to look like this:

    import chromadb
    import os

    # Create a folder for the database
    db_path = os.path.join(os.getcwd(), "my_chroma_db")

    # Initialize a persistent client
    persistent_client = chromadb.PersistentClient(path=db_path)

    # Get the collection (get_or_create allows us to run this script multiple times without error)
    collection = persistent_client.get_or_create_collection(name="persistent_knowledge")

    # Add data only if the collection is empty (to avoid adding duplicates every time we run)
    if collection.count() == 0:
        collection.add(
            documents=["The sun rises in the east.", "The moon orbits the earth."],
            ids=["fact1", "fact2"]
        )
        print("Data added to persistent database.")
    else:
        print("Database already contains data. Skipping add.")

    # Query
    results = collection.query(
        query_texts=["Where does the sun rise?"],
        n_results=1
    )

    print(f"Result: {results['documents'][0][0]}")

Run this script twice.

* Run 1: It will say "Data added..." and give the result.

* Run 2: It will say "Database already contains data..." and give the result. The data survived the restart!

---

Now You Try

You have the basics. Now extend the project with these three tasks (if you get stuck, a rough sketch of all three follows the list):

  • The Updater: Write a script that updates an existing document. Use collection.upsert (update or insert). Change the text of "id1" (from Step 3) to "The capital of France is Paris, which is a beautiful city." Run a query to confirm the text changed.

  • The Deleter: Write a script that deletes the pizza recipe. Use collection.delete(ids=["id3"]). Try to query for "pizza" afterwards and ensure it returns the next closest result (or nothing relevant), not the recipe.

  • The Multi-Filter: Add a new document: documents=["Another tech tip"], metadatas=[{"source": "tech_manual", "page": 60}], ids=["id5"]. Now, try to query for tech tips but filter for page numbers greater than 50. Hint: ChromaDB where filters can use operators like $gt (greater than). Example: where={"page": {"$gt": 50}}
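Here is one possible way to approach all three tasks, assuming the collection from Step 3 still exists in the same session. Treat it as a sketch to compare against, not the only correct answer.

    # 1. The Updater: upsert overwrites the document stored under "id1"
    collection.upsert(
        documents=["The capital of France is Paris, which is a beautiful city."],
        metadatas=[{"source": "geography_book", "page": 10}],  # keep the original tags
        ids=["id1"]
    )

    # 2. The Deleter: remove the pizza recipe by its ID...
    collection.delete(ids=["id3"])
    # ...and verify that "pizza" no longer returns the recipe
    check = collection.query(query_texts=["pizza"], n_results=1)
    print(f"Closest to 'pizza' now: {check['documents'][0][0]}")

    # 3. The Multi-Filter: add the new document first...
    collection.add(
        documents=["Another tech tip"],
        metadatas=[{"source": "tech_manual", "page": 60}],
        ids=["id5"]
    )
    # ...then run a semantic search restricted to pages above 50
    results = collection.query(
        query_texts=["tech tips"],
        n_results=2,
        where={"page": {"$gt": 50}}
    )
    print(results["documents"][0])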

---

Challenge Project: The "Work vs. Personal" Search System

Your goal is to create a system that manages two separate aspects of your life using two different collections.

Requirements:
  • Initialize a PersistentClient.
  • Create two collections: work_docs and personal_docs.
  • Add at least 3 dummy documents to work_docs (e.g., "Meeting notes", "Project deadlines").
  • Add at least 3 dummy documents to personal_docs (e.g., "Grocery list", "Birthday ideas").
  • Create a function search_memory(query, mode):
    * If mode is "work", search only the work collection.
    * If mode is "personal", search only the personal collection.
    * If mode is "all", search both and combine the results.

Example Input/Output:

    # User calls:
    search_memory("What do I need to buy?", mode="personal")
    # Output: "Milk and Eggs" (from personal_docs)

    # User calls:
    search_memory("When is the deadline?", mode="work")
    # Output: "Friday at 5PM" (from work_docs)

Hints:

* You cannot query two collections in a single line of code. For the "all" mode, you will need to query both collections separately, combine the lists of results in Python, and print them out. (A skeleton illustrating this follows the hints.)

* Remember to check if collections exist before creating them if you run the script multiple times (get_or_create_collection).
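If you want a starting point, here is a minimal skeleton for the routing and "all"-mode merging. The path and collection names simply follow the requirements above; the dummy data and printing are left to you.

    import chromadb

    client = chromadb.PersistentClient(path="life_db")
    work = client.get_or_create_collection(name="work_docs")
    personal = client.get_or_create_collection(name="personal_docs")

    def search_memory(query, mode="all", n_results=1):
        # Pick which collections to search based on the mode
        if mode == "work":
            targets = [work]
        elif mode == "personal":
            targets = [personal]
        else:  # "all": query each collection separately, then merge
            targets = [work, personal]

        combined = []
        for coll in targets:
            if coll.count() == 0:
                continue  # skip collections with no documents yet
            res = coll.query(query_texts=[query], n_results=n_results)
            combined.extend(res["documents"][0])
        return combined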

---

What You Learned

Today you moved from "variables in memory" to "persistent databases."

* Vector Databases store meanings, not just keywords.

* ChromaDB is a local, file-based vector database perfect for getting started.

* Collections act like tables or folders to organize data.

* Metadata allows you to filter search results (e.g., "only search PDFs").

* Persistence ensures your AI remembers facts between sessions.

Why This Matters:

This is the heart of RAG (Retrieval-Augmented Generation). In the real world, you cannot paste a 500-page PDF into ChatGPT; it's too big. Instead, you store the PDF in ChromaDB. When you ask a question, you find the one relevant page (using what you learned today) and send only that page to the LLM. You have just built the retrieval engine for that system.
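To make that concrete, here is a rough sketch of how the retrieval step plugs into an LLM call. The call_llm function is a hypothetical placeholder for whatever LLM client you use; the collection is assumed to already hold your document chunks.

    def answer_question(collection, question, call_llm):
        # 1. Retrieve the single most relevant chunk instead of sending everything
        results = collection.query(query_texts=[question], n_results=1)
        context = results["documents"][0][0]

        # 2. Send only that chunk, plus the question, to the LLM
        prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
        return call_llm(prompt)  # hypothetical: a wrapper around your API of choice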

Tomorrow:

We are taking this to the cloud. You will learn about Pinecone, a production-ready vector database that lives online, allowing you to share your AI's memory across different users and computers.