Day 64 of 80

Fully Local RAG

Phase 7: Advanced Techniques

What You'll Build Today

Welcome to Day 64! Today marks a significant turning point in your journey. Up until now, almost every AI application we have built relied on a "brain" in the cloud. We sent data to OpenAI, Anthropic, or Cohere, and waited for them to send an answer back.

Today, we cut the cord.

You are going to build a Fully Local RAG System. This means every part of the intelligence stack—creating the numerical representations of text (embeddings), storing them (vector database), and generating the answer (LLM)—will run entirely on your own laptop.

Here is what you will master today:

* Local Embeddings: You will use sentence-transformers to turn text into numbers on your own CPU/GPU, ensuring no data leaves your machine during the indexing phase.

* Local Vector Storage: You will use ChromaDB running locally to store and retrieve these numbers, removing the need for cloud databases like Pinecone.

* Local Inference: You will connect to Ollama to generate human-like responses without an API key or an internet connection.

* Privacy-First Architecture: You will understand how to architect systems for highly regulated industries (healthcare, finance) where data privacy is non-negotiable.

Let's sever the internet connection and see what your computer can really do.

---

The Problem

Imagine you have just been hired by a law firm or a hospital. They are incredibly excited about the potential of AI to summarize patient records or analyze legal contracts.

You confidently write a script using the tools we have used so far. It looks something like this:

```python
import os
from openai import OpenAI

# The standard way we've been doing things
client = OpenAI(api_key="sk-...")

sensitive_medical_record = """
Patient: John Doe
Diagnosis: [HIGHLY SENSITIVE CONDITION]
Treatment Plan: [CONFIDENTIAL DRUG TRIAL]
"""

# We want to embed this to search it later
response = client.embeddings.create(
    input=sensitive_medical_record,
    model="text-embedding-3-small"
)

print("Embedding received from cloud.")
```

You show this to the Chief Security Officer, and they immediately shut down your project.

Why?
  • Data Leakage: In the code above, sensitive_medical_record left the hospital's secure server and traveled across the public internet to OpenAI's servers. Even if OpenAI is secure, the data has left your control. For HIPAA or GDPR compliance, this is often a dealbreaker.
  • Vendor Lock-in: What if OpenAI raises prices? What if they deprecate the model you are using? You are entirely dependent on their infrastructure.
  • Latency & Connectivity: If the hospital's internet goes down during a storm, the doctors can't access the AI assistant. A critical tool cannot rely on a shaky Wi-Fi connection.

The pain here is distinct: you have the logic, but you cannot use the cloud. You need a way to replicate the entire intelligence pipeline inside your own firewall.

---

Let's Build It

We are going to rebuild the RAG pipeline using open-source tools that run locally.

Prerequisites:
  • You must have Ollama installed and running on your machine.
  • Pull a model to use (we will use llama3 for this tutorial). Run ollama pull llama3 in your terminal before starting.

Step 1: Install Local Libraries

We need a few specific Python libraries.

* sentence-transformers: The industry standard for running embedding models locally.

* chromadb: An open-source vector database that runs easily as a file on your computer.

* ollama: The Python library to talk to your local Ollama instance.

```bash
pip install sentence-transformers chromadb ollama
```

Step 2: Local Embeddings

First, let's solve the embedding problem. Instead of calling an API, we will download a small, efficient model called all-MiniLM-L6-v2. It's tiny (about 80MB) and runs fast on a standard CPU.

Note: The first time you run this, it will download the model weights from the internet. After that, it works completely offline.

```python
from sentence_transformers import SentenceTransformer

# Load the model locally
# This downloads the model to your machine once, then loads from disk
print("Loading embedding model...")
embed_model = SentenceTransformer('all-MiniLM-L6-v2')

text = "The cat sits outside"
embedding = embed_model.encode(text)

print("Model loaded successfully.")
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 numbers: {embedding[:5]}")
```

Why this matters: You just turned text into numbers without an API key. This happened entirely in your computer's RAM.
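If you want to see what those numbers are actually good for, here is a small optional check. It is a minimal sketch that reuses the embed_model we just loaded along with sentence-transformers' cosine-similarity helper; the example sentences are arbitrary, and your exact scores will vary.

```python
from sentence_transformers import util

# Semantically related sentences should land closer together in vector space
sentences = [
    "The cat sits outside",
    "A feline is resting in the garden",
    "Quarterly revenue grew by 12 percent",
]
vectors = embed_model.encode(sentences)

# Cosine similarity: values closer to 1.0 mean "more similar"
print("cat vs. feline: ", util.cos_sim(vectors[0], vectors[1]).item())
print("cat vs. revenue:", util.cos_sim(vectors[0], vectors[2]).item())
```

You should see a noticeably higher score for the first pair, and that closeness is exactly the property the retrieval step below relies on.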

Step 3: Local Vector Store (ChromaDB)

Now we need a place to store these embeddings. We will use ChromaDB. We will configure it to save data to a folder on your computer so it persists even if you close the script.

```python
import chromadb

# Initialize the client with a persistent path
# This creates a folder named 'local_rag_db' in your project directory
chroma_client = chromadb.PersistentClient(path="./local_rag_db")

# Create (or get) a collection
# A collection is like a table in SQL
collection = chroma_client.get_or_create_collection(name="private_documents")

print("ChromaDB initialized locally.")
```

Step 4: Ingesting Private Data

Let's add some "secret" data that we pretend cannot leave the building. We will manually embed the documents using our local model and then push them into Chroma.

Note: Chroma can actually use sentence-transformers automatically, but we are doing it manually here so you understand exactly how the pipeline works.

```python
# Our "Secret" documents
documents = [
    "Project Apollo: The launch code is 8842. Do not share.",
    "Project Zeus: The meeting is moved to Room 404.",
    "HR Policy: Lunch is free on Fridays only."
]

ids = ["doc1", "doc2", "doc3"]

# 1. Create embeddings locally
print("Embedding documents...")
embeddings = embed_model.encode(documents)

# 2. Add to ChromaDB
# We store the embedding AND the original text
collection.add(
    embeddings=embeddings,
    documents=documents,
    ids=ids
)

print(f"Added {len(documents)} confidential documents to local DB.")
```

Step 5: The Retrieval

Now, let's ask a question. We need to:

  • Embed the question using the same local model.
  • Ask Chroma to find the nearest neighbor.

```python
query = "What is the launch code for Apollo?"

# 1. Embed the query
query_embedding = embed_model.encode([query])

# 2. Query the database
results = collection.query(
    query_embeddings=query_embedding,
    n_results=1  # We just want the best match
)

retrieved_doc = results['documents'][0][0]

print(f"Query: {query}")
print(f"Retrieved Context: {retrieved_doc}")
```

Step 6: The Generation (Connecting Ollama)

Finally, we connect the retrieved context to our local LLM (Ollama). This replaces the call to GPT-4.

```python
import ollama


def local_rag_chat(question):
    print(f"\nProcessing: {question}")

    # 1. Retrieve
    q_embed = embed_model.encode([question])
    results = collection.query(
        query_embeddings=q_embed,
        n_results=1
    )

    # Check if we found anything
    if not results['documents'][0]:
        return "I couldn't find any information on that."

    context = results['documents'][0][0]
    print(f"Found context: {context}")

    # 2. Construct Prompt
    prompt = f"""
    You are a helpful assistant. Answer the question based ONLY on the provided context.

    Context: {context}

    Question: {question}
    """

    # 3. Generate with Ollama
    # Ensure you have run 'ollama pull llama3' in your terminal previously
    response = ollama.chat(model='llama3', messages=[
        {'role': 'user', 'content': prompt},
    ])

    return response['message']['content']


# Test it out
answer = local_rag_chat("What is the launch code?")
print("-" * 50)
print("FINAL ANSWER:")
print(answer)
```

Output:

You should see the system find the "Project Apollo" document, and then Llama 3 should tell you the code is 8842. If you unplug your internet cable right now and run this block again, it will still work perfectly.

---

Now You Try

You have a working local RAG pipeline. Now, let's expand its capabilities.

  • Change the Brain: The sentence-transformers library has many models. The one we used (all-MiniLM-L6-v2) is fast but small. Try switching the embedding model to all-mpnet-base-v2. This is a larger, more accurate model. Note how the download size and processing time change.

  • Add Metadata Filtering: In a real company, you might want to search only "HR" documents. (A sketch covering this exercise appears after this list.)

    * Clear your collection (chroma_client.delete_collection("private_documents")).

    * Re-add documents, but this time add a metadatas parameter: metadatas=[{"dept": "Engineering"}, {"dept": "Engineering"}, {"dept": "HR"}].

    * Modify the query to filter: collection.query(..., where={"dept": "HR"}).

  • The "I Don't Know" Guardrail: Currently, if the distance between the query and the document is huge (meaning they aren't related), the system still returns a document. (The same sketch below shows one way to add this check.)

    * Look at results['distances'] in the query output.

    * Add logic: If the distance is greater than a certain threshold (e.g., 1.5), do not send the context to Ollama. Instead, return "No relevant documents found locally."
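If you get stuck on the last two exercises, here is a minimal sketch of one possible approach. It reuses documents, ids, embed_model, and chroma_client from the steps above; the department labels come from the exercise, while the sample query and the 1.5 threshold are illustrative values you should tune against your own data.

```python
# Exercise 2: rebuild the collection with metadata attached
chroma_client.delete_collection("private_documents")
collection = chroma_client.get_or_create_collection(name="private_documents")

collection.add(
    embeddings=embed_model.encode(documents),
    documents=documents,
    metadatas=[{"dept": "Engineering"}, {"dept": "Engineering"}, {"dept": "HR"}],
    ids=ids
)

# Exercise 3: filter by department and refuse weak matches
query = "When is lunch free?"
results = collection.query(
    query_embeddings=embed_model.encode([query]),
    n_results=1,
    where={"dept": "HR"}
)

DISTANCE_THRESHOLD = 1.5  # illustrative cutoff; lower means stricter matching
if not results["documents"][0] or results["distances"][0][0] > DISTANCE_THRESHOLD:
    print("No relevant documents found locally.")
else:
    print(f"Context to send to Ollama: {results['documents'][0][0]}")
```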

---

Challenge Project: The "Local vs. Cloud" Showdown

Your stakeholders are skeptical. They think local AI is "too dumb" compared to OpenAI. You need to create a comparison tool to prove the tradeoffs.

Requirements:
  • Create a list of 3 complex questions based on a text file you provide (e.g., paste a Wikipedia article into a text file).
  • Write a script that answers these 3 questions using Local RAG (Ollama + Chroma).
  • Write a script that answers the same 3 questions using Cloud RAG (OpenAI + Chroma).
  • Measure and print the Time Taken for each.
  • Save the answers side-by-side in a CSV file or print a formatted table.
Example Output:

```
QUESTION 1: Who founded the company?
------------------------------------------------
LOCAL (Llama3):
Answer: The company was founded by...
Time: 4.2 seconds

CLOUD (GPT-4o):
Answer: According to the documents, the founders were...
Time: 1.1 seconds
------------------------------------------------
```

Hints:

* Use the time library (import time, start = time.time(), etc.) to measure duration. A small timing sketch follows below.

* You can reuse the same Chroma collection for both if you use the *same* embedding model. If you use OpenAI embeddings for the cloud version, you need a separate collection! (This is a common trap: you cannot search OpenAI embeddings using a local embedding model.)
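Here is a minimal sketch of the timing and CSV part, under a couple of assumptions: it reuses local_rag_chat from Step 6, the cloud side is left as a placeholder for a cloud_rag_chat function you would write yourself for the OpenAI version, and the sample questions stand in for ones about your own text file.

```python
import csv
import time

questions = [
    "Who founded the company?",
    "When was the company founded?",
    "What is its main product?",
]  # placeholders; replace with questions about your own document

rows = []
for q in questions:
    start = time.time()
    local_answer = local_rag_chat(q)  # from Step 6
    local_time = round(time.time() - start, 2)

    start = time.time()
    cloud_answer = "(run your cloud pipeline here)"  # e.g. a cloud_rag_chat(q) function you write
    cloud_time = round(time.time() - start, 2)

    rows.append({
        "question": q,
        "local_answer": local_answer,
        "local_seconds": local_time,
        "cloud_answer": cloud_answer,
        "cloud_seconds": cloud_time,
    })

# Save the side-by-side comparison for your stakeholders
with open("showdown.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

print("Wrote showdown.csv")
```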

---

What You Learned

Today you built a fortress. You learned how to decouple your AI applications from the internet.

* Sentence Transformers: You learned to generate embeddings on your own hardware using encode().

* ChromaDB Persistence: You learned how to save vector data to disk so it survives a restart.

* Ollama Integration: You learned how to pipe retrieved context into a locally running LLM.

* Data Sovereignty: You now have a solution for clients who say "My data cannot leave the building."

Why This Matters:

In the enterprise world, "cool" isn't enough. "Compliant" is what gets contracts signed. By offering a fully local option, you open doors to government, defense, healthcare, and banking sectors that are closed to pure-cloud developers.

Tomorrow:

Now that we have systems running locally and in the cloud, how do we know what they are actually doing? Tomorrow we dive into Observability. We will learn how to trace the "thought process" of your chains and debug complex RAG failures. See you then!