Fully Local RAG
What You'll Build Today
Welcome to Day 64! Today marks a significant turning point in your journey. Up until now, almost every AI application we have built relied on a "brain" in the cloud. We sent data to OpenAI, Anthropic, or Cohere, and waited for them to send an answer back.
Today, we cut the cord.
You are going to build a Fully Local RAG System. This means every part of the intelligence stack—creating the numerical representations of text (embeddings), storing them (vector database), and generating the answer (LLM)—will run entirely on your own laptop.
Here is what you will master today:
* Local Embeddings: You will use sentence-transformers to turn text into numbers on your own CPU/GPU, ensuring no data leaves your machine during the indexing phase.
* Local Vector Storage: You will use ChromaDB running locally to store and retrieve these numbers, removing the need for cloud databases like Pinecone.
* Local Inference: You will connect to Ollama to generate human-like responses without an API key or an internet connection.
* Privacy-First Architecture: You will understand how to architect systems for highly regulated industries (healthcare, finance) where data privacy is non-negotiable.
Let's sever the internet connection and see what your computer can really do.
---
The Problem
Imagine you have just been hired by a law firm or a hospital. They are incredibly excited about the potential of AI to summarize patient records or analyze legal contracts.
You confidently write a script using the tools we have used so far. It looks something like this:
```python
import os
from openai import OpenAI

# The standard way we've been doing things
client = OpenAI(api_key="sk-...")

sensitive_medical_record = """
Patient: John Doe
Diagnosis: [HIGHLY SENSITIVE CONDITION]
Treatment Plan: [CONFIDENTIAL DRUG TRIAL]
"""

# We want to embed this to search it later
response = client.embeddings.create(
    input=sensitive_medical_record,
    model="text-embedding-3-small"
)

print("Embedding received from cloud.")
```
You show this to the Chief Security Officer, and they immediately shut down your project.
Why? Because sensitive_medical_record left the hospital's secure server and traveled across the public internet to OpenAI's servers. Even if OpenAI is secure, the data has left your control. For HIPAA or GDPR compliance, that is often a dealbreaker.

The pain here is distinct: you have the logic, but you cannot use the cloud. You need a way to replicate the entire intelligence pipeline inside your own firewall.
---
Let's Build It
We are going to rebuild the RAG pipeline using open-source tools that run locally.
Prerequisites: You need Ollama installed on your machine and a model downloaded (we use llama3 for this tutorial). Run ollama pull llama3 in your terminal before starting.
Step 1: Install Local Libraries
We need a few specific Python libraries.
* sentence-transformers: The industry standard for running embedding models locally.
* chromadb: An open-source vector database that runs easily as a file on your computer.
* ollama: The Python library to talk to your local Ollama instance.
```bash
pip install sentence-transformers chromadb ollama
```
Step 2: Local Embeddings
First, let's solve the embedding problem. Instead of calling an API, we will download a small, efficient model called all-MiniLM-L6-v2. It's tiny (about 80MB) and runs fast on a standard CPU.
Note: The first time you run this, it will download the model weights from the internet. After that, it works completely offline.
```python
from sentence_transformers import SentenceTransformer

# Load the model locally
# This downloads the model to your machine once, then loads from disk
print("Loading embedding model...")
embed_model = SentenceTransformer('all-MiniLM-L6-v2')

text = "The cat sits outside"
embedding = embed_model.encode(text)

print("Model loaded successfully.")
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 numbers: {embedding[:5]}")
```
Why this matters: You just turned text into numbers without an API key. This happened entirely in your computer's RAM.
Step 3: Local Vector Store (ChromaDB)
Now we need a place to store these embeddings. We will use ChromaDB. We will configure it to save data to a folder on your computer so it persists even if you close the script.
```python
import chromadb

# Initialize the client with a persistent path
# This creates a folder named 'local_rag_db' in your project directory
chroma_client = chromadb.PersistentClient(path="./local_rag_db")

# Create (or get) a collection
# A collection is like a table in SQL
collection = chroma_client.get_or_create_collection(name="private_documents")

print("ChromaDB initialized locally.")
```
Step 4: Ingesting Private Data
Let's add some "secret" data that we pretend cannot leave the building. We will manually embed the documents using our local model and then push them into Chroma.
Note: Chroma can actually use sentence-transformers automatically, but we are doing it manually here so you understand exactly how the pipeline works.
# Our "Secret" documents
documents = [
"Project Apollo: The launch code is 8842. Do not share.",
"Project Zeus: The meeting is moved to Room 404.",
"HR Policy: Lunch is free on Fridays only."
]
ids = ["doc1", "doc2", "doc3"]
# 1. Create embeddings locally
print("Embedding documents...")
embeddings = embed_model.encode(documents)
# 2. Add to ChromaDB
# We store the embedding AND the original text
collection.add(
embeddings=embeddings,
documents=documents,
ids=ids
)
print(f"Added {len(documents)} confidential documents to local DB.")
Step 5: The Retrieval
Now, let's ask a question. We need to:
1. Embed the question using the same local model.
2. Ask Chroma to find the nearest neighbor.
query = "What is the launch code for Apollo?"
# 1. Embed the query
query_embedding = embed_model.encode([query])
# 2. Query the database
results = collection.query(
query_embeddings=query_embedding,
n_results=1 # We just want the best match
)
retrieved_doc = results['documents'][0][0]
print(f"Query: {query}")
print(f"Retrieved Context: {retrieved_doc}")
Step 6: The Generation (Connecting Ollama)
Finally, we connect the retrieved context to our local LLM (Ollama). This replaces the call to GPT-4.
```python
import ollama

def local_rag_chat(question):
    print(f"\nProcessing: {question}")

    # 1. Retrieve
    q_embed = embed_model.encode([question])
    results = collection.query(
        query_embeddings=q_embed,
        n_results=1
    )

    # Check if we found anything
    if not results['documents'][0]:
        return "I couldn't find any information on that."

    context = results['documents'][0][0]
    print(f"Found context: {context}")

    # 2. Construct the prompt
    prompt = f"""
    You are a helpful assistant. Answer the question based ONLY on the provided context.
    Context: {context}
    Question: {question}
    """

    # 3. Generate with Ollama
    # Ensure you have run 'ollama pull llama3' in your terminal previously
    response = ollama.chat(model='llama3', messages=[
        {'role': 'user', 'content': prompt},
    ])
    return response['message']['content']

# Test it out
answer = local_rag_chat("What is the launch code?")
print("-" * 50)
print("FINAL ANSWER:")
print(answer)
```
Output:
You should see the system find the "Project Apollo" document and then Llama 3 should tell you the code is 8842. If you unplug your internet cable right now and run this block again, it will still work perfectly.
---
Now You Try
You have a working local RAG pipeline. Now, let's expand its capabilities.
Change the Brain:
The sentence-transformers library has many models. The one we used (all-MiniLM-L6-v2) is fast but small. Try switching the embedding model to all-mpnet-base-v2. This is a larger, more accurate model. Note how the download size and processing time change.
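A minimal sketch of the swap. One caution not covered above: all-mpnet-base-v2 produces 768-dimensional vectors while all-MiniLM-L6-v2 produces 384-dimensional ones, so the two models cannot share a collection and you will need to re-embed your documents into a fresh one.

```python
from sentence_transformers import SentenceTransformer

# Larger, more accurate model; the download is several hundred MB vs ~80MB
embed_model = SentenceTransformer('all-mpnet-base-v2')

# The vectors are now 768-dimensional, so create a new collection
# and re-run the ingestion from Step 4 before querying
print(len(embed_model.encode("dimension check")))
```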
Add Metadata Filtering:
In a real company, you might want to search only "HR" documents. (A sketch of the full flow follows this list.)
* Clear your collection (chroma_client.delete_collection("private_documents")).
* Re-add the documents, but this time add a metadatas parameter: metadatas=[{"dept": "Engineering"}, {"dept": "Engineering"}, {"dept": "HR"}].
* Modify the query to filter: collection.query(..., where={"dept": "HR"}).
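Here is a minimal sketch of the exercise, reusing documents, ids, and embed_model from Step 4 (the department labels are just for illustration):

```python
# Start fresh so the old, unlabeled entries don't interfere
chroma_client.delete_collection("private_documents")
collection = chroma_client.get_or_create_collection(name="private_documents")

# Re-add the documents with a department tag attached to each one
collection.add(
    embeddings=embed_model.encode(documents),
    documents=documents,
    ids=ids,
    metadatas=[{"dept": "Engineering"}, {"dept": "Engineering"}, {"dept": "HR"}]
)

# Only HR documents are searched, even if an Engineering doc is a closer match
results = collection.query(
    query_embeddings=embed_model.encode(["When is lunch free?"]),
    n_results=1,
    where={"dept": "HR"}
)
print(results['documents'][0][0])
```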
The "I Don't Know" Guardrail:
Currently, if the distance between the query and the document is huge (meaning they aren't related), the system still returns a document. (A sketch of the fix follows this list.)
* Look at results['distances'] in the query output.
* Add logic: if the distance is greater than a certain threshold (e.g., 1.5), do not send the context to Ollama. Instead, return "No relevant documents found locally."
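A possible sketch of that guardrail. The 1.5 cutoff is only a starting point; the right threshold depends on your embedding model and data, so inspect a few real distances before settling on a value.

```python
DISTANCE_THRESHOLD = 1.5  # assumption -- tune against your own queries

results = collection.query(
    query_embeddings=embed_model.encode(["What is the weather on Mars?"]),
    n_results=1
)

# Chroma returns distances alongside documents; smaller means more similar
distance = results['distances'][0][0]

if distance > DISTANCE_THRESHOLD:
    print("No relevant documents found locally.")
else:
    print(f"Found context: {results['documents'][0][0]}")
```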
---
Challenge Project: The "Local vs. Cloud" Showdown
Your stakeholders are skeptical. They think local AI is "too dumb" compared to OpenAI. You need to create a comparison tool to prove the tradeoffs.
Requirements:
1. Create a list of 3 complex questions based on a text file you provide (e.g., paste a Wikipedia article into a text file).
2. Write a script that answers these 3 questions using Local RAG (Ollama + Chroma).
3. Write a script that answers the same 3 questions using Cloud RAG (OpenAI + Chroma).
4. Measure and print the Time Taken for each.
5. Save the answers side-by-side in a CSV file or print a formatted table.
Example Output:
```
QUESTION 1: Who founded the company?
------------------------------------------------
LOCAL (Llama3):
Answer: The company was founded by...
Time: 4.2 seconds

CLOUD (GPT-4o):
Answer: According to the documents, the founders were...
Time: 1.1 seconds
------------------------------------------------
```
Hints:
* Use the time library (import time, start = time.time(), etc.) to measure duration. (A skeleton follows these hints.)
* You can reuse the same Chroma collection for both only if you use the *same* embedding model. If you use OpenAI embeddings for the cloud version, you need a separate collection! (This is a common trap: you cannot search OpenAI embeddings with a local embedding model.)
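A skeleton for the timing and CSV part of the challenge. local_rag_chat comes from Step 6; cloud_rag_chat is a hypothetical function you would write yourself with the same shape, but calling the OpenAI API instead of Ollama. Replace the placeholder questions with ones drawn from your own text file.

```python
import csv
import time

questions = [
    "Who founded the company?",        # placeholder questions -- use your own
    "What year was the key product launched?",
    "Who is the current CEO?",
]

rows = []
for q in questions:
    start = time.time()
    local_answer = local_rag_chat(q)    # local pipeline from Step 6
    local_seconds = time.time() - start

    start = time.time()
    cloud_answer = cloud_rag_chat(q)    # hypothetical cloud version you write
    cloud_seconds = time.time() - start

    rows.append([q, local_answer, f"{local_seconds:.1f}", cloud_answer, f"{cloud_seconds:.1f}"])

# Save the side-by-side comparison for your skeptical stakeholders
with open("showdown.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "local_answer", "local_sec", "cloud_answer", "cloud_sec"])
    writer.writerows(rows)
```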
---
What You Learned
Today you built a fortress. You learned how to decouple your AI applications from the internet.
* Sentence Transformers: You learned to generate embeddings on your own hardware using `encode()`.
* ChromaDB Persistence: You learned how to save vector data to disk so it survives a restart.
* Ollama Integration: You learned how to pipe retrieved context into a locally running LLM.
* Data Sovereignty: You now have a solution for clients who say "My data cannot leave the building."
Why This Matters: In the enterprise world, "cool" isn't enough. "Compliant" is what gets contracts signed. By offering a fully local option, you open doors to government, defense, healthcare, and banking sectors that are closed to pure-cloud developers.
Tomorrow: Now that we have systems running locally and in the cloud, how do we know what they are actually doing? Tomorrow we dive into Observability. We will learn how to trace the "thought process" of your chains and debug complex RAG failures. See you then!