Project: Customer Support Bot (Day 1)
---
What You'll Build Today
Welcome to Day 49! Today marks the start of a two-day capstone project for our RAG (Retrieval-Augmented Generation) module. You are going to build the brain of an intelligent Customer Support Bot.
Imagine a support team that is drowning in tickets. Users keep asking, "How do I reset my password?" or "What is your refund policy?" over and over. The answers are in the documentation, but users don't read manuals. They want answers now.
Today, you will build the Ingestion Pipeline. This is the backend system that reads your company's documentation, understands what it means, and organizes it so an AI can find the exact answer in milliseconds.
Here is what you will master today:
* Document Loading: How to programmatically read multiple files from a folder so you don't have to copy-paste text manually.
* Recursive Chunking: How to break large documents into bite-sized pieces that fit inside an AI's context window without cutting off sentences in the middle.
* Vector Storage: How to save these chunks into a database (ChromaDB) that understands meaning, not just keywords.
By the end of this lesson, you will have a searchable database of knowledge ready to be plugged into a chat interface tomorrow.
---
The Problem
Before we build the smart solution, let's look at how most beginners try to build a support bot: the "If/Else" Nightmare. Imagine you are trying to write a bot to answer questions about a coffee machine.
```python
def get_support_answer(user_question):
    # Normalize the input
    q = user_question.lower()

    if "turn on" in q:
        return "Press the big green button on the side."
    elif "coffee is cold" in q:
        return "Check if the warming plate is active."
    elif "cleaning" in q:
        return "Run the descaling cycle with vinegar."
    elif "error 404" in q:
        return "Unplug the machine and plug it back in."
    else:
        return "I don't understand. Please contact a human."

# Let's test it
print(get_support_answer("How do I turn on the machine?"))
# Output: Press the big green button on the side. (Works!)

print(get_support_answer("Where is the power switch?"))
# Output: I don't understand. Please contact a human. (FAIL!)
```
The Pain:
* Fragile: The user asked about the "power switch," which means "turn on," but your code didn't catch it because the specific words "turn on" weren't used.
* Unscalable: Imagine writing if/elif statements for a 50-page user manual. You would need thousands of lines of code.
* Maintenance Hell: Every time the product updates, you have to rewrite the code.
There has to be a way to let the computer understand that "power switch" and "turn on" are semantically related, without us explicitly programming it. That is what we are building today.
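To get a feel for what "semantically related" looks like in code, here is a minimal sketch. It assumes the langchain-openai package is installed and OPENAI_API_KEY is set in your environment (we set both up properly in the steps below); the exact similarity number will vary.
```python
# A minimal sketch: embeddings turn text into vectors, so "related" becomes "close together".
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
v1 = embeddings.embed_query("How do I turn on the machine?")
v2 = embeddings.embed_query("Where is the power switch?")

# Cosine similarity: values closer to 1.0 mean the phrases are semantically closer
dot = sum(a * b for a, b in zip(v1, v2))
norm1 = sum(a * a for a in v1) ** 0.5
norm2 = sum(b * b for b in v2) ** 0.5
print(f"Similarity: {dot / (norm1 * norm2):.3f}")
```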
---
Let's Build It
We are going to build a pipeline that takes raw text files and turns them into a searchable vector database.
Prerequisites
You will need a few libraries installed. In your terminal:
```bash
pip install langchain langchain-community langchain-openai langchain-chroma chromadb tiktoken
```
You will also need your OpenAI API key set in your environment variables.
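If you are not sure whether the key is actually visible to Python, a quick sanity check (just a sketch, run it from your project directory) looks like this:
```python
import os

# Confirm the API key is set before running any of the ingestion scripts
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set. Export it before running the pipeline.")
print("OpenAI API key found.")
```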
Step 1: Create the "Knowledge Base"
First, we need data. Since we don't have a real company manual handy, let's write a script to generate some "fake" documentation for a fictional product called the GadgetPro 3000.
Create a file named create_data.py and run it once. This will create a folder called support_docs with text files inside.
```python
import os

# Create a directory for our documents
os.makedirs("support_docs", exist_ok=True)

# Document 1: Getting Started
doc1_content = """
# GadgetPro 3000: Getting Started Guide
Unboxing
Your GadgetPro 3000 comes with the main unit, a USB-C charging cable, and a protective case.
Please ensure all items are present before discarding the box.
Powering On
To turn on the device, press and hold the silver button on the right side for 3 seconds.
The LED indicator will flash blue when the device is ready.
Initial Setup
Download the GadgetPro app from the App Store. Enable Bluetooth on your phone.
Open the app and follow the on-screen pairing instructions.
"""

# Document 2: Troubleshooting
doc2_content = """
# GadgetPro 3000: Troubleshooting & FAQ
Device Won't Charge
Check if the USB-C cable is firmly connected.
Ensure you are using a 5V/2A power adapter.
If the LED blinks red, the battery is critically low. Leave it plugged in for 30 minutes.
Connection Issues
If the device disconnects frequently, try resetting the network settings.
Hold the volume up and power button simultaneously for 10 seconds.
Refund Policy
We offer a 30-day money-back guarantee. If you are not satisfied, contact support@gadgetpro.com.
Items must be returned in original packaging.
"""

# Write files to disk
with open("support_docs/getting_started.txt", "w") as f:
    f.write(doc1_content)

with open("support_docs/troubleshooting.txt", "w") as f:
    f.write(doc2_content)

print("Success: created 'support_docs' folder with 2 text files.")
```
Run this code. You should see a new folder in your project directory.
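If you prefer to verify from Python rather than your file explorer, a quick listing (a sketch) works too:
```python
import os

# List what ended up in the new folder; you should see both .txt files
print(os.listdir("support_docs"))
```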
Step 2: Loading the Documents
Now we start the actual application. We need to load all text files from that directory. We will use LangChain's DirectoryLoader.
Create a new file named ingest.py.
```python
import os
from langchain_community.document_loaders import DirectoryLoader, TextLoader

def load_documents():
    print("Loading documents from ./support_docs...")

    # Initialize the loader to read all .txt files in the folder
    loader = DirectoryLoader(
        './support_docs',
        glob="*.txt",
        loader_cls=TextLoader
    )

    # Load the data
    docs = loader.load()
    print(f"Successfully loaded {len(docs)} documents.")
    return docs

# Run the function to test
if __name__ == "__main__":
    raw_documents = load_documents()

    # Print a preview of the first document to verify
    print("\n--- Preview of first document ---")
    print(raw_documents[0].page_content[:200])  # First 200 chars
```
Why this matters: This abstracts away the complexity of opening files, reading lines, and closing files. It handles the I/O (Input/Output) heavy lifting.
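For a sense of what the loader is doing on your behalf, here is a rough plain-Python equivalent (a sketch only, no LangChain): find the files, read them, and remember where each piece of text came from.
```python
from pathlib import Path

# Roughly what DirectoryLoader + TextLoader handle for us
manual_docs = []
for path in Path("./support_docs").glob("*.txt"):
    manual_docs.append({"source": str(path), "content": path.read_text(encoding="utf-8")})

print(f"Manually loaded {len(manual_docs)} files.")
```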
Step 3: Chunking (The Meat Grinder)
If we feed a 50-page manual into an AI all at once, two things happen:
* It costs a lot of money.
* The AI gets confused by too much information.
We need to split the text into smaller, overlapping chunks. We use RecursiveCharacterTextSplitter. This is "smart" splitting: it tries to split by paragraphs first, then sentences, then words, so it keeps related ideas together.
Add this function to ingest.py:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_documents(docs):
    print("\nSplitting documents into chunks...")

    # Create the splitter
    # chunk_size=500: each chunk will be roughly 500 characters
    # chunk_overlap=50: the last 50 chars of chunk 1 will be the first 50 of chunk 2.
    # This overlap ensures we don't cut a sentence in half and lose context.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        add_start_index=True
    )

    chunks = text_splitter.split_documents(docs)
    print(f"Split {len(docs)} documents into {len(chunks)} chunks.")
    return chunks

# Update the main block to test
if __name__ == "__main__":
    raw_documents = load_documents()
    chunks = split_documents(raw_documents)

    print("\n--- Preview of a chunk ---")
    print(chunks[1].page_content)
    print(f"Metadata: {chunks[1].metadata}")
```
Why overlap matters: If a sentence says "To reset the device, hold the button..." and we cut the chunk right after "device,", the next chunk starts with "hold the button...". Without overlap, the second chunk wouldn't know what button holding achieves. Overlap bridges the gap.
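To see the overlap in action on a tiny string, here is a quick sketch; the sizes are deliberately small for illustration and the exact split points may vary.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Deliberately tiny chunks so the repeated words at the boundaries are visible
text = "To reset the device, hold the button on the back for ten seconds until the LED blinks twice."
splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=20)

for i, piece in enumerate(splitter.split_text(text)):
    print(f"Chunk {i}: {piece!r}")
```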
Step 4: Storing in the Vector Database
Now we convert these text chunks into Embeddings (lists of numbers representing meaning) and store them in ChromaDB.
Update ingest.py with the final imports and logic:
```python
import shutil  # Used to clear old database files
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Define where we want to save the database on disk
CHROMA_PATH = "chroma_db"

def save_to_chroma(chunks):
    # Clear out the database first to avoid duplicates if we run this script twice
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH)

    # Initialize the OpenAI Embedding function
    # This is the "translator" that turns text into vectors
    embedding_function = OpenAIEmbeddings(api_key="YOUR_OPENAI_API_KEY")

    print(f"\nSaving {len(chunks)} chunks to ChromaDB at {CHROMA_PATH}...")

    # Create the database from our documents
    db = Chroma.from_documents(
        chunks,
        embedding_function,
        persist_directory=CHROMA_PATH
    )
    print("Saved!")
    return db

if __name__ == "__main__":
    # 1. Load
    raw_documents = load_documents()
    # 2. Split
    chunks = split_documents(raw_documents)
    # 3. Store
    db = save_to_chroma(chunks)
```
Note: Replace "YOUR_OPENAI_API_KEY" with your actual key or use os.environ.get("OPENAI_API_KEY").
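The environment-variable version looks like this (a sketch; it keeps your key out of source control):
```python
import os
from langchain_openai import OpenAIEmbeddings

# Read the key from the environment instead of hard-coding it
embedding_function = OpenAIEmbeddings(api_key=os.environ.get("OPENAI_API_KEY"))
```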
Step 5: Verification (The "Does it work?" Test)
We aren't building the chat interface today, but we must verify the database works. We will perform a "Similarity Search." We will ask a question and see if the database returns the correct chunk of text.
Create a new file called query_test.py.
```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

CHROMA_PATH = "chroma_db"

def test_query():
    # Prepare the embedding function
    embedding_function = OpenAIEmbeddings(api_key="YOUR_OPENAI_API_KEY")

    # Load the existing database from disk
    db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)

    # The user's question
    query_text = "How do I get my money back?"
    print(f"Query: {query_text}")

    # Search the DB for the 3 most similar chunks
    results = db.similarity_search_with_score(query_text, k=3)

    print("\n--- Results ---")
    for doc, score in results:
        # Score: lower is better (distance score)
        print(f"\nScore: {score:.4f}")
        print(f"Source: {doc.metadata.get('source')}")
        print(f"Content: {doc.page_content}")

if __name__ == "__main__":
    test_query()
```
Run query_test.py.
Even though the document says "Refund Policy" and "money-back guarantee" while your query was "How do I get my money back?", the system should still find the correct text. It understands that the meaning of the request matches the meaning of the policy.
---
Now You Try
You have a working ingestion pipeline. Now, let's make it robust.
Experiment with Chunk Sizes:
In ingest.py, change chunk_size to 100 and chunk_overlap to 0. Re-run the ingestion, then run query_test.py. Does the result look different? (You should see very fragmented sentences.)
Add a New Document:
Create a new text file in support_docs called warranty.txt. Add some text about a "2-year limited warranty covering manufacturing defects." Re-run ingest.py and then query for "How long is the warranty?".
Metadata Filtering:
In query_test.py, modify the search to only look for answers inside the troubleshooting file.
Hint: You can pass a filter argument to similarity_search:
```python
db.similarity_search(query_text, k=3, filter={"source": "support_docs/troubleshooting.txt"})
```
---
Challenge Project: The Corporate Archive
Your manager has given you a folder containing simulated internal memos. Your job is to ingest them all and create a system that can identify which department a memo came from.
Requirements:
1. Create a Python script to generate 20 dummy text files. Half should start with "Department: HR" and the other half with "Department: Engineering".
2. Ingest all 20 files.
3. When you load the documents, you must programmatically add a metadata field called department based on the text content (e.g., if the file contains "Department: HR", set metadata['department'] = 'HR').
4. Save them to a new ChromaDB instance.
5. Write a query script that searches for "hiring process" but filters so it only looks at HR documents.
Hints:
* You can loop through the docs list after loading them but before splitting them.
* Access metadata via doc.metadata['new_field'] = 'value'.
* String manipulation is your friend: if "HR" in doc.page_content: ...
---
What You Learned
Today you moved away from hand-writing answer logic and started building systems.
* Ingestion: You learned that AI needs clean, structured data. You used DirectoryLoader to automate this.
* Chunking: You learned that context windows are limited. You used RecursiveCharacterTextSplitter to preserve meaning while reducing size.
* Vector Storage: You learned that to find answers fast, we convert text to numbers (embeddings) and store them in a specialized database (Chroma).
Why This Matters: This pipeline is the foundation of almost every enterprise AI application. Whether it's a legal bot searching case files, a medical bot searching patient history, or a coding assistant searching documentation, they all start with this exact ingestion process.
Tomorrow: We will take this database and connect it to an LLM (ChatModel). You will build the actual chat interface where the AI retrieves these chunks and formulates a polite, human-like answer for the customer. See you then!