Day 41 of 80

Building Chat with PDF

Phase 5: RAG Systems

What You'll Build Today

Today is a milestone day. If you ask anyone in the industry what the most requested feature for Generative AI is, the answer is almost always the same: "I want to chat with my own documents."

You have learned the individual pieces over the last few days: embeddings, vector databases, and basic retrieval. Today, we glue them all together into a complete, end-to-end application.

You are going to build a "Chat with PDF" engine. You will write a program that takes a PDF file (like a user manual, a legal contract, or a research paper), reads it, remembers it, and allows you to ask questions about specific details within it.

Here is what you will master today:

* Document Loading: How to convert binary PDF files into raw text that Python can understand.

* Text Splitting: Why you cannot just feed a whole book to an AI, and how to break it into smart "chunks."

* Metadata Management: Keeping track of page numbers so your AI can cite its sources (e.g., "Found on page 12").

* The RAG Chain: Linking the retrieval of data to the generation of answers in a seamless loop.

This is the foundation of almost every enterprise AI tool currently on the market. Let's build it.

The Problem

Imagine you have a 50-page PDF manual for a new refrigerator. You want to know how to change the water filter.

Your current knowledge of OpenAI's API suggests you should just read the text and send it to the model. You might try writing code like this:

# WARNING: This code will fail for large documents
import PyPDF2
from openai import OpenAI

client = OpenAI()

# 1. Read the ENTIRE PDF
pdf_text = ""
with open("fridge_manual.pdf", "rb") as file:
    reader = PyPDF2.PdfReader(file)
    for page in reader.pages:
        pdf_text += page.extract_text()

# 2. Try to send ALL text to the LLM
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Here is a manual: {pdf_text}. \n\n How do I change the filter?"}
    ]
)

The Pain:

When you run this with a real manual, you will likely hit a wall. You will see an error that looks like this:

Error: Request too large for model. This model's maximum context length is 8192 tokens. However, your messages resulted in 15000 tokens.

Why this happens:

  • Context Limits: LLMs have a limit on how much text they can process at once. You cannot fit a textbook into a text message.
  • Cost: Even if you could fit it in, paying to process 50 pages of text just to answer a one-sentence question is incredibly expensive and slow.
  • Confusion: Giving an LLM too much irrelevant information can cause it to "hallucinate" or get confused by similar details on different pages.

We need a way to send only the relevant paragraphs to the LLM, not the whole book. We need RAG (Retrieval-Augmented Generation).
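You can verify the problem yourself before spending any money. Here is a minimal sketch, assuming the tiktoken package is installed and that pdf_text is the string extracted in the failing example above:

# Minimal sketch: count tokens locally before sending anything to the API
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")   # tokenizer for the gpt-4 family
token_count = len(encoding.encode(pdf_text))      # pdf_text comes from the snippet above

print(f"The manual alone is {token_count} tokens.")
print("gpt-4's classic 8,192-token window cannot hold it plus your question.")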

    Let's Build It

    We will build a pipeline that processes the PDF once, saves it into a searchable format, and then retrieves only what is needed.

    Step 0: Setup and Imports

    We need a few specific libraries today.

    * langchain-openai: To talk to the model.

    * langchain-community: Contains the document loaders.

    * pypdf: To actually read the PDF files.

    * chromadb: Our vector database.

    Note: Ensure you have these installed via pip.
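If any of them are missing, a command along these lines should cover it (package names as of recent LangChain releases; adjust to your environment):

pip install langchain-openai langchain-community langchain-text-splitters langchain-chroma pypdf chromadb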
import os

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate

# Set your API Key
os.environ["OPENAI_API_KEY"] = "sk-..."  # Replace with your key

    Step 1: Loading the PDF

    First, we need to get the text out of the PDF. We won't just extract raw text strings; we will extract Document objects. A Document object in LangChain holds two things:

  • page_content: The actual text.
  • metadata: Info about the text (source filename, page number, etc.).

We will use a sample PDF. You can download any simple PDF (like a W-4 form or a small user manual) and name it sample.pdf.
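To make the shape of a Document concrete before we load the real file, here is a tiny hand-built one (purely illustrative; the loader below creates these for you):

from langchain_core.documents import Document

# A Document is just text plus metadata describing where that text came from
doc = Document(
    page_content="Replace the water filter every six months.",
    metadata={"source": "sample.pdf", "page": 11},
)

print(doc.page_content)        # the text itself
print(doc.metadata["page"])    # the page it came from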

# Load the PDF
# Make sure 'sample.pdf' exists in your folder
loader = PyPDFLoader("sample.pdf")

# This loads the file and splits it by page automatically
raw_documents = loader.load()

print(f"Loaded {len(raw_documents)} pages.")

print("--- Content of Page 1 ---")
print(raw_documents[0].page_content[:200])  # First 200 characters

print("--- Metadata of Page 1 ---")
print(raw_documents[0].metadata)

    Why this matters: Look at the metadata output. It likely says {'source': 'sample.pdf', 'page': 0}. This is how we will eventually tell the user where we found the answer.

    Step 2: Splitting into Chunks

    Even a single page of a PDF might be too much information, or contain multiple distinct topics. We want to split our documents into smaller "chunks."

    We use RecursiveCharacterTextSplitter. It tries to split by paragraphs first, then sentences, then words, ensuring we don't break a sentence in the middle.

# Initialize the splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # Target size of each chunk
    chunk_overlap=200,   # Overlap between chunks to maintain context
    add_start_index=True
)

# Split the documents
chunks = text_splitter.split_documents(raw_documents)

print(f"Original pages: {len(raw_documents)}")
print(f"Total chunks created: {len(chunks)}")

print("--- Example Chunk ---")
print(chunks[0].page_content)

    Why Overlap? If a sentence starts at the end of Chunk 1 and finishes at the start of Chunk 2, the meaning is lost if we cut strictly. Overlap ensures the full thought is captured in at least one chunk.
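If you want to see the splitter and its overlap in action, a quick toy experiment with deliberately tiny settings makes it obvious (the text and sizes here are arbitrary):

# Toy demo: tiny chunk sizes so the overlap is easy to spot
demo_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=20)

demo_text = "The filter sits behind the top shelf. Twist it counterclockwise to remove it. Insert the new one and twist clockwise."
for i, piece in enumerate(demo_splitter.split_text(demo_text)):
    print(f"Chunk {i}: {piece}")

You should see the tail of each chunk repeated at the start of the next one.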

    Step 3: Indexing (Embeddings & Vector Store)

    Now we turn these text chunks into numbers (embeddings) and store them in Chroma. This creates a searchable database of your PDF.

# Initialize the embedding model
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")

# Create the Vector Database
# This will embed all chunks and store them in memory
print("Creating vector database... this may take a moment.")

vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings_model
)

print("Database created!")

    Step 4: Retrieval and Generation

    This is where the magic happens. We will write a function that:

  • Takes a user question.
  • Searches the database for the top 3 relevant chunks.
  • Sends those chunks + the question to the model (gpt-4o-mini in the code below).
  • Prints the answer AND the citations.
def ask_pdf(question, db):
    # 1. Retrieve relevant chunks
    # k=3 means "give me the 3 most similar chunks"
    retriever = db.as_retriever(search_kwargs={"k": 3})
    docs = retriever.invoke(question)

    # 2. Construct the context string
    # We join the content of the retrieved docs
    context_text = "\n\n".join([doc.page_content for doc in docs])

    # 3. Define the prompt
    # We explicitly tell the AI to ONLY use the provided context
    prompt_template = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful assistant. Answer the question based ONLY on the following context. If the answer is not in the context, say 'I don't know'."),
        ("user", "Context:\n{context}\n\nQuestion: {question}")
    ])

    # 4. Prepare the model
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    # 5. Format the prompt with actual data
    prompt_value = prompt_template.invoke({
        "context": context_text,
        "question": question
    })

    # 6. Get the answer
    response = llm.invoke(prompt_value)
    return response.content, docs


# Let's test it!
user_question = "What is the main topic of this document?"
answer, source_docs = ask_pdf(user_question, vector_db)

print(f"QUESTION: {user_question}")
print(f"ANSWER: {answer}")

print("\n--- SOURCES ---")
for doc in source_docs:
    print(f"Page {doc.metadata.get('page', 'Unknown')}: {doc.page_content[:50]}...")

    Why this works

    We didn't send the whole PDF to GPT. We searched our database for the 3 paragraphs that looked most like the question, and we sent only those to GPT. GPT then acted as a summarizer and synthesizer of that specific information.
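As an aside, the same pipeline is often written as a single LangChain chain with the | operator. The sketch below is an optional restructuring of ask_pdf, reusing the vector_db and imports from earlier; it adds no new capability:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retriever = vector_db.as_retriever(search_kwargs={"k": 3})

chain_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer the question based ONLY on the following context. If the answer is not in the context, say 'I don't know'."),
    ("user", "Context:\n{context}\n\nQuestion: {question}")
])

def format_docs(docs):
    # Same as step 2 above: join the retrieved chunks into one context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | chain_prompt
    | ChatOpenAI(model="gpt-4o-mini", temperature=0)
    | StrOutputParser()
)

print(rag_chain.invoke("What is the main topic of this document?"))

One trade-off: this compact form does not hand the source documents back to you, so the explicit function above remains the better fit when you need citations.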

    Now You Try

    You have the core engine. Now, make it a usable tool.

  • The Loop: Wrap the "Ask" section in a while True: loop so you can keep asking questions without reloading the PDF every time (a minimal skeleton is sketched after this list).
  • Better Citations: Modify the print statement for sources to print the filename and page number clearly (e.g., [Source: sample.pdf - Page 5]).
  • The "I don't know" Test: Ask a question that is definitely NOT in the PDF (e.g., "Who won the World Series in 1998?"). Because of our system prompt ("Answer based ONLY on..."), the model should refuse to answer. Verify this works.
Challenge Project: Multi-PDF Librarian

    Your main project handled one PDF. In the real world, we often have folders of documents.

    The Goal: Create a script that loads two different PDFs (e.g., a recipe for cake and a recipe for soup), indexes both, and answers questions while telling you which file the answer came from. Requirements:

    * Place two different PDFs in your folder.

    * Use os.listdir or a list to iterate through filenames.

* Load chunks from both files into a single list called all_chunks.

    * Create one Vector Database from all_chunks.

    * When printing sources, the output must look like:

    > Answer: The oven should be heated to 350 degrees.

    > Source: cake_recipe.pdf (Page 1)

    Hint:

    The PyPDFLoader automatically adds the filename to metadata['source']. You don't need to do anything special to store it; you just need to make sure you read it when you print the sources at the end.
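If the loading loop is the part you get stuck on, something along these lines works, reusing the text_splitter and embeddings_model from the main project (the filenames are placeholders):

# Placeholder filenames -- swap in whatever PDFs you actually have
pdf_files = ["cake_recipe.pdf", "soup_recipe.pdf"]

all_chunks = []
for filename in pdf_files:
    pages = PyPDFLoader(filename).load()                  # each page keeps metadata['source'] = filename
    all_chunks.extend(text_splitter.split_documents(pages))

multi_db = Chroma.from_documents(documents=all_chunks, embedding=embeddings_model)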

    What You Learned

    Today you built a true RAG (Retrieval-Augmented Generation) system.

    * Ingestion: You learned that PDFs must be loaded and split, not read as one giant string.

    * Vector Storage: You used Chroma to store the "meaning" of your document chunks.

* Context Injection: You learned that we don't train the model on our data; we inject relevant data into the prompt at the moment we ask the question.

    * Citations: You learned that metadata is the key to trusting AI, allowing you to verify exactly where an answer came from.

    Why This Matters:

    This is the architecture used by legal AI bots, medical research assistants, and internal corporate knowledge bases. You aren't just scripting anymore; you are building intelligent systems that can process data they were never trained on.

    Tomorrow:

    Right now, we are finding data based on "similarity" (vectors). But what if the user asks a question that doesn't use the exact same keywords as the document? Tomorrow, we look at Advanced Retrieval Strategies to make your search engine much smarter.