Day 37 of 80

Document Loading & Parsing

Phase 5: RAG Systems

What You'll Build Today

Welcome to Day 37! Up until now, we’ve been sending short strings of text to LLMs. "Write a poem," or "Summarize this paragraph." But the real power of AI comes when you can let it read your company's handbook, a 50-page legal contract, or a technical manual.

To do that, you first need to get the text out of those files.

Today, we are building a Universal Document Loader. It will be a Python tool that accepts a file path—whether it’s a simple text file, a messy PDF, or a website saved as HTML—and returns clean, raw text that an AI can actually understand.

Here is what you will learn and why:

* File I/O (Input/Output): Because your AI can't read files if your Python script can't open them.

* PDF Parsing: Because PDFs are visual documents, not text documents, and extracting words from them is surprisingly difficult.

* HTML Cleaning: Because websites are full of code (JavaScript, CSS) that will confuse an LLM if you don't strip it out.

* Metadata Extraction: Because later on, when your AI answers a question, you'll want it to tell you which file and what page number the answer came from.

Let's turn your code into a reading machine.

---

The Problem

Imagine you have a folder full of invoices and contracts saved as PDFs. You want to write a script to find every mention of "Total Due."

Your instinct might be to treat a PDF like a normal text file. In Python, we usually read files like this:

```python
# The Intuitive (But Wrong) Approach
try:
    # Trying to open a PDF like it's a .txt file
    with open("invoice.pdf", "r") as f:
        content = f.read()
        print(content)
except Exception as e:
    print(f"Error: {e}")
```

If you run this, one of two things happens:

* Crash: Python throws a `UnicodeDecodeError` because PDFs contain binary data (images, formatting instructions), not just plain text.

* Gibberish: If you force it to read as binary, you get output that looks like this: `%PDF-1.4 ... /Type /Catalog ...`

The Pain Point:

Digital documents are designed for humans to look at, not for computers to read.

* In a Word doc, a paragraph is a logical block of text.

* In a PDF, that same paragraph is just a set of instructions: "Put the letter 'H' at coordinate 10,20. Put 'e' at 10,25."

If you have a two-column PDF article, a naive computer reader will read straight across the page, merging column A into column B and creating nonsense sentences.

We cannot just "open" these files. We have to parse them. We need specialized tools to translate these visual formats into the simple text strings our LLM needs.
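To see the difference for yourself, you can peek at a file's first few raw bytes, its "magic number." This is a minimal sketch (the name `sniff_file_type` and the crude HTML check are my own illustration, not part of the loader we build in this lesson):

```python
def sniff_file_type(file_path):
    """Guess a file's type from its first few raw bytes (its 'magic number')."""
    with open(file_path, "rb") as f:  # "rb": read raw bytes, never decode as text
        header = f.read(5)
    if header.startswith(b"%PDF-"):
        return "pdf"    # every valid PDF starts with the bytes %PDF-
    if header.lstrip().startswith(b"<"):
        return "html"   # crude heuristic: markup usually opens with a tag
    return "text"       # fall back to treating it as plain text
```

This is why the gibberish above started with `%PDF-1.4`: those are literally the first bytes of the file.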

---

Let's Build It

We are going to build a system that detects the file type and uses the correct tool to extract the text.

Prerequisites

We need to install two external libraries: one for PDFs and one for HTML.

* pypdf: A popular library for reading PDF files.

* beautifulsoup4: The industry standard for cleaning up HTML/web content.

Run this in your terminal:

```bash
pip install pypdf beautifulsoup4
```

Step 1: Creating Dummy Data

Since I can't give you actual files through this chat, we will write a quick helper script to generate a sample .txt, .html, and .pdf file on your computer.

Run this code block once to set up your workspace.

```python
# Create a simple text file
with open("sample_notes.txt", "w", encoding="utf-8") as f:
    f.write("Project Alpha Meeting Notes.\nDate: 2023-10-01.\nStatus: On track.")

# Create a simple HTML file
html_content = """
<html>
<head>
    <title>Company Update</title>
    <style>h1 { color: navy; }</style>
</head>
<body>
    <h1>Q3 Financial Results</h1>
    <p>Revenue exceeded expectations.</p>
    <script>console.log("page loaded");</script>
</body>
</html>
"""
with open("sample_webpage.html", "w", encoding="utf-8") as f:
    f.write(html_content)

# Create a simple PDF (requires fpdf: pip install fpdf).
# If you don't want to install fpdf, just find any PDF on your computer
# and rename it 'sample_doc.pdf' for this exercise.
try:
    from fpdf import FPDF

    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Arial", size=12)
    pdf.cell(200, 10, txt="Confidential Project Specs", ln=1, align="C")
    pdf.cell(200, 10, txt="1. The product must be fast.", ln=2)
    pdf.cell(200, 10, txt="2. The product must be secure.", ln=2)
    pdf.output("sample_doc.pdf")
    print("Files created successfully!")
except ImportError:
    print("Please install fpdf (pip install fpdf) or provide your own 'sample_doc.pdf'")
```

Step 2: Handling Text Files

This is the baseline. Text files are simple, but we must handle encoding. If you open a file created on Windows on a Mac (or vice versa), you might get weird characters unless you specify utf-8.

```python
def load_text_file(file_path):
    """
    Reads a simple .txt file and returns the string.
    """
    try:
        # Always specify encoding="utf-8" to avoid special-character crashes
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    except Exception as e:
        return f"Error reading text file: {e}"

# Test it
print("--- Text File Output ---")
print(load_text_file("sample_notes.txt"))
```
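One refinement worth knowing: `encoding="utf-8"` will still crash on files saved in a legacy encoding like Windows-1252. A common defensive pattern, shown here as a sketch (the name `load_text_file_safe` is my own, not part of the lesson's final loader), is to try utf-8 first and fall back to latin-1, which can decode any byte sequence:

```python
def load_text_file_safe(file_path):
    """Try utf-8 first; fall back to latin-1, which accepts any byte."""
    for encoding in ("utf-8", "latin-1"):
        try:
            with open(file_path, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # try the next encoding
    return ""
```

latin-1 may mis-render a few characters, but it never raises, so you always get a string back instead of a crash.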

Step 3: Handling PDFs with pypdf

Here is where we solve the binary problem. We use PdfReader. A PDF is made of pages, so we have to loop through them and extract text page by page.

```python
from pypdf import PdfReader

def load_pdf_file(file_path):
    """
    Reads a PDF, loops through pages, and extracts text.
    """
    text_content = []
    try:
        reader = PdfReader(file_path)
        # Loop through every page in the PDF
        for i, page in enumerate(reader.pages):
            # Extract text from the page
            text = page.extract_text()
            # If extract_text() finds nothing, it returns None or an empty string
            if text:
                text_content.append(f"--- Page {i+1} ---\n{text}")
        return "\n".join(text_content)
    except Exception as e:
        return f"Error reading PDF: {e}"

# Test it
print("\n--- PDF File Output ---")
print(load_pdf_file("sample_doc.pdf"))
```

Note: You will see that pypdf does a decent job, but sometimes headers and footers get mixed into the main text. This is a common challenge in RAG systems.
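Once you have several pages, there is a simple heuristic for that header/footer problem, sketched below. Both the name `strip_repeated_lines` and the 80% threshold are my own illustrative assumptions, not a pypdf feature: running headers and footers repeat on most pages, so drop any line that does.

```python
from collections import Counter

def strip_repeated_lines(page_texts, threshold=0.8):
    """Remove lines appearing on >= threshold of pages (likely headers/footers)."""
    if len(page_texts) < 3:
        return page_texts  # too few pages to tell boilerplate from content
    counts = Counter()
    for page in page_texts:
        # Count each distinct line once per page
        for line in {l.strip() for l in page.splitlines() if l.strip()}:
            counts[line] += 1
    cutoff = threshold * len(page_texts)
    cleaned = []
    for page in page_texts:
        kept = [l for l in page.splitlines() if counts[l.strip()] < cutoff]
        cleaned.append("\n".join(kept))
    return cleaned
```

Real documents need more care (page numbers change on every page, for instance), but this captures the core idea: boilerplate repeats, content doesn't.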

Step 4: Handling HTML with BeautifulSoup

If you load an HTML file raw, you get the markup, tags and all. We don't want the tags. We want the text inside them. However, we also don't want the text inside `<script>` and `<style>` tags, because that is code, not content.