Document Loading & Parsing
What You'll Build Today
Welcome to Day 37! Up until now, we’ve been sending short strings of text to LLMs. "Write a poem," or "Summarize this paragraph." But the real power of AI comes when you can let it read your company's handbook, a 50-page legal contract, or a technical manual.
To do that, you first need to get the text out of those files.
Today, we are building a Universal Document Loader. It will be a Python tool that accepts a file path—whether it’s a simple text file, a messy PDF, or a website saved as HTML—and returns clean, raw text that an AI can actually understand.
Here is what you will learn and why:
* File I/O (Input/Output): Because your AI can't read files if your Python script can't open them.
* PDF Parsing: Because PDFs are visual documents, not text documents, and extracting words from them is surprisingly difficult.
* HTML Cleaning: Because websites are full of code (JavaScript, CSS) that will confuse an LLM if you don't strip it out.
* Metadata Extraction: Because later on, when your AI answers a question, you'll want it to tell you which file and what *page number* the answer came from.

Let's turn your code into a reading machine.
---
The Problem
Imagine you have a folder full of invoices and contracts saved as PDFs. You want to write a script to find every mention of "Total Due."
Your instinct might be to treat a PDF like a normal text file. In Python, we usually read files like this:
```python
# The Intuitive (But Wrong) Approach
try:
    # Trying to open a PDF like it's a .txt file
    with open("invoice.pdf", "r") as f:
        content = f.read()
    print(content)
except Exception as e:
    print(f"Error: {e}")
```
If you run this, one of two things happens:

* A `UnicodeDecodeError`, because PDFs contain binary data (images, formatting instructions), not just plain text.
* A screenful of garbage like `%PDF-1.4 ... /Type /Catalog ...` instead of readable sentences.

Digital documents are designed for humans to look at, not for computers to read.
* In a Word doc, a paragraph is a logical block of text.
* In a PDF, that same paragraph is just a set of instructions: "Put the letter 'H' at coordinate 10,20. Put 'e' at 10,25."
If you have a two-column PDF article, a naive computer reader will read straight across the page, merging column A into column B, creating nonsense sentences.
We cannot just "open" these files. We have to parse them. We need specialized tools to translate these visual formats into the simple text string that our LLM needs.
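You can simulate the failure without any file at all. The byte string below is invented for illustration, but it mirrors what lives inside a real PDF: a readable `%PDF-` marker followed by raw bytes that are not valid UTF-8.

```python
# PDFs mix readable markers with raw binary bytes. This made-up byte string
# is representative: it is not valid UTF-8, so text-mode reading fails.
pdf_like_bytes = b"%PDF-1.4\n\x93\x8c\xff\xfe stream data \x00\x01"

try:
    pdf_like_bytes.decode("utf-8")
    print("Decoded fine (not what happens with real PDFs)")
except UnicodeDecodeError as e:
    print(f"Decoding failed: {e}")
```

This is exactly the error Python's `open(..., "r")` raises under the hood when it hits those bytes.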
---
Let's Build It
We are going to build a system that detects the file type and uses the correct tool to extract the text.
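The routing itself can be as simple as looking at the file extension. A minimal sketch (the function name and mapping are my own):

```python
import os

def detect_file_type(file_path):
    """Map a file extension to a loader category (crude but effective)."""
    ext = os.path.splitext(file_path)[1].lower()
    return {
        ".txt": "text",
        ".pdf": "pdf",
        ".html": "html",
        ".htm": "html",
    }.get(ext, "unknown")

print(detect_file_type("invoice.pdf"))  # pdf
```

Real systems sometimes sniff the file's leading bytes instead of trusting the extension, but for this exercise the extension is enough.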
Prerequisites
We need to install two external libraries: one for PDFs and one for HTML.
* pypdf: A popular library for reading PDF files.
* beautifulsoup4: The industry standard for cleaning up HTML/Web content.
Run this in your terminal:
```bash
pip install pypdf beautifulsoup4
```
Step 1: Creating Dummy Data
Since I can't give you actual files through this chat, we will write a quick helper script to generate a sample .txt, .html, and .pdf file on your computer.
Run this code block once to set up your workspace.
```python
# Create a simple text file
with open("sample_notes.txt", "w", encoding="utf-8") as f:
    f.write("Project Alpha Meeting Notes.\nDate: 2023-10-01.\nStatus: On track.")

# Create a simple HTML file
html_content = """
<html>
  <head><title>Company Update</title></head>
  <body>
    <h1>Q3 Financial Results</h1>
    <p>Revenue exceeded expectations.</p>
  </body>
</html>
"""
with open("sample_webpage.html", "w", encoding="utf-8") as f:
    f.write(html_content)

# Create a simple PDF (requires fpdf: pip install fpdf).
# If you don't want to install fpdf, just find any PDF on your computer
# and rename it 'sample_doc.pdf' for this exercise.
try:
    from fpdf import FPDF
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Arial", size=12)
    pdf.cell(200, 10, txt="Confidential Project Specs", ln=1, align="C")
    pdf.cell(200, 10, txt="1. The product must be fast.", ln=2)
    pdf.cell(200, 10, txt="2. The product must be secure.", ln=2)
    pdf.output("sample_doc.pdf")
    print("Files created successfully!")
except ImportError:
    print("Please install fpdf (pip install fpdf) or provide your own 'sample_doc.pdf'")
```
Step 2: Handling Text Files
This is the baseline. Text files are simple, but we must handle encoding. If you open a file created on Windows on a Mac (or vice-versa), you might get weird characters unless you specify `utf-8`.
```python
def load_text_file(file_path):
    """
    Reads a simple .txt file and returns the string.
    """
    try:
        # Always specify encoding='utf-8' to avoid special-character crashes
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    except Exception as e:
        return f"Error reading text file: {e}"

# Test it
print("--- Text File Output ---")
print(load_text_file("sample_notes.txt"))
```
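One hedged extension (the function name is mine): if strict UTF-8 decoding fails, retry with Latin-1, which maps every byte to *some* character, so you get text back instead of a crash, possibly with a few wrong accents.

```python
def load_text_file_forgiving(file_path):
    # Try strict UTF-8 first; fall back to Latin-1, which never fails
    # to decode (every possible byte maps to some character).
    for encoding in ("utf-8", "latin-1"):
        try:
            with open(file_path, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # try the next encoding
    return ""  # unreachable in practice: latin-1 always succeeds
```

This trade-off (never crash, occasionally mangle a character) is usually the right one when bulk-loading documents for an LLM.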
Step 3: Handling PDFs with pypdf
Here is where we solve the binary problem. We use `PdfReader`. A PDF is made of pages, so we have to loop through them and extract text page by page.
```python
from pypdf import PdfReader

def load_pdf_file(file_path):
    """
    Reads a PDF, loops through pages, and extracts text.
    """
    text_content = []
    try:
        reader = PdfReader(file_path)
        # Loop through every page in the PDF
        for i, page in enumerate(reader.pages):
            # Extract text from the page
            text = page.extract_text()
            # If extract_text() finds nothing, it returns None or an empty string
            if text:
                text_content.append(f"--- Page {i+1} ---\n{text}")
        return "\n".join(text_content)
    except Exception as e:
        return f"Error reading PDF: {e}"

# Test it
print("\n--- PDF File Output ---")
print(load_pdf_file("sample_doc.pdf"))
```
Note: You will see that pypdf does a decent job, but sometimes headers and footers get mixed into the main text. This is a common challenge in RAG systems.
Step 4: Handling HTML with BeautifulSoup
If you load HTML raw, you get markup like `<h1>Q3 Financial Results</h1>` instead of just the text. We don't want the tags. We want the text inside them. However, we also don't want text inside