Document Loading & Parsing
What You'll Build Today
Welcome to Day 37! Up until now, we’ve been sending short strings of text to LLMs. "Write a poem," or "Summarize this paragraph." But the real power of AI comes when you can let it read your company's handbook, a 50-page legal contract, or a technical manual.
To do that, you first need to get the text out of those files.
Today, we are building a Universal Document Loader. It will be a Python tool that accepts a file path—whether it’s a simple text file, a messy PDF, or a website saved as HTML—and returns clean, raw text that an AI can actually understand.
Here is what you will learn and why:
* File I/O (Input/Output): Because your AI can't read files if your Python script can't open them.
* PDF Parsing: Because PDFs are visual documents, not text documents, and extracting words from them is surprisingly difficult.
* HTML Cleaning: Because websites are full of code (JavaScript, CSS) that will confuse an LLM if you don't strip it out.
* Metadata Extraction: Because later on, when your AI answers a question, you'll want it to tell you which file and what *page number* the answer came from.

Let's turn your code into a reading machine.
---
The Problem
Imagine you have a folder full of invoices and contracts saved as PDFs. You want to write a script to find every mention of "Total Due."
Your instinct might be to treat a PDF like a normal text file. In Python, we usually read files like this:
```python
# The Intuitive (But Wrong) Approach
try:
    # Trying to open a PDF like it's a .txt file
    with open("invoice.pdf", "r") as f:
        content = f.read()
    print(content)
except Exception as e:
    print(f"Error: {e}")
```
If you run this, one of two things happens:

* A `UnicodeDecodeError`, because PDFs contain binary data (images, formatting instructions), not just plain text.
* A screenful of garbage like `%PDF-1.4 ... /Type /Catalog ...` instead of readable sentences.

Digital documents are designed for humans to look at, not for computers to read.
* In a Word doc, a paragraph is a logical block of text.
* In a PDF, that same paragraph is just a set of instructions: "Put the letter 'H' at coordinate 10,20. Put 'e' at 10,25."
If you have a two-column PDF article, a naive computer reader will read straight across the page, merging column A into column B, creating nonsense sentences.
We cannot just "open" these files. We have to parse them. We need specialized tools to translate these visual formats into the simple text string that our LLM needs.
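You can simulate the failure without any file at all. The byte string below is invented for illustration, but it mirrors what lives inside a real PDF: a readable `%PDF-` marker followed by raw bytes that are not valid UTF-8.

```python
# PDFs mix readable markers with raw binary bytes. This made-up byte string
# is representative: it is not valid UTF-8, so text-mode reading fails.
pdf_like_bytes = b"%PDF-1.4\n\x93\x8c\xff\xfe stream data \x00\x01"

try:
    pdf_like_bytes.decode("utf-8")
    print("Decoded fine (not what happens with real PDFs)")
except UnicodeDecodeError as e:
    print(f"Decoding failed: {e}")
```

This is exactly the error Python's `open(..., "r")` raises under the hood when it hits those bytes.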
---
Let's Build It
We are going to build a system that detects the file type and uses the correct tool to extract the text.
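The routing itself can be as simple as looking at the file extension. A minimal sketch (the function name and mapping are my own):

```python
import os

def detect_file_type(file_path):
    """Map a file extension to a loader category (crude but effective)."""
    ext = os.path.splitext(file_path)[1].lower()
    return {
        ".txt": "text",
        ".pdf": "pdf",
        ".html": "html",
        ".htm": "html",
    }.get(ext, "unknown")

print(detect_file_type("invoice.pdf"))  # pdf
```

Real systems sometimes sniff the file's leading bytes instead of trusting the extension, but for this exercise the extension is enough.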
Prerequisites
We need to install two external libraries: one for PDFs and one for HTML.
* pypdf: A popular library for reading PDF files.
* beautifulsoup4: The industry standard for cleaning up HTML/Web content.
Run this in your terminal:
```bash
pip install pypdf beautifulsoup4
```
Step 1: Creating Dummy Data
Since I can't give you actual files through this chat, we will write a quick helper script to generate a sample .txt, .html, and .pdf file on your computer.
Run this code block once to set up your workspace.
```python
# Create a simple text file
with open("sample_notes.txt", "w", encoding="utf-8") as f:
    f.write("Project Alpha Meeting Notes.\nDate: 2023-10-01.\nStatus: On track.")

# Create a simple HTML file
html_content = """
<html>
  <head><title>Company Update</title></head>
  <body>
    <h1>Q3 Financial Results</h1>
    <p>Revenue exceeded expectations.</p>
  </body>
</html>
"""
with open("sample_webpage.html", "w", encoding="utf-8") as f:
    f.write(html_content)

# Create a simple PDF (requires fpdf: pip install fpdf).
# If you don't want to install fpdf, just find any PDF on your computer
# and rename it 'sample_doc.pdf' for this exercise.
try:
    from fpdf import FPDF
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Arial", size=12)
    pdf.cell(200, 10, txt="Confidential Project Specs", ln=1, align="C")
    pdf.cell(200, 10, txt="1. The product must be fast.", ln=2)
    pdf.cell(200, 10, txt="2. The product must be secure.", ln=2)
    pdf.output("sample_doc.pdf")
    print("Files created successfully!")
except ImportError:
    print("Please install fpdf (pip install fpdf) or provide your own 'sample_doc.pdf'")
```
Step 2: Handling Text Files
This is the baseline. Text files are simple, but we must handle encoding. If you open a file created on Windows on a Mac (or vice-versa), you might get weird characters unless you specify `utf-8`.
```python
def load_text_file(file_path):
    """
    Reads a simple .txt file and returns the string.
    """
    try:
        # Always specify encoding='utf-8' to avoid special-character crashes
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    except Exception as e:
        return f"Error reading text file: {e}"

# Test it
print("--- Text File Output ---")
print(load_text_file("sample_notes.txt"))
```
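One hedged extension (the function name is mine): if strict UTF-8 decoding fails, retry with Latin-1, which maps every byte to *some* character, so you get text back instead of a crash, possibly with a few wrong accents.

```python
def load_text_file_forgiving(file_path):
    # Try strict UTF-8 first; fall back to Latin-1, which never fails
    # to decode (every possible byte maps to some character).
    for encoding in ("utf-8", "latin-1"):
        try:
            with open(file_path, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # try the next encoding
    return ""  # unreachable in practice: latin-1 always succeeds
```

This trade-off (never crash, occasionally mangle a character) is usually the right one when bulk-loading documents for an LLM.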
Step 3: Handling PDFs with pypdf
Here is where we solve the binary problem. We use `PdfReader`. A PDF is made of pages, so we have to loop through them and extract text page by page.
```python
from pypdf import PdfReader

def load_pdf_file(file_path):
    """
    Reads a PDF, loops through pages, and extracts text.
    """
    text_content = []
    try:
        reader = PdfReader(file_path)
        # Loop through every page in the PDF
        for i, page in enumerate(reader.pages):
            # Extract text from the page
            text = page.extract_text()
            # If extract_text() finds nothing, it returns None or an empty string
            if text:
                text_content.append(f"--- Page {i+1} ---\n{text}")
        return "\n".join(text_content)
    except Exception as e:
        return f"Error reading PDF: {e}"

# Test it
print("\n--- PDF File Output ---")
print(load_pdf_file("sample_doc.pdf"))
```
Note: You will see that pypdf does a decent job, but sometimes headers and footers get mixed into the main text. This is a common challenge in RAG systems.
Step 4: Handling HTML with BeautifulSoup
If you load HTML raw, you get markup like `<h1>Q3 Financial Results</h1>` instead of just the text. We don't want the tags. We want the text inside them. However, we also don't want text inside