Chunking Strategies
What You'll Build Today
Welcome to Day 38! We are deep in Phase 5: RAG Systems (Retrieval-Augmented Generation).
Yesterday, you learned how to turn text into numbers (embeddings). Today, we tackle a purely structural problem that makes or breaks your AI application: Chunking.
Imagine you have a 300-page PDF of a technical manual. You want to ask an AI, "How do I reset the pressure valve?"
If you turn the entire 300 pages into one single vector (one list of numbers), the specific details about the pressure valve get "washed out" by the average of the other 299 pages. The search becomes muddy.
To fix this, we break the document into smaller pieces, or "chunks." Today, you will build a Chunking Visualizer.
You will learn:
* Fixed-Size Chunking: Why slicing text strictly by character count is fast but dangerous.
* Chunk Overlap: Why we repeat the last few words of one chunk at the start of the next (the "safety net").
* Semantic/Separator Chunking: How to respect the natural pauses in text (paragraphs and sentences) to keep ideas intact.
* The Goldilocks Zone: Understanding the tradeoff between chunks that are too small (no context) and too big (too much noise).
Let's start slicing.
---
The Problem
We want to build a system where we can find specific information in a long text.
Here is the scenario: You have a long string of text. You want to feed it into a system (like a Vector Database, which we cover tomorrow) to make it searchable.
The naive approach is: "Computers are fast. Let's just chop the text every 20 characters."
Look at what happens when we do this without thinking about the content.
The Broken Code
# Our source text: A mock company policy
text = """
Security Policy:
1. Passwords must be at least 12 characters long.
2. Two-factor authentication is required for all admin accounts.
3. Physical keys must be returned to HR upon termination.
"""

# The Naive Approach: Hard slice every 20 characters
chunk_size = 20
chunks = []

for i in range(0, len(text), chunk_size):
    # Just slice the string strictly
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)

# Let's see the result
print(f"Total chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: '{chunk}'")
The Pain
Run that code. Look at the output. It likely looks something like this:
Chunk 0: '
Security Policy:
1.'
Chunk 1: ' Passwords must be a'
Chunk 2: 't least 12 character'
Chunk 3: 's long.
Two-facto'
Why is this painful?
* "Two-facto": The word "factor" was cut in half. If you search for "factor," this chunk might not match.
* "t least 12 character": Without the context from Chunk 1 ("Passwords must be a..."), Chunk 2 is meaningless. If the AI retrieves only Chunk 2, it knows something involves "12 characters," but it doesn't know what.
There has to be a smarter way to cut the cake so we don't slice right through the decorations.
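To make the failure concrete, here is a quick check you can run on the chunks produced above (a small sketch; the query string is just an example):

# A naive substring search for "factor" finds nothing,
# because the word was split across two chunks.
query = "factor"
matches = [c for c in chunks if query in c]
print(f"Chunks containing '{query}': {len(matches)}")  # prints 0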
---
Let's Build It
We represent text as strings. To fix the pain above, we need to handle boundaries intelligently. We will build a "Smart Chunker" in progressive steps.
Step 1: Setup and Data
First, let's create a more robust dataset and a helper function to visualize our chunks clearly. We will use a visual separator so we can see exactly where one chunk ends and the next begins.
# A longer, more realistic text with paragraphs
document_text = """
The Apollo 11 mission was the first spaceflight that landed the first two people on the Moon.
Commander Neil Armstrong and lunar module pilot Buzz Aldrin, both American, landed the Apollo Lunar Module Eagle on July 20, 1969, at 20:17 UTC.

Armstrong became the first person to step onto the lunar surface six hours and 39 minutes later on July 21 at 02:56 UTC; Aldrin joined him 19 minutes later.
They spent about two and a quarter hours together outside the spacecraft, and they collected 47.5 pounds (21.5 kg) of lunar material to bring back to Earth.

Command module pilot Michael Collins flew the Command and Service Module Columbia alone in lunar orbit while they were on the Moon's surface.
Armstrong and Aldrin spent 21 hours, 36 minutes on the lunar surface at a site they named Tranquility Base before lifting off to rejoin Columbia in lunar orbit.
"""

def visualize_chunks(chunk_list):
    print(f"--- Generated {len(chunk_list)} Chunks ---")
    for i, chunk in enumerate(chunk_list):
        print(f"[{i}]: {repr(chunk)}")  # repr() shows special characters like \n
    print("-" * 40)
print("Setup complete. Text loaded.")
Step 2: Respecting Words (Splitting by Separator)
The first fix is simple: Don't cut in the middle of a word. Instead of slicing by character index, we can split the text into a list of words, group them until they reach a limit, and then start a new group.
However, an easier way in Python is to use the .split() method. Let's try splitting by paragraphs first. This is often called "Semantic Chunking" because a paragraph or sentence usually contains a complete thought.
def split_by_paragraph(text):
    # Split by double newline, which usually indicates a paragraph break
    chunks = text.strip().split('\n\n')
    # Filter out any empty strings just in case
    return [c.strip() for c in chunks if c.strip()]
paragraph_chunks = split_by_paragraph(document_text)
print("Strategy: Paragraph Split")
visualize_chunks(paragraph_chunks)
Why this matters:
Run this. You will see 3 distinct chunks. Each is a complete paragraph. This is much better than cutting words in half!
The Downside: What if one paragraph is 2000 words long? That might be too big for our embedding model or context window. We need a way to enforce a maximum size while still trying to respect boundaries.
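You can see this coming by printing the size of each paragraph chunk (a quick check using paragraph_chunks from above; the 200-character limit is just an illustrative threshold):

# Check whether any paragraph already exceeds a hypothetical size limit
MAX_CHARS = 200  # illustrative limit, not a magic number
for i, chunk in enumerate(paragraph_chunks):
    status = "TOO BIG" if len(chunk) > MAX_CHARS else "ok"
    print(f"Paragraph {i}: {len(chunk)} chars ({status})")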
Step 3: The "Greedy" Chunker with Size Limits
Now we will build a function that adds words to a chunk one by one. If adding the next word exceeds our max_size, we "close" the current chunk and start a new one.
This ensures no chunk is too big, and no word is cut in half.
def greedy_chunker(text, max_chunk_size=100):
    words = text.split(' ')  # Split text into individual words
    current_chunk = []
    current_length = 0
    all_chunks = []

    for word in words:
        # Calculate length of word plus a space
        word_len = len(word) + 1

        # If adding this word exceeds the limit...
        if current_length + word_len > max_chunk_size:
            # 1. Save the current chunk
            all_chunks.append(" ".join(current_chunk))
            # 2. Start a new chunk with the current word
            current_chunk = [word]
            current_length = word_len
        else:
            # Otherwise, just add the word to the current chunk
            current_chunk.append(word)
            current_length += word_len

    # Don't forget to append the final chunk after the loop ends!
    if current_chunk:
        all_chunks.append(" ".join(current_chunk))

    return all_chunks
# Let's try with a small size to force splits
print("Strategy: Greedy Word Split (Max 100 chars)")
greedy_chunks = greedy_chunker(document_text, max_chunk_size=100)
visualize_chunks(greedy_chunks)
Why this matters:
We now have consistency. We know roughly how big our chunks are (good for computer memory), and we know words aren't broken (good for meaning).
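A quick sanity check backs up both claims (using greedy_chunks from the code above):

# 1. No chunk exceeds the limit we asked for
print("Longest chunk:", max(len(c) for c in greedy_chunks), "chars")
# 2. Re-joining the chunks with spaces reproduces the original text exactly,
#    which means no word was lost or cut in half
print("No words broken:", " ".join(greedy_chunks) == document_text)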
Step 4: Adding Overlap (The Context Bridge)
Look at the output of Step 3. You might see a sentence cut in half simply because the chunk limit was reached. For example:
* Chunk A ends with: "...landed the first two people on the"
* Chunk B starts with: "Moon.\nCommander Neil Armstrong and lunar module pilot..."
If we search for "Where did they land?", Chunk A has the setup, and Chunk B has the resolution, but neither has the full answer.
The Solution: Overlap. When we start a new chunk, we shouldn't start from zero. We should include the last few words from the previous chunk. This creates a "sliding window."
def overlapping_chunker(text, chunk_size=100, overlap_size=20):
    words = text.split(' ')
    chunks = []

    # We will use a while loop to manage our position in the word list
    i = 0
    while i < len(words):
        # To keep the logic easy to follow, we chunk by WORD COUNT here,
        # not by characters: chunk_size and overlap_size are measured in words.

        # Take a slice of words starting at the current position
        current_batch = words[i : i + chunk_size]
        chunk_text = " ".join(current_batch)
        chunks.append(chunk_text)

        # Move our pointer forward -- but NOT by the full chunk_size.
        # We step back by 'overlap_size' to repeat data.
        step = chunk_size - overlap_size

        # Ensure we always move forward at least 1 word to avoid infinite loops
        if step < 1:
            step = 1
        i += step

    return chunks
# Chunk by WORDS now.
# Size 20 words, Overlap 5 words.
print("Strategy: Sliding Window (Size 20 words, Overlap 5 words)")
overlap_chunks = overlapping_chunker(document_text, chunk_size=20, overlap_size=5)
visualize_chunks(overlap_chunks)
Analyze the Output:
Look closely at the end of Chunk 0 and the start of Chunk 1.
* Chunk 0 ends: "...landed the first two people on the Moon.\nCommander Neil Armstrong and"
* Chunk 1 starts: "the Moon.\nCommander Neil Armstrong and lunar module pilot..."
See the repetition? That is the Overlap. The last five words of Chunk 0 reappear at the start of Chunk 1, so the link between the Moon landing and Armstrong is carried into both chunks instead of being severed at the cut.
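You can verify the overlap programmatically (a small check using overlap_chunks from above; the 5 matches the overlap_size we passed in):

# The last 5 words of each chunk should reappear at the start of the next one
for i in range(len(overlap_chunks) - 1):
    tail = overlap_chunks[i].split(" ")[-5:]
    head = overlap_chunks[i + 1].split(" ")[:5]
    print(f"Chunk {i} -> Chunk {i + 1} overlap intact: {tail == head}")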
Step 5: Putting it all together
In professional settings, we combine these strategies. We try to split by paragraphs first. If a paragraph is too big, we split by sentences. If a sentence is too big, we split by words.
This is often called Recursive Character Text Splitting.
Here is a simplified implementation that prioritizes paragraphs, but falls back to fixed size if needed.
def smart_chunker(text, max_chars=150):
    # 1. Split by Paragraphs first (Semantic split)
    paragraphs = text.strip().split('\n\n')
    final_chunks = []

    for para in paragraphs:
        # Clean whitespace
        para = para.strip()
        if not para:
            continue

        # 2. Check if paragraph fits in one chunk
        if len(para) <= max_chars:
            final_chunks.append(para)
        else:
            # 3. If too big, apply a sliding character slice to this paragraph
            # (Using a simplified character slice for brevity here)
            # In a real app, you'd split this paragraph by sentences first.
            start = 0
            while start < len(para):
                end = start + max_chars
                chunk = para[start:end]
                final_chunks.append(chunk)
                # Overlap by 20 chars for the next slice
                start = end - 20

    return final_chunks
print("Strategy: Smart Hybrid (Paragraphs + Fallback)")
smart_chunks = smart_chunker(document_text, max_chars=200)
visualize_chunks(smart_chunks)
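In production you rarely hand-roll this. Libraries such as LangChain ship a splitter built around the same idea; here is a minimal sketch, assuming the langchain-text-splitters package is installed:

# Sketch: the same "split big pieces into smaller ones" idea, via a library
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,    # maximum characters per chunk
    chunk_overlap=20,  # characters repeated between neighboring chunks
)
library_chunks = splitter.split_text(document_text)
visualize_chunks(library_chunks)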
---
Now You Try
You have the basic logic. Now extend the code you built to handle these specific scenarios (a starting point for the last one is sketched after the list):
* Sentence chunker: Modify the chunker to split on "." instead of spaces. Then group those sentences until they hit a character limit.
* Markdown chunker: Write a splitter that breaks on Markdown headers (#). Test it with a string like "# Header 1\nContent...\n# Header 2\nContent...".
* Token estimates: Update the visualize_chunks function to print the estimated token count for each chunk (len(chunk) / 4).
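For the last exercise, here is one possible tweak to visualize_chunks (the 4-characters-per-token figure is only a rough rule of thumb):

def visualize_chunks(chunk_list):
    print(f"--- Generated {len(chunk_list)} Chunks ---")
    for i, chunk in enumerate(chunk_list):
        est_tokens = len(chunk) // 4  # rough heuristic: ~4 characters per token
        print(f"[{i}] (~{est_tokens} tokens): {repr(chunk)}")
    print("-" * 40)
---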
Challenge Project: The Chunking Bake-off
Your challenge is to build a comparison tool that measures which chunking strategy works best for a specific query.
Requirements:
* Use the document_text from Step 1.
* Chunk it with three strategies:
  * Small chunks (50 characters, 0 overlap).
  * Medium chunks (150 characters, 20 overlap).
  * Large chunks (Paragraph based).
* For a test query, report for each strategy:
  * Strategy Name.
  * Total Chunks generated.
  * The specific chunk text that matched the answer.
  * How much "extra noise" (irrelevant text) was in that chunk.
Example Output:
Query: "When did they land?" (Looking for 'July 20')
Strategy: Small Chunks
- Found in: "on July 20, 1969"
- Noise: Low (Good precision, but might lack context)
Strategy: Paragraph Chunks
- Found in: "The Apollo 11 mission... [200 chars] ... July 20, 1969..."
- Noise: High (Lots of irrelevant text included)
Hint:
To calculate "Noise," you can compare the length of the chunk to the length of the answer you were looking for.
---
What You Learned
Today you tackled the "Pre-processing" stage of RAG. It's not glamorous, but it is essential.
* Fixed-size chunking is fast but breaks meaning.
* Semantic chunking (paragraphs/sentences) preserves ideas but varies wildly in size.
* Overlap is the duct tape that holds context together across cuts.
* Tradeoffs: Small chunks are precise but miss the "big picture." Large chunks capture the picture but confuse the search engine with too much detail.
Why This Matters: In a real AI application, if your chunks are bad, your AI is blind. The best AI model in the world cannot answer a question if the relevant information was cut in half or buried in a mountain of irrelevant text.
Tomorrow: Now that we have clean chunks, how do we search through 10,000 of them in milliseconds? We enter the world of Vector Databases.