Chunking Strategies
What You'll Build Today
Welcome to Day 38! We are deep in Phase 5: RAG Systems (Retrieval-Augmented Generation).
Yesterday, you learned how to turn text into numbers (embeddings). Today, we tackle a purely structural problem that makes or breaks your AI application: Chunking.
Imagine you have a 300-page PDF of a technical manual. You want to ask an AI, "How do I reset the pressure valve?"
If you turn the entire 300 pages into one single vector (one list of numbers), the specific details about the pressure valve get "washed out" by the average of the other 299 pages. The search becomes muddy.
To fix this, we break the document into smaller pieces, or "chunks." Today, you will build a Chunking Visualizer.
You will learn:
* Fixed-Size Chunking: Why slicing text strictly by character count is fast but dangerous.
* Chunk Overlap: Why we repeat the last few words of one chunk at the start of the next (the "safety net").
* Semantic/Separator Chunking: How to respect the natural pauses in text (paragraphs and sentences) to keep ideas intact.
* The Goldilocks Zone: Understanding the tradeoff between chunks that are too small (no context) and too big (too much noise).
Let's start slicing.
---
The Problem
We want to build a system where we can find specific information in a long text.
Here is the scenario: You have a long string of text. You want to feed it into a system (like a Vector Database, which we cover tomorrow) to make it searchable.
The naive approach is: "Computers are fast. Let's just chop the text every 20 characters."
Look at what happens when we do this without thinking about the content.
The Broken Code
# Our source text: A mock company policy
text = """
Security Policy:
1. Passwords must be at least 12 characters long.
2. Two-factor authentication is required for all admin accounts.
3. Physical keys must be returned to HR upon termination.
"""

# The Naive Approach: Hard slice every 20 characters
chunk_size = 20
chunks = []

for i in range(0, len(text), chunk_size):
    # Just slice the string strictly
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)

# Let's see the result
print(f"Total chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: '{chunk}'")
The Pain
Run that code. Look at the output. It likely looks something like this:
Chunk 0: '
Security Policy:
1.'
Chunk 1: ' Passwords must be a'
Chunk 2: 't least 12 character'
Chunk 3: 's long.
Two-facto'
Why is this painful?
* "Two-facto": The word "factor" was cut in half. If you search for "factor," this chunk might not match.
* "t least 12 character": Without the context from Chunk 1 ("Passwords must be a..."), Chunk 2 is meaningless. If the AI retrieves only Chunk 2, it knows something involves "12 characters," but it doesn't know what.
There has to be a smarter way to cut the cake so we don't slice right through the decorations.
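To make the failure concrete, here is a quick check you can run on the chunks produced above (a small sketch; the query string is just an example):

# A naive substring search for "factor" finds nothing,
# because the word was split across two chunks.
query = "factor"
matches = [c for c in chunks if query in c]
print(f"Chunks containing '{query}': {len(matches)}")  # prints 0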
---
Let's Build It
We represent text as strings. To fix the pain above, we need to handle boundaries intelligently. We will build a "Smart Chunker" in progressive steps.
Step 1: Setup and Data
First, let's create a more robust dataset and a helper function to visualize our chunks clearly. We will use a visual separator so we can see exactly where one chunk ends and the next begins.
# A longer, more realistic text with paragraphs
document_text = """
The Apollo 11 mission was the first spaceflight that landed the first two people on the Moon.
Commander Neil Armstrong and lunar module pilot Buzz Aldrin, both American, landed the Apollo Lunar Module Eagle on July 20, 1969, at 20:17 UTC.

Armstrong became the first person to step onto the lunar surface six hours and 39 minutes later on July 21 at 02:56 UTC; Aldrin joined him 19 minutes later.
They spent about two and a quarter hours together outside the spacecraft, and they collected 47.5 pounds (21.5 kg) of lunar material to bring back to Earth.

Command module pilot Michael Collins flew the Command and Service Module Columbia alone in lunar orbit while they were on the Moon's surface.
Armstrong and Aldrin spent 21 hours, 36 minutes on the lunar surface at a site they named Tranquility Base before lifting off to rejoin Columbia in lunar orbit.
"""

def visualize_chunks(chunk_list):
    print(f"--- Generated {len(chunk_list)} Chunks ---")
    for i, chunk in enumerate(chunk_list):
        print(f"[{i}]: {repr(chunk)}")  # repr() shows special characters like \n
    print("-" * 40)
print("Setup complete. Text loaded.")
Step 2: Respecting Words (Splitting by Separator)
The first fix is simple: Don't cut in the middle of a word. Instead of slicing by character index, we can split the text into a list of words, group them until they reach a limit, and then start a new group.
However, an easier way in Python is to use the .split() method. Let's try splitting by paragraphs first. This is often called "Semantic Chunking" because a paragraph or sentence usually contains a complete thought.
def split_by_paragraph(text):
    # Split by double newline, which usually indicates a paragraph break
    chunks = text.strip().split('\n\n')
    # Filter out any empty strings just in case
    return [c.strip() for c in chunks if c.strip()]
paragraph_chunks = split_by_paragraph(document_text)
print("Strategy: Paragraph Split")
visualize_chunks(paragraph_chunks)
Why this matters:
Run this. You will see 3 distinct chunks. Each is a complete paragraph. This is much better than cutting words in half!
The Downside: What if one paragraph is 2000 words long? That might be too big for our embedding model or context window. We need a way to enforce a maximum size while still trying to respect boundaries.
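You can see this coming by printing the size of each paragraph chunk (a quick check using paragraph_chunks from above; the 200-character limit is just an illustrative threshold):

# Check whether any paragraph already exceeds a hypothetical size limit
MAX_CHARS = 200  # illustrative limit, not a magic number
for i, chunk in enumerate(paragraph_chunks):
    status = "TOO BIG" if len(chunk) > MAX_CHARS else "ok"
    print(f"Paragraph {i}: {len(chunk)} chars ({status})")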
Step 3: The "Greedy" Chunker with Size Limits
Now we will build a function that adds words to a chunk one by one. If adding the next word exceeds our max_size, we "close" the current chunk and start a new one.
This ensures no chunk is too big, and no word is cut in half.
def greedy_chunker(text, max_chunk_size=100):
    words = text.split(' ')  # Split text into individual words
    current_chunk = []
    current_length = 0
    all_chunks = []

    for word in words:
        # Calculate length of word plus a space
        word_len = len(word) + 1

        # If adding this word exceeds the limit...
        if current_length + word_len > max_chunk_size:
            # 1. Save the current chunk
            all_chunks.append(" ".join(current_chunk))
            # 2. Start a new chunk with the current word
            current_chunk = [word]
            current_length = word_len
        else:
            # Otherwise, just add the word to the current chunk
            current_chunk.append(word)
            current_length += word_len

    # Don't forget to append the final chunk after the loop ends!
    if current_chunk:
        all_chunks.append(" ".join(current_chunk))

    return all_chunks
# Let's try with a small size to force splits
print("Strategy: Greedy Word Split (Max 100 chars)")
greedy_chunks = greedy_chunker(document_text, max_chunk_size=100)
visualize_chunks(greedy_chunks)
Why this matters:
We now have consistency. We know roughly how big our chunks are (good for computer memory), and we know words aren't broken (good for meaning).
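A quick sanity check backs up both claims (using greedy_chunks from the code above):

# 1. No chunk exceeds the limit we asked for
print("Longest chunk:", max(len(c) for c in greedy_chunks), "chars")
# 2. Re-joining the chunks with spaces reproduces the original text exactly,
#    which means no word was lost or cut in half
print("No words broken:", " ".join(greedy_chunks) == document_text)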
Step 4: Adding Overlap (The Context Bridge)
Look at the output of Step 3. You might see a sentence cut in half simply because the chunk limit was reached. For example:
* Chunk A ends with: "...landed the first two people on the"
* Chunk B starts with: "Moon.\nCommander Neil Armstrong and lunar module pilot..."
If we search for "Where did they land?", Chunk A has the setup, and Chunk B has the resolution, but neither has the full answer.
The Solution: Overlap. When we start a new chunk, we shouldn't start from zero. We should include the last few words from the previous chunk. This creates a "sliding window."
def overlapping_chunker(text, chunk_size=100, overlap_size=20):
    words = text.split(' ')
    chunks = []

    # We will use a while loop to manage our position in the word list
    i = 0
    while i < len(words):
        # To keep the logic easy to follow, we chunk by WORD COUNT here,
        # not by characters: chunk_size and overlap_size are measured in words.

        # Take a slice of words starting at the current position
        current_batch = words[i : i + chunk_size]
        chunk_text = " ".join(current_batch)
        chunks.append(chunk_text)

        # Move our pointer forward -- but NOT by the full chunk_size.
        # We step back by 'overlap_size' to repeat data.
        step = chunk_size - overlap_size

        # Ensure we always move forward at least 1 word to avoid infinite loops
        if step < 1:
            step = 1
        i += step

    return chunks
# Chunk by WORDS now.
# Size 20 words, Overlap 5 words.
print("Strategy: Sliding Window (Size 20 words, Overlap 5 words)")
overlap_chunks = overlapping_chunker(document_text, chunk_size=20, overlap_size=5)
visualize_chunks(overlap_chunks)
Analyze the Output:
Look closely at the end of Chunk 0 and the start of Chunk 1.
* Chunk 0 ends: "...landed the first two people on the Moon.\nCommander Neil Armstrong and"
* Chunk 1 starts: "the Moon.\nCommander Neil Armstrong and lunar module pilot..."
See the repetition? That is the Overlap. The last five words of Chunk 0 reappear at the start of Chunk 1, so the link between the Moon landing and Armstrong is carried into both chunks instead of being severed at the cut.
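You can verify the overlap programmatically (a small check using overlap_chunks from above; the 5 matches the overlap_size we passed in):

# The last 5 words of each chunk should reappear at the start of the next one
for i in range(len(overlap_chunks) - 1):
    tail = overlap_chunks[i].split(" ")[-5:]
    head = overlap_chunks[i + 1].split(" ")[:5]
    print(f"Chunk {i} -> Chunk {i + 1} overlap intact: {tail == head}")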
Step 5: Putting it all together
In professional settings, we combine these strategies. We try to split by paragraphs first. If a paragraph is too big, we split by sentences. If a sentence is too big, we split by words.
This is often called Recursive Character Text Splitting.
Here is a simplified implementation that prioritizes paragraphs, but falls back to fixed size if needed.
def smart_chunker(text, max_chars=150):
    # 1. Split by Paragraphs first (Semantic split)
    paragraphs = text.strip().split('\n\n')
    final_chunks = []

    for para in paragraphs:
        # Clean whitespace
        para = para.strip()
        if not para:
            continue

        # 2. Check if paragraph fits in one chunk
        if len(para) <= max_chars:
            final_chunks.append(para)
        else:
            # 3. If too big, apply a sliding character slice to this paragraph
            # (Using a simplified character slice for brevity here)
            # In a real app, you'd split this paragraph by sentences first.
            start = 0
            while start < len(para):
                end = start + max_chars
                chunk = para[start:end]
                final_chunks.append(chunk)
                # Overlap by 20 chars for the next slice
                start = end - 20

    return final_chunks
print("Strategy: Smart Hybrid (Paragraphs + Fallback)")
smart_chunks = smart_chunker(document_text, max_chars=200)
visualize_chunks(smart_chunks)
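In production you rarely hand-roll this. Libraries such as LangChain ship a splitter built around the same idea; here is a minimal sketch, assuming the langchain-text-splitters package is installed:

# Sketch: the same "split big pieces into smaller ones" idea, via a library
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,    # maximum characters per chunk
    chunk_overlap=20,  # characters repeated between neighboring chunks
)
library_chunks = splitter.split_text(document_text)
visualize_chunks(library_chunks)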
---
Now You Try
You have the basic logic. Now extend the code you built to handle these specific scenarios (a starting point for the last one is sketched after the list):
* Sentence chunker: Modify the chunker to split on "." instead of spaces. Then group those sentences until they hit a character limit.
* Markdown chunker: Write a splitter that breaks on Markdown headers (#). Test it with a string like "# Header 1\nContent...\n# Header 2\nContent...".
* Token estimates: Update the visualize_chunks function to print the estimated token count for each chunk (len(chunk) / 4).
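For the last exercise, here is one possible tweak to visualize_chunks (the 4-characters-per-token figure is only a rough rule of thumb):

def visualize_chunks(chunk_list):
    print(f"--- Generated {len(chunk_list)} Chunks ---")
    for i, chunk in enumerate(chunk_list):
        est_tokens = len(chunk) // 4  # rough heuristic: ~4 characters per token
        print(f"[{i}] (~{est_tokens} tokens): {repr(chunk)}")
    print("-" * 40)
---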
Challenge Project: The Chunking Bake-off
Your challenge is to build a comparison tool that measures which chunking strategy works best for a specific query.
Requirements:
* Use the document_text from Step 1.
* Chunk it with three strategies:
  * Small chunks (50 characters, 0 overlap).
  * Medium chunks (150 characters, 20 overlap).
  * Large chunks (Paragraph based).
* For a test query, report for each strategy:
  * Strategy Name.
  * Total Chunks generated.
  * The specific chunk text that matched the answer.
  * How much "extra noise" (irrelevant text) was in that chunk.
Example Output:
Query: "When did they land?" (Looking for 'July 20')
Strategy: Small Chunks
- Found in: "on July 20, 1969"
- Noise: Low (Good precision, but might lack context)
Strategy: Paragraph Chunks
- Found in: "The Apollo 11 mission... [200 chars] ... July 20, 1969..."
- Noise: High (Lots of irrelevant text included)
Hint:
To calculate "Noise," you can compare the length of the chunk to the length of the answer you were looking for.
---
What You Learned
Today you tackled the "Pre-processing" stage of RAG. It's not glamorous, but it is essential.
* Fixed-size chunking is fast but breaks meaning.
* Semantic chunking (paragraphs/sentences) preserves ideas but varies wildly in size.
* Overlap is the duct tape that holds context together across cuts.
* Tradeoffs: Small chunks are precise but miss the "big picture." Large chunks capture the picture but confuse the search engine with too much detail.
Why This Matters: In a real AI application, if your chunks are bad, your AI is blind. The best AI model in the world cannot answer a question if the relevant information was cut in half or buried in a mountain of irrelevant text.
Tomorrow: Now that we have clean chunks, how do we search through 10,000 of them in milliseconds? We enter the world of Vector Databases.