Day 32 of 80

Chain of Thought & Reasoning

Phase 4: Prompt Engineering

What You'll Build Today

Today, we are going to build a "Sherlock Holmes" Logic Solver.

Up until now, we have treated the AI as a magic answer box: you put a question in, and an answer pops out. But for complex problems—math, logic puzzles, or strategic planning—this "magic box" approach often fails. The model tries to guess the final answer immediately without doing the necessary work to get there.

Today, we are going to force the model to show its work. You will build a tool that takes a tricky logic puzzle, reasons through it step-by-step in front of you, and only then provides the final answer.

Here is what you will learn:

* Chain of Thought (CoT) Prompting: Why asking the model to "show its work" actually makes it smarter.

* Zero-Shot Reasoning: How a single magic phrase ("Let's think step by step") can drastically improve performance without extra data.

* Few-Shot Reasoning: How to teach the model a specific way to think by giving it examples of logic.

* Output Parsing: How to separate the AI's "thoughts" from its final "answer" so your code can use the result.

Let's turn that black box into a glass box.

The Problem

Imagine you are asking a brilliant but impulsive student a trick question. If they answer immediately, they will likely go with their gut instinct, which is often wrong. If you tell them, "Wait, write down your math first," they almost always get it right.

Large Language Models (LLMs) are exactly like that impulsive student. They are probabilistic engines—they predict the next word. If you ask a complex question, they try to predict the answer immediately based on what words usually follow that question on the internet. They don't naturally "pause" to calculate.

Let's look at a classic example: The "Bat and Ball" problem. This is a famous cognitive reflection test.

The Riddle: "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?" Intuitively, your brain (and the AI's "gut") wants to say $0.10. But if the ball is $0.10, the bat is $1.10, and the total is $1.20. The correct answer is $0.05.
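If the $0.05 answer still feels off, here is a quick check of the algebra in plain Python (working in cents to avoid floating-point noise; the variable names are just for illustration):

total_cents = 110       # bat + ball = $1.10
difference_cents = 100  # bat = ball + $1.00
# ball + (ball + 1.00) = 1.10  =>  2 * ball = total - difference
ball_cents = (total_cents - difference_cents) // 2
bat_cents = ball_cents + difference_cents
print(ball_cents, bat_cents)  # 5 105  ->  ball is $0.05, bat is $1.05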

Here is the code that demonstrates this failure (note: sophisticated models like GPT-4 have seen this specific riddle often enough to memorize it, but this logic failure happens constantly with new, unique problems).

import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# A slightly modified riddle to prevent the model from just reciting a memorized answer.
# We change the numbers to force it to do math.
riddle = """
A toaster and a bagel cost $22.00 in total.
The toaster costs $20.00 more than the bagel.
How much does the bagel cost?
"""

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # Using 3.5 to highlight the logic failure more easily
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": riddle}
    ],
    temperature=0  # We want the most deterministic answer
)

print("AI Answer:")
print(response.choices[0].message.content)

Likely Output (The Pain):
AI Answer:

The bagel costs $2.00.

Why this hurts:

The model sees "$22" and "$20" and "difference." Its statistical pattern matching urges it to subtract 22 - 20 = 2. It is confidently wrong. If you were building a financial application or a tutoring bot, this kind of error is unacceptable.

We need a way to force the model to slow down.

Let's Build It

We are going to fix this using Chain of Thought (CoT) prompting. We will guide the model to generate a "thought trace" before it generates the answer.

Step 1: Zero-Shot Chain of Thought

The simplest way to induce reasoning is a technique discovered by researchers in 2022. It sounds like magic, but it works. We simply append the phrase: "Let's think step by step."

By generating the words "Step 1...", the model changes its own context. When it predicts the next word, it is now predicting based on the reasoning it just wrote, not just the original question.

import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

riddle = """
A toaster and a bagel cost $22.00 in total.
The toaster costs $20.00 more than the bagel.
How much does the bagel cost?
"""

# The Magic Phrase
prompt = riddle + "\nLet's think step by step."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
)

print("AI Reasoning & Answer:")
print(response.choices[0].message.content)

Result:

The model will now likely output something like:

> Let x be the cost of the bagel.
> Then the toaster costs x + 20.
> Total cost is x + (x + 20) = 22.
> 2x + 20 = 22.
> 2x = 2.
> x = 1.
> The bagel costs $1.00.

It got it right! Just by asking it to think.
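One common variation, if you prefer not to append the phrase to every user message, is to put the instruction in the system prompt instead. A small sketch, reusing the client and riddle from above:

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        # The reasoning instruction lives in the system prompt instead of the user message.
        {"role": "system", "content": "You are a careful assistant. Always think step by step before giving your final answer."},
        {"role": "user", "content": riddle}
    ]
)
print(response.choices[0].message.content)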

Step 2: Few-Shot Chain of Thought (Manual Logic)

"Let's think step by step" is great, but sometimes the model rambles. For production applications, we often want a specific format of reasoning. We can use Few-Shot Prompting (giving examples) combined with Chain of Thought.

We will give the model an example of how we want it to solve problems.

system_prompt = """
You are a logic solver. When presented with a problem, you must follow this format:

Thought: [Your step-by-step reasoning process]
Answer: [The final numerical or short answer]
"""

# We provide a 'shot' (example) of how to reason correctly
example_riddle = """
Problem: A bat and a ball cost $1.10. The bat costs $1.00 more than the ball. How much is the ball?
"""

example_solution = """
Thought:
- Let b be the price of the ball.
- The bat is b + 1.00.
- Together they are b + (b + 1.00) = 1.10.
- 2b + 1.00 = 1.10.
- 2b = 0.10.
- b = 0.05.
Answer: $0.05
"""

new_riddle = """
Problem: A toaster and a bagel cost $22.00 in total. The toaster costs $20.00 more than the bagel. How much is the bagel?
"""

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": example_riddle},
        {"role": "assistant", "content": example_solution},  # Showing the model a good response
        {"role": "user", "content": new_riddle}
    ]
)

print(response.choices[0].message.content)

Why this is better:

We control the structure. We explicitly told it to use the Thought: and Answer: labels. This makes the output predictable.

Step 3: Parsing the Output

Now that we have the "Thought" and the "Answer" in the text, we need to separate them in Python. Maybe you want to show the user the final answer immediately, but keep the "reasoning" hidden in a "Show Details" dropdown (like ChatGPT sometimes does).

We will write a function that splits the text based on our keywords.

def solve_logic_puzzle(puzzle_text):
    system_prompt = """
    You are a logic engine. Solve the problem.
    Format your response exactly like this:

    ### REASONING
    (Your step-by-step logic here)

    ### FINAL ANSWER
    (Your final conclusion here)
    """

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": puzzle_text}
        ]
    )

    full_text = response.choices[0].message.content

    # Simple string parsing to separate the parts.
    # We look for the headers we requested.
    if "### REASONING" in full_text and "### FINAL ANSWER" in full_text:
        parts = full_text.split("### FINAL ANSWER")
        reasoning_part = parts[0].replace("### REASONING", "").strip()
        answer_part = parts[1].strip()
        return reasoning_part, answer_part
    else:
        # Fallback if the model messed up formatting
        return full_text, "Could not parse answer"

# Test it
my_puzzle = "If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?"

logic, result = solve_logic_puzzle(my_puzzle)

print(f"--- LOGIC TRACE ---\n{logic}\n")
print(f"--- THE ANSWER ---\n{result}")

Step 4: Self-Consistency (The "Tree of Thoughts" Lite)

Sometimes, even with Chain of Thought, the model makes a mistake. A powerful technique called Self-Consistency involves asking the model to solve the problem 3 times, then checking if the answers agree.

While we won't build a full voting algorithm today, let's update our function to run the logic loop multiple times and print all attempts. This helps you verify if the reasoning is robust.

def verify_reasoning(puzzle_text, attempts=3):
    print(f"Solving: {puzzle_text}\n")

    for i in range(attempts):
        print(f"--- Attempt {i+1} ---")
        logic, result = solve_logic_puzzle(puzzle_text)
        print(f"Reasoning: {logic[:100]}...")  # Just print first 100 chars
        print(f"Final Answer: {result}")
        print("-" * 20)

# A tricky kinship riddle
kinship_riddle = "A woman points to a man in a photo and says: 'His mother is my mother's only daughter.' Who is in the photo?"

verify_reasoning(kinship_riddle)

Run this. You might find that 2 out of 3 attempts say "Her son" and 1 says "Herself" (or vice versa). This reveals why relying on a single generation for complex logic is dangerous!
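If you do want the full self-consistency behavior, the voting step itself is small. Here is a minimal sketch (reusing solve_logic_puzzle, and assuming the answers are short enough to compare as exact strings):

from collections import Counter

def solve_with_voting(puzzle_text, attempts=3):
    # Collect the final answer from several independent attempts,
    # then return the most common one (the majority vote).
    answers = [solve_logic_puzzle(puzzle_text)[1].strip() for _ in range(attempts)]
    winner, count = Counter(answers).most_common(1)[0]
    # Note: answers only count as the same vote if the strings match exactly;
    # real systems normalize them first (e.g., extract just the number).
    print(f"Votes: {answers} -> chosen: {winner} ({count}/{attempts})")
    return winner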

Now You Try

You have a functioning logic solver. Now extend it.

* The Code Debugger: Modify the system_prompt in Step 3. Instead of solving riddles, tell the model it is a Python expert. Pass in a snippet of broken Python code. Ask it to output ### DIAGNOSIS (reasoning about the bug) and ### FIXED CODE.

* The "Devil's Advocate": Write a script that takes an opinion (e.g., "Remote work is always better"). Ask the model to generate ### ARGUMENTS FOR and ### ARGUMENTS AGAINST step-by-step before giving a ### BALANCED CONCLUSION. This forces the model to reason from multiple perspectives.

* Math Word Problem Solver: Find a list of grade-school math word problems (e.g., "Jane has 5 apples..."). Run them through your solver. Experiment with removing the "Let's think step by step" instruction to see if it starts getting them wrong again.

Challenge Project: The Reasoning Showdown

Your challenge is to scientifically prove the value of Chain of Thought. You will build a script that compares "Fast Thinking" (standard prompting) vs. "Slow Thinking" (CoT).

Requirements:

* Create a list of 3 difficult logic puzzles or math word problems (search for "Cognitive Reflection Test questions").
* Write a loop that iterates through each puzzle.
* For each puzzle, run it twice:
  * Once with a standard prompt ("Answer this: [puzzle]").
  * Once with a CoT prompt ("Answer this, let's think step by step: [puzzle]").
* Print the results side-by-side.

Example Output:

Puzzle: The Bat and Ball...
--------------------------------------------------
Standard Prompt Answer: $0.10 (Incorrect)
CoT Prompt Answer: ...algebra steps... Answer: $0.05 (Correct)
--------------------------------------------------

Hints:

* You will need two separate API calls inside your loop.

* Don't worry about automating the "grading" (checking if it's correct) yet; just print the text so you can read the difference.

* Use temperature=0 to ensure the "Standard" prompt fails consistently if the model is prone to failing.
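If you want a starting skeleton, one way to structure it looks like this (the ask helper is just an illustration, and the puzzle list and grading are deliberately left to you):

def ask(prompt_text):
    # One standard chat call, deterministic output (reuses the client from earlier).
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt_text}],
        temperature=0
    )
    return response.choices[0].message.content

puzzles = [
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?",
    # add your other two puzzles here
]

for puzzle in puzzles:
    standard = ask(f"Answer this: {puzzle}")
    cot = ask(f"Answer this, let's think step by step: {puzzle}")
    print(f"Puzzle: {puzzle}\n{'-' * 50}")
    print(f"Standard Prompt Answer: {standard}")
    print(f"CoT Prompt Answer: {cot}")
    print("-" * 50)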

What You Learned

Today you moved beyond simple Q&A and entered the realm of Prompt Engineering for Reasoning.

* CoT (Chain of Thought): You learned that LLMs are better at logic when they have "scratchpad" space to write intermediate steps.

* Zero-Shot CoT: You saw how "Let's think step by step" is a powerful, universal unlock code.

* Delimiters: You learned to use markers like ### REASONING to make AI output parseable by code.

Why This Matters:

In the real world, you rarely ask an AI for a simple fact. You ask it to "Plan a travel itinerary," "Debug this error," or "Analyze this legal contract." All of these require reasoning, not just retrieval. If you don't force the model to think step-by-step, it will hallucinate a plausible-sounding but factually incorrect plan.

Tomorrow:

Now that our prompts are getting powerful, they are also getting dangerous. Tomorrow, we explore Prompt Security. We will learn how hackers try to trick LLMs into revealing secrets or ignoring instructions, and how you can defend against it.