Day 65 of 80

Phase 7: Advanced Techniques

Day 65: Monitoring & Observability

What You'll Build Today

Up until now, your AI applications have been a bit of a "black box." You send a prompt in, you wait, and an answer comes out. But what happened in between? Did the LLM hesitate? Did it call a tool three times because the first two failed? How much did that specific query cost you?

Today, we are turning on the lights. You will build a fully observable multi-step chain and a tool-using research agent, both integrated with LangSmith.

Here is what you will master:

* Tracing: Visualizing the exact path your data takes through complex chains and agents.

* Cost Tracking: Seeing exactly how many tokens (and dollars) every single run consumes.

* Latency Debugging: Identifying which step of your chain is slowing everything down.

* Metadata Tagging: Labeling your runs so you can filter them later (e.g., "Development" vs "Production").

By the end of this session, you won't just hope your code works—you will have the X-ray vision to prove it.

---

The Problem

Imagine you have built a customer support bot. It uses three different tools: a database lookup, a policy document search, and a calculator.

A user reports: "The bot took 20 seconds to reply and then gave me the wrong refund amount."

How do you debug this? Without observability, your code looks like this:

import os

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Setup
model = ChatOpenAI(model="gpt-4o-mini")

# A complex chain (simulated)
prompt = ChatPromptTemplate.from_template("Solve this math problem: {question}")
chain = prompt | model | StrOutputParser()

# The "Debugging" Nightmare
try:
    print("Starting chain...")
    response = chain.invoke({"question": "What is the square root of 144 times 5?"})
    print(f"Response: {response}")
except Exception as e:
    print(f"Error: {e}")

If this fails, or if it takes too long, your console output tells you nothing about why.

* Did the prompt take too long to generate?

* Did the model hang?

* Did the output parser fail?

* If you had tools, which tool received bad inputs?

You might try adding print() statements between every step, but once you have loops and agents, your terminal becomes a wall of unreadable text. You are flying blind.

There has to be a better way to visualize the flow of execution.

---

Let's Build It

We are going to use LangSmith, a platform specifically designed to debug and monitor LLM applications. It integrates natively with LangChain.

Prerequisites:

* You need a LangSmith account (sign up at smith.langchain.com).
* Generate a LangSmith API Key.

Step 1: The Setup

The beauty of LangSmith is that you often don't need to change your Python code logic. You just need to set environment variables.

Create a new file called observable_agent.py.

import os
import getpass

# 1. Standard OpenAI Setup
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OpenAI API Key: ")

# 2. LangSmith Setup (The Magic Part)
# These three lines turn on the "X-Ray Vision"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"

# Enter your LangSmith API Key here
if not os.environ.get("LANGCHAIN_API_KEY"):
    os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter LangSmith API Key: ")

# Optional: Name your project so it's easy to find in the dashboard
os.environ["LANGCHAIN_PROJECT"] = "Day65-Observability-Bootcamp"

print("Observability configured successfully.")

Step 2: Creating a Multi-Step Chain

To see the power of tracing, we need something slightly complex. Let's build a chain that translates a sentence, then summarizes it, then checks the tone.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

model = ChatOpenAI(model="gpt-4o-mini")

# Sub-chain 1: Translate to French
translate_prompt = ChatPromptTemplate.from_template(
    "Translate this text to French: {text}"
)
translate_chain = translate_prompt | model | StrOutputParser()

# Sub-chain 2: Summarize the French text
summarize_prompt = ChatPromptTemplate.from_template(
    "Summarize this French text in one sentence (keep it in French): {french_text}"
)
summarize_chain = summarize_prompt | model | StrOutputParser()

# Sub-chain 3: Tone Check (in English)
tone_prompt = ChatPromptTemplate.from_template(
    "Analyze the tone of this French statement. Reply in English: {summary}"
)
tone_chain = tone_prompt | model | StrOutputParser()

# Combine them into a sequence
# We use a dictionary to pass output from one step to the next
full_chain = (
    {"french_text": translate_chain}
    | RunnablePassthrough.assign(summary=summarize_chain)
    | RunnablePassthrough.assign(tone=tone_chain)
)

print("Chain created.")

Step 3: Running and Tracing

Now, let's run this chain. Because we set the environment variables in Step 1, this run will automatically be sent to the cloud.

input_text = """
I am absolutely furious about the service I received today.
The waiter ignored us for 30 minutes, the food was cold, and
when I complained, the manager rolled his eyes. I will never come back here.
"""

print("--- Invoking Chain ---")
result = full_chain.invoke({"text": input_text})

print("\n--- Final Result ---")
print(f"Original: {input_text[:50]}...")
print(f"French Translation: {result['french_text']}")
print(f"French Summary: {result['summary']}")
print(f"Tone Analysis: {result['tone']}")

print("\nCheck your LangSmith dashboard now!")

What to do now:

* Run the code.
* Go to smith.langchain.com.
* Click on the project "Day65-Observability-Bootcamp".
* Click on the run that just appeared.

What you will see:

You won't just see the final output. You will see a "Waterfall" visualization.

* Root: The RunnableSequence (the whole app).

* Child 1: translate_chain -> showing the exact prompt sent to OpenAI and the raw response.

* Child 2: summarize_chain -> showing the French input it received.

* Child 3: tone_chain.

You can click on any bar in the chart to see the latency (how long it took) and the token usage (cost) for that specific step.
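If you would rather pull those numbers programmatically than click through the UI, the langsmith SDK exposes a Client with a list_runs method. Below is a minimal sketch; exact parameter and field names (such as is_root or total_tokens) can vary by SDK version, so treat it as a starting point rather than a reference.

from langsmith import Client

client = Client()

# Fetch recent top-level traces from our project
runs = client.list_runs(
    project_name="Day65-Observability-Bootcamp",
    is_root=True,
    limit=5,
)

for run in runs:
    # end_time can be None while a run is still in flight;
    # token counts may be missing depending on the model/provider.
    latency = (run.end_time - run.start_time).total_seconds() if run.end_time else None
    print(run.name, latency, getattr(run, "total_tokens", None))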

Step 4: Adding Metadata (Context is King)

In a real app, you have different users, different environments (Dev vs Prod), and different model versions. You can tag your traces so you can filter them later.

# We can pass a 'config' dictionary to invoke
config = {
    "tags": ["experiment-A", "tone-analysis"],
    "metadata": {
        "user_id": "user_123",
        "environment": "development",
        "model_version": "gpt-4o-mini"
    }
}

print("\n--- Invoking with Metadata ---")

# Running a different input
positive_text = "The sunrise was beautiful and the coffee was perfect."
result_2 = full_chain.invoke({"text": positive_text}, config=config)

print("Run complete. Filter by 'user_123' in LangSmith to find this specific run.")

Step 5: Tracing a Tool-Using Agent

Chains are predictable. Agents are chaotic. Tracing is most useful when the AI decides what to do. Let's create a simple tool-using agent to see how the trace looks different.

# Requires the duckduckgo-search package: pip install duckduckgo-search
from langchain_community.tools import DuckDuckGoSearchRun
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate

# 1. Define Tool
search = DuckDuckGoSearchRun()
tools = [search]

# 2. Define Agent Prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use tools to find information."),
    ("user", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

# 3. Create Agent
agent = create_tool_calling_agent(model, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# 4. Run with Tracing
print("\n--- Running Agent with Tracing ---")
agent_executor.invoke(
    {"input": "Who won the Super Bowl in 2024 and what was the score?"},
    config={"metadata": {"type": "agent_test"}}
)

Check the Dashboard:

When you look at this trace, you will see a loop:

* LLM Call: The model decides to call duckduckgo_search.
* Tool Call: The tool actually runs (you can see the search query).
* LLM Call: The model takes the search result and formulates the final answer.

This lets you answer questions like: "Did the search tool fail to find the answer, or did the LLM ignore the search result?"

---

Now You Try

You have the basics. Now, use observability to investigate your own code.

1. The "Prompt A/B Test" (a minimal sketch follows this list):

* Create two versions of a prompt (e.g., one simple, one very detailed).
* Run both using the same input.
* Use metadata tags ["prompt_v1"] and ["prompt_v2"].
* Go to LangSmith and compare the token count and latency. Which one is cheaper? Which is faster?

2. The Error Trap:

* Intentionally break your code. Create a tool that divides by zero or raises a ValueError.
* Run the agent.
* Find the trace in LangSmith. Notice how the error is highlighted in red. Click it to see exactly which step threw the exception and what the variables were at that moment.

3. The "Conversation" Trace:

* Create a simple chatbot loop (using while True).
* Pass a session_id in the metadata for every message in the loop.
* Chat with it for 5 turns.
* Filter your dashboard by that session_id to see the entire conversation history grouped together.
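Here is a minimal sketch of the Prompt A/B Test, reusing the model object from Step 2; the prompt wording and tag names are only examples:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

question = "Explain what a vector database is."

# Version 1: bare-bones prompt
prompt_v1 = ChatPromptTemplate.from_template("Answer: {question}")

# Version 2: detailed prompt
prompt_v2 = ChatPromptTemplate.from_template(
    "You are a patient teacher. Explain the answer to the following question "
    "in simple terms, with one concrete example: {question}"
)

chain_v1 = prompt_v1 | model | StrOutputParser()
chain_v2 = prompt_v2 | model | StrOutputParser()

# Tag each run so the two versions are easy to compare side by side in LangSmith
chain_v1.invoke({"question": question}, config={"tags": ["prompt_v1"]})
chain_v2.invoke({"question": question}, config={"tags": ["prompt_v2"]})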

---

Challenge Project: The Bottleneck Hunter

You have been hired to optimize a slow AI script.

The Scenario:

You have a script that generates a blog post. It currently does the following sequentially:

1. Generates a title.
2. Generates an outline.
3. Generates Section 1.
4. Generates Section 2.
5. Generates Section 3.

Your Task:

1. Write this sequential chain using gpt-4o-mini.
2. Run it and observe the total latency in LangSmith.
3. Refactor the code to run the section generations in parallel (using RunnableParallel or asyncio).
4. Run the optimized version.
5. Compare the two traces.

Requirements:

* Tag the first run ["strategy:sequential"].
* Tag the second run ["strategy:parallel"].
* The output must be a printed report: "Sequential took X seconds. Parallel took Y seconds."

Hint:

In LangSmith's "Waterfall" view, the parallel version should show the section bars overlapping (starting at the same time), whereas the sequential version looks like a long staircase where each step starts only after the previous one ends.
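If you have not used RunnableParallel before, here is a minimal sketch of the parallel fan-out for the three sections. It reuses the model object from Step 2, the prompts are placeholders, and the title/outline steps plus the timing report are left for you:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel

def section_chain(n: int):
    # Placeholder prompt; swap in your real outline-driven prompt
    p = ChatPromptTemplate.from_template(
        f"Write section {n} of a blog post about {{topic}}."
    )
    return p | model | StrOutputParser()

# All three section generations start at the same time
parallel_sections = RunnableParallel(
    section_1=section_chain(1),
    section_2=section_chain(2),
    section_3=section_chain(3),
)

result = parallel_sections.invoke(
    {"topic": "observability for LLM apps"},
    config={"tags": ["strategy:parallel"]},
)
print(result["section_1"][:80])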

---

What You Learned

Today you stopped guessing and started measuring.

* Tracing: You learned how to visualize the flow of data through chains and agents using LANGCHAIN_TRACING_V2.

* Debugging: You saw how to pinpoint exactly where an error occurred in a complex logic flow.

* Metadata: You learned to tag runs to organize your data by user, environment, or experiment.

* Performance: You learned to look at Latency and Token Usage to understand cost and speed.

Why This Matters:

In a production environment, "it works on my machine" isn't enough. You need to know if your AI is costing you $10 a day or $1000 a day. You need to know if users are getting errors when you aren't looking. Observability gives you the confidence to deploy.

Tomorrow:

Now that you can see how much your repetitive queries are costing you, you probably want to lower that bill. Tomorrow, we cover Caching: how to save money and time by remembering answers you've already generated.