Day 30 of 80

Streaming & Token Management

Phase 3: LLM Landscape & APIs

What You'll Build Today

Welcome to Day 30. You have made it incredibly far. You know how to talk to an LLM, how to give it personality, and how to build a basic chatbot. But if you were to release your chatbot to the public right now, you would face two major complaints:

  • "It's too slow. I stare at a blank screen for 10 seconds before it answers."
  • "I have no idea how much this conversation is costing me."
  • Today, we are going to fix both of those issues. We are going to build a Real-Time Streaming Chatbot with Cost Tracking.

    Here is exactly what you will learn and why:

    * Streaming Responses: Instead of waiting for the full answer, you will learn to print text character-by-character as it is generated. This is the difference between a frustrating loading spinner and a magical "typing" effect.

    * Token Management (tiktoken): You will learn how LLMs actually read text (it is not by word). You will use the tiktoken library to count exactly how much "memory" you are using.

    * Cost Calculation: You will write logic to calculate the price of every interaction in real-time, so you never get a surprise bill.

    * Context Window Management: You will learn what happens when a conversation gets too long and how to handle the limit.

    This is the day your code goes from "hobby script" to "professional application behavior." Let's get started.

    The Problem

    Let's look at the pain point first. Imagine you ask ChatGPT to write a 500-word essay.

    If you use the code we have written in previous days, your Python script sends the request and then... waits. Your program freezes. The user sees a blinking cursor. Five seconds pass. Ten seconds. The user thinks the program has crashed. Finally, bam, a giant wall of text appears all at once.

    Here is the code that causes this bad user experience. Do not run this, just look at it:

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

print("User: Write a short story about a robot learning to paint.")
print("AI is thinking...")

# THE PAIN POINT: The code blocks here for 10+ seconds
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a short story about a robot learning to paint."}]
)

# Only after the WHOLE story is generated does this print run
print(response.choices[0].message.content)
```

    Why this hurts:
  • Perceived Latency: Even if the AI is fast, waiting for the entire response makes it feel slow.
  • No Feedback: The user doesn't know if the request failed or is just processing.
  • The "Black Box" of Cost: You have no idea how many tokens that story used until after you paid for it. If the AI hallucinates a 5,000-word novel, you pay for all of it.
There has to be a way to see the text as it is being created, just like on the ChatGPT website. And there has to be a way to measure the size of the request before we send it.

    Let's Build It

    We are going to solve this in steps. First, we will fix the speed perception using Streaming. Then, we will tackle the Token Counting.

    Step 1: Setting Up the Environment

    First, we need to install a new library. OpenAI doesn't count words; it counts "tokens" (chunks of characters). To count them accurately in Python, we need a library called tiktoken.

    Action: Open your terminal and install the libraries:

```bash
pip install openai tiktoken
```

    Now, create a new file called streaming_bot.py. We will start with our standard imports.

```python
import os
import time

from openai import OpenAI

# Initialize the client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Let's define a model to use.
# GPT-4o-mini is fast and cheap, perfect for testing.
MODEL = "gpt-4o-mini"
```

Step 2: Implementing Streaming

    To stream text, we make a small change to our API call: stream=True.

    When we do this, the API doesn't return a single response object. Instead, it returns an "iterator"—a stream of tiny data chunks. We have to loop through these chunks as they arrive.

    Add this code to your file to see the raw chunks:

    print("--- STREAMING DEMO ---")
    
    # 1. Create the request with stream=True
    

    stream = client.chat.completions.create(

    model=MODEL,

    messages=[{"role": "user", "content": "Count to 5 and tell me a joke."}],

    stream=True # <--- This is the magic switch

    )

    # 2. Iterate through the stream

    print("Receiving chunks...")

    for chunk in stream:

    # 3. Extract the content delta # In streaming, we look for 'delta', not 'message'

    content = chunk.choices[0].delta.content

    if content:

    print(f"Chunk received: '{content}'")

    Run this code.

    You will see the output looks strange. It prints each word (or part of a word) on a new line.

Output example:

```text
Chunk received: 'Count'
Chunk received: 'ing'
Chunk received: ' to'
Chunk received: ' '
Chunk received: '5'
...
```

    This proves it's working! We are receiving data instantly, piece by piece.

    Step 3: Making it Look Like Typing

    The previous output was ugly. We want the text to flow across the screen. To do this, we use the Python print function with the argument end="". By default, print adds a newline. We tell it to add nothing instead, so the next chunk prints right next to the current one.

    Replace the loop in your code with this version:

    print("\n--- SMOOTH STREAMING ---")
    
    

    stream = client.chat.completions.create(

    model=MODEL,

    messages=[{"role": "user", "content": "Explain quantum physics in one sentence."}],

    stream=True

    )

    print("AI: ", end="") # Start the line

    full_response = ""

    for chunk in stream:

    content = chunk.choices[0].delta.content

    if content:

    # Print to console without a newline

    print(content, end="", flush=True)

    # We also need to save the text to memory to use it later

    full_response += content

    print() # Print a final newline at the very end

    print("-" * 20)

    print(f"Final stored string: {full_response}")

Why flush=True?

    Sometimes Python buffers output (saves it up to print in a big batch). flush=True forces Python to put the text on the screen immediately, which is crucial for that smooth typing effect.
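You can see the difference without calling the API at all. The snippet below is a small, self-contained sketch (not part of the chatbot) that simulates the typing effect locally with a short delay, just to demonstrate what end="" and flush=True do:

```python
import time

fake_response = "Streaming just means printing each piece as soon as it arrives."

for character in fake_response:
    # end="" keeps everything on one line; flush=True pushes each character to the screen immediately
    print(character, end="", flush=True)
    time.sleep(0.02)  # small delay so the "typing" is visible

print()  # final newline once the fake "stream" is done
```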

    Step 4: Counting Tokens with Tiktoken

    Now that we have the visual part handled, let's talk about the hidden cost.

    LLMs read tokens. A token is roughly 0.75 words.

    * "apple" = 1 token

    * "ing" = 1 token

    * "The" = 1 token

    If you send a prompt, you need to know how many tokens it is to estimate the cost. We use tiktoken for this.
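If you want to see token boundaries with your own eyes, here is an optional sketch using tiktoken directly (it assumes the cl100k_base encoding, the same one used as a fallback later in this lesson) that decodes each token back into its text fragment:

```python
import tiktoken

# cl100k_base is a common OpenAI encoding; it is also the fallback used in this lesson
encoding = tiktoken.get_encoding("cl100k_base")

sentence = "Streaming is super cool!"
token_ids = encoding.encode(sentence)

# Decode each token individually to see how the sentence was chopped up
fragments = [encoding.decode([token_id]) for token_id in token_ids]

print(f"Token IDs:   {token_ids}")
print(f"Fragments:   {fragments}")
print(f"Token count: {len(token_ids)}")
```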

    Add this helper function to the top of your script:

```python
import tiktoken

def count_tokens(text, model="gpt-4o-mini"):
    try:
        # Get the encoding for the specific model
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fallback if model not found
        encoding = tiktoken.get_encoding("cl100k_base")
    # Encode the text into a list of integers (tokens)
    token_integers = encoding.encode(text)
    # Return the length of that list
    return len(token_integers)

# Test it out
test_sentence = "Streaming is super cool!"
print(f"\nSentence: '{test_sentence}'")
print(f"Tokens: {count_tokens(test_sentence)}")
```

    Step 5: The Final Chatbot with Cost Calculation

    Now we combine everything. We will build a loop that:

  • Takes user input.
  • Counts input tokens.
  • Streams the response.
  • Counts output tokens.
  • Calculates and displays the cost.
Note: Pricing changes. For this exercise, we will assume roughly $0.15 per 1M input tokens and $0.60 per 1M output tokens (approximate pricing for GPT-4o-mini).
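To make the math concrete, here is a quick worked example using those assumed rates (the token counts are made up for illustration):

```python
# Assumed rates from above (USD per 1 million tokens)
INPUT_PRICE_PER_1M = 0.15
OUTPUT_PRICE_PER_1M = 0.60

# Hypothetical sizes for a single exchange
input_tokens = 1_200
output_tokens = 350

input_cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M     # 0.00018
output_cost = (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M  # 0.00021
print(f"Total: ${input_cost + output_cost:.6f}")                 # Total: $0.000390
```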

    Here is the complete, runnable application:

```python
import os
import tiktoken
from openai import OpenAI

# --- CONFIGURATION ---
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
MODEL = "gpt-4o-mini"
INPUT_PRICE_PER_1M = 0.15   # $0.15 per 1 million tokens
OUTPUT_PRICE_PER_1M = 0.60  # $0.60 per 1 million tokens

# --- HELPER FUNCTIONS ---
def count_tokens(text):
    """Returns the number of tokens in a text string."""
    try:
        encoding = tiktoken.encoding_for_model(MODEL)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

def calculate_cost(input_tokens, output_tokens):
    """Returns the cost in dollars."""
    input_cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M
    output_cost = (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M
    return input_cost + output_cost

# --- MAIN APP ---
def main():
    print(f"Starting Chatbot ({MODEL})...")
    print("Type 'quit' to exit.\n")

    # We keep chat history to maintain context
    chat_history = [
        {"role": "system", "content": "You are a helpful, concise assistant."}
    ]
    total_session_cost = 0.0

    while True:
        user_input = input("\nYou: ")
        if user_input.lower() in ['quit', 'exit']:
            break

        # Add user message to history
        chat_history.append({"role": "user", "content": user_input})

        # 1. Calculate Input Tokens (we have to count the WHOLE history)
        # In a real app, we'd count the JSON overhead too, but this is a good approximation
        full_conversation_text = "".join([m["content"] for m in chat_history])
        input_tokens = count_tokens(full_conversation_text)

        print("AI: ", end="")

        # 2. Stream the Response
        stream = client.chat.completions.create(
            model=MODEL,
            messages=chat_history,
            stream=True
        )

        full_response_text = ""
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
                full_response_text += content
        print()  # Newline after AI finishes

        # 3. Update History
        chat_history.append({"role": "assistant", "content": full_response_text})

        # 4. Calculate Output Tokens & Cost
        output_tokens = count_tokens(full_response_text)
        cost = calculate_cost(input_tokens, output_tokens)
        total_session_cost += cost

        # 5. Display Metrics
        print(f"\n[Metrics] In: {input_tokens} toks | Out: {output_tokens} toks | Cost: ${cost:.6f}")
        print(f"[Session Total]: ${total_session_cost:.6f}")

if __name__ == "__main__":
    main()
```

    Understanding the Output

    When you run this, notice two things:

  • Speed: The AI starts talking immediately.
  • Input Tokens: Watch the "In" tokens count grow. Even if you type "Hello", the input count includes every previous message in the history. This is why long conversations get expensive (a quick way to see this is sketched below).
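If you want to see why that happens, here is a small standalone sketch with a made-up history (not part of the chatbot) that counts tokens per message and cumulatively; in the real app you would run the same loop over chat_history:

```python
import tiktoken

def count_tokens(text):
    # Sketch only: use the cl100k_base fallback encoding for simplicity
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

# A made-up history to illustrate how the input grows each turn
chat_history = [
    {"role": "system", "content": "You are a helpful, concise assistant."},
    {"role": "user", "content": "Explain streaming in one sentence."},
    {"role": "assistant", "content": "Streaming sends the reply in small chunks as it is generated."},
    {"role": "user", "content": "Hello"},
]

running_total = 0
for i, message in enumerate(chat_history):
    tokens = count_tokens(message["content"])
    running_total += tokens
    print(f"message {i} ({message['role']}): {tokens} toks | cumulative input: {running_total} toks")
```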
Now You Try

    You have a professional engine. Now let's tune it. Try these three extensions:

  • The "Shut Up" Button:
  • Currently, you have to wait for the AI to finish. Modify the loop to check the length of full_response_text. If the AI generates more than 200 characters (tokens), break the loop programmatically to stop it from rambling.

  • The Budget Cap:
  • Add a check at the start of the while loop. If total_session_cost exceeds $0.001 (yes, it's small, but good for testing), print "Budget Exceeded" and automatically exit the program.

  • Token Speedometer:
  • Import the time module. Record start_time before the stream and end_time after. Calculate "Tokens Per Second" (Total Output Tokens / Time Duration) and print it in the metrics section. This is a common benchmark for LLM performance.
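Here is one way the Token Speedometer could look as a standalone script; in the chatbot you would wrap the same timing around the existing streaming loop. The prompt text and variable names are just for illustration:

```python
import os
import time
import tiktoken
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
MODEL = "gpt-4o-mini"

def count_tokens(text):
    try:
        encoding = tiktoken.encoding_for_model(MODEL)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

start_time = time.time()  # start the clock just before the stream begins

stream = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Explain streaming in two sentences."}],
    stream=True,
)

full_response_text = ""
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
        full_response_text += content
print()

elapsed = time.time() - start_time
output_tokens = count_tokens(full_response_text)

# Guard against dividing by zero on an instant (or empty) response
tokens_per_second = output_tokens / elapsed if elapsed > 0 else 0.0
print(f"[Speed] {output_tokens} toks in {elapsed:.2f}s = {tokens_per_second:.1f} toks/sec")
```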

    Challenge Project: The Infinite Memory (Summarizer)

    Here is the problem with the code above: The chat_history list keeps growing. Eventually, you will hit the model's "Context Window" limit (the maximum amount of text it can hold at once), and the program will crash.

    Your Goal: Build a "Self-Summarizing Chatbot." Requirements:
  • Set a MAX_HISTORY_TOKENS limit (e.g., 500 tokens for testing).
  • Before sending a request to the API, check if the current history exceeds this limit.
  • If it does:
    * Take the oldest 3 messages (excluding the system prompt).

    * Send them to the LLM in a separate API call with the instruction: "Summarize this conversation into one sentence."

    * Delete the old messages from the list.

    * Insert a new message with role "system" that says: "Previous conversation summary: [INSERT SUMMARY HERE]".

  • Then proceed with the normal chat.

Example Logic:

  • Current History: [System, User1, AI1, User2, AI2, User3] (too long!)
  • Action: Summarize [User1, AI1, User2] -> "User asked about Python and AI explained loops."
  • New History: [System, System(Summary), AI2, User3]

Hints:

    * You cannot stream the summary generation; just use a standard call for that part.

* You will need to manipulate the chat_history list (Python list slicing, e.g. chat_history[1:4], is your friend).

* Print a notice like `[!] Compressing history...` so the user knows memory management is happening (a rough sketch of this compression step follows below).
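If you get stuck, here is a rough sketch of what the compression step could look like. compress_history is a hypothetical helper name, and the rest of the chatbot (the main loop and the MAX_HISTORY_TOKENS check) is up to you:

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
MODEL = "gpt-4o-mini"

def compress_history(chat_history):
    """Summarize the 3 oldest non-system messages and return a shorter history list."""
    print("[!] Compressing history...")

    # The 3 oldest messages after the system prompt (index 0)
    old_messages = chat_history[1:4]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)

    # A separate, non-streaming call just for the summary
    summary_response = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation into one sentence:\n{transcript}",
        }],
    )
    summary = summary_response.choices[0].message.content

    # Rebuild: original system prompt + summary message + everything that wasn't summarized
    return (
        [chat_history[0],
         {"role": "system", "content": f"Previous conversation summary: {summary}"}]
        + chat_history[4:]
    )
```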

    What You Learned

    Today you moved from "scripting" to "engineering."

    * Streaming allows you to build interfaces that feel responsive, even when the model is thinking hard.

    * Tiktoken showed you the reality of how LLMs process data—it's not words, it's tokens.

    * Cost Calculation gave you the power to predict and control your API spend.

* Context Management (if you attempted the challenge) is the fundamental pattern used to make AI assistants "remember" things over long periods without overflowing the model's context window.

    Why This Matters:

    In the real world, no one waits 20 seconds for a chatbot to answer. And no business runs an AI application without monitoring the cost per user. You now have the foundational code to build production-ready interfaces.

    Phase 3 Complete!

    You have mastered the API, the parameters, the memory, and the stream.

    Tomorrow: We start Phase 4. We stop focusing on the code around the AI and start focusing on the brain of the AI. It's time for Prompt Engineering—programming with words.