Day 30 of 80

Streaming & Token Management

Phase 3: LLM Landscape & APIs

What You'll Build Today

Welcome to Day 30. You have made it incredibly far. You know how to talk to an LLM, how to give it personality, and how to build a basic chatbot. But if you were to release your chatbot to the public right now, you would face two major complaints:

  • "It's too slow. I stare at a blank screen for 10 seconds before it answers."
  • "I have no idea how much this conversation is costing me."
  • Today, we are going to fix both of those issues. We are going to build a Real-Time Streaming Chatbot with Cost Tracking.

    Here is exactly what you will learn and why:

    * Streaming Responses: Instead of waiting for the full answer, you will learn to print text character-by-character as it is generated. This is the difference between a frustrating loading spinner and a magical "typing" effect.

    * Token Management (tiktoken): You will learn how LLMs actually read text (it is not by word). You will use the tiktoken library to count exactly how much "memory" you are using.

    * Cost Calculation: You will write logic to calculate the price of every interaction in real-time, so you never get a surprise bill.

    * Context Window Management: You will learn what happens when a conversation gets too long and how to handle the limit.

    This is the day your code goes from "hobby script" to "professional application behavior." Let's get started.

    The Problem

    Let's look at the pain point first. Imagine you ask ChatGPT to write a 500-word essay.

    If you use the code we have written in previous days, your Python script sends the request and then... waits. Your program freezes. The user sees a blinking cursor. Five seconds pass. Ten seconds. The user thinks the program has crashed. Finally, bam, a giant wall of text appears all at once.

    Here is the code that causes this bad user experience. Do not run this, just look at it:

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

print("User: Write a short story about a robot learning to paint.")
print("AI is thinking...")

# THE PAIN POINT: The code blocks here for 10+ seconds
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a short story about a robot learning to paint."}]
)

# Only after the WHOLE story is generated does this print run
print(response.choices[0].message.content)
```

    Why this hurts:
  • Perceived Latency: Even if the AI is fast, waiting for the entire response makes it feel slow.
  • No Feedback: The user doesn't know if the request failed or is just processing.
  • The "Black Box" of Cost: You have no idea how many tokens that story used until after you paid for it. If the AI hallucinates a 5,000-word novel, you pay for all of it.
There has to be a way to see the text as it is being created, just like on the ChatGPT website. And there has to be a way to measure the size of the request before we send it.

    Let's Build It

    We are going to solve this in steps. First, we will fix the speed perception using Streaming. Then, we will tackle the Token Counting.

    Step 1: Setting Up the Environment

    First, we need to install a new library. OpenAI doesn't count words; it counts "tokens" (chunks of characters). To count them accurately in Python, we need a library called tiktoken.

    Action: Open your terminal and install the libraries:

```bash
pip install openai tiktoken
```

    Now, create a new file called streaming_bot.py. We will start with our standard imports.

```python
import os
import time

from openai import OpenAI

# Initialize the client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Let's define a model to use.
# GPT-4o-mini is fast and cheap, perfect for testing.
MODEL = "gpt-4o-mini"
```

Step 2: Implementing Streaming

    To stream text, we make a small change to our API call: stream=True.

    When we do this, the API doesn't return a single response object. Instead, it returns an "iterator"—a stream of tiny data chunks. We have to loop through these chunks as they arrive.

    Add this code to your file to see the raw chunks:

    print("--- STREAMING DEMO ---")
    
    # 1. Create the request with stream=True
    

    stream = client.chat.completions.create(

    model=MODEL,

    messages=[{"role": "user", "content": "Count to 5 and tell me a joke."}],

    stream=True # <--- This is the magic switch

    )

    # 2. Iterate through the stream

    print("Receiving chunks...")

    for chunk in stream:

    # 3. Extract the content delta # In streaming, we look for 'delta', not 'message'

    content = chunk.choices[0].delta.content

    if content:

    print(f"Chunk received: '{content}'")

    Run this code.

    You will see the output looks strange. It prints each word (or part of a word) on a new line.

Output example:

```text
Chunk received: 'Count'
Chunk received: 'ing'
Chunk received: ' to'
Chunk received: ' '
Chunk received: '5'
...
```

    This proves it's working! We are receiving data instantly, piece by piece.

    Step 3: Making it Look Like Typing

    The previous output was ugly. We want the text to flow across the screen. To do this, we use the Python print function with the argument end="". By default, print adds a newline. We tell it to add nothing instead, so the next chunk prints right next to the current one.

    Replace the loop in your code with this version:

    print("\n--- SMOOTH STREAMING ---")
    
    

    stream = client.chat.completions.create(

    model=MODEL,

    messages=[{"role": "user", "content": "Explain quantum physics in one sentence."}],

    stream=True

    )

    print("AI: ", end="") # Start the line

    full_response = ""

    for chunk in stream:

    content = chunk.choices[0].delta.content

    if content:

    # Print to console without a newline

    print(content, end="", flush=True)

    # We also need to save the text to memory to use it later

    full_response += content

    print() # Print a final newline at the very end

    print("-" * 20)

    print(f"Final stored string: {full_response}")

Why flush=True?

    Sometimes Python buffers output (saves it up to print in a big batch). flush=True forces Python to put the text on the screen immediately, which is crucial for that smooth typing effect.
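You can see the difference without calling the API at all. The snippet below is a small, self-contained sketch (not part of the chatbot) that simulates the typing effect locally with a short delay, just to demonstrate what end="" and flush=True do:

```python
import time

fake_response = "Streaming just means printing each piece as soon as it arrives."

for character in fake_response:
    # end="" keeps everything on one line; flush=True pushes each character to the screen immediately
    print(character, end="", flush=True)
    time.sleep(0.02)  # small delay so the "typing" is visible

print()  # final newline once the fake "stream" is done
```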

    Step 4: Counting Tokens with Tiktoken

    Now that we have the visual part handled, let's talk about the hidden cost.

    LLMs read tokens. A token is roughly 0.75 words.

    * "apple" = 1 token

    * "ing" = 1 token

    * "The" = 1 token

    If you send a prompt, you need to know how many tokens it is to estimate the cost. We use tiktoken for this.
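If you want to see token boundaries with your own eyes, here is an optional sketch using tiktoken directly (it assumes the cl100k_base encoding, the same one used as a fallback later in this lesson) that decodes each token back into its text fragment:

```python
import tiktoken

# cl100k_base is a common OpenAI encoding; it is also the fallback used in this lesson
encoding = tiktoken.get_encoding("cl100k_base")

sentence = "Streaming is super cool!"
token_ids = encoding.encode(sentence)

# Decode each token individually to see how the sentence was chopped up
fragments = [encoding.decode([token_id]) for token_id in token_ids]

print(f"Token IDs:   {token_ids}")
print(f"Fragments:   {fragments}")
print(f"Token count: {len(token_ids)}")
```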

    Add this helper function to the top of your script:

```python
import tiktoken

def count_tokens(text, model="gpt-4o-mini"):
    try:
        # Get the encoding for the specific model
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fallback if model not found
        encoding = tiktoken.get_encoding("cl100k_base")
    # Encode the text into a list of integers (tokens)
    token_integers = encoding.encode(text)
    # Return the length of that list
    return len(token_integers)

# Test it out
test_sentence = "Streaming is super cool!"
print(f"\nSentence: '{test_sentence}'")
print(f"Tokens: {count_tokens(test_sentence)}")
```

    Step 5: The Final Chatbot with Cost Calculation

    Now we combine everything. We will build a loop that:

  • Takes user input.
  • Counts input tokens.
  • Streams the response.
  • Counts output tokens.
  • Calculates and displays the cost.
Note: Pricing changes. For this exercise, we will assume roughly $0.15 per 1M input tokens and $0.60 per 1M output tokens (approximate pricing for GPT-4o-mini).
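To make the math concrete, here is a quick worked example using those assumed rates (the token counts are made up for illustration):

```python
# Assumed rates from above (USD per 1 million tokens)
INPUT_PRICE_PER_1M = 0.15
OUTPUT_PRICE_PER_1M = 0.60

# Hypothetical sizes for a single exchange
input_tokens = 1_200
output_tokens = 350

input_cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M     # 0.00018
output_cost = (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M  # 0.00021
print(f"Total: ${input_cost + output_cost:.6f}")                 # Total: $0.000390
```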

    Here is the complete, runnable application:

```python
import os
import tiktoken
from openai import OpenAI

# --- CONFIGURATION ---
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
MODEL = "gpt-4o-mini"
INPUT_PRICE_PER_1M = 0.15   # $0.15 per 1 million tokens
OUTPUT_PRICE_PER_1M = 0.60  # $0.60 per 1 million tokens

# --- HELPER FUNCTIONS ---
def count_tokens(text):
    """Returns the number of tokens in a text string."""
    try:
        encoding = tiktoken.encoding_for_model(MODEL)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

def calculate_cost(input_tokens, output_tokens):
    """Returns the cost in dollars."""
    input_cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M
    output_cost = (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M
    return input_cost + output_cost

# --- MAIN APP ---
def main():
    print(f"Starting Chatbot ({MODEL})...")
    print("Type 'quit' to exit.\n")

    # We keep chat history to maintain context
    chat_history = [
        {"role": "system", "content": "You are a helpful, concise assistant."}
    ]
    total_session_cost = 0.0

    while True:
        user_input = input("\nYou: ")
        if user_input.lower() in ['quit', 'exit']:
            break

        # Add user message to history
        chat_history.append({"role": "user", "content": user_input})

        # 1. Calculate Input Tokens (we have to count the WHOLE history)
        # In a real app, we'd count the JSON overhead too, but this is a good approximation
        full_conversation_text = "".join([m["content"] for m in chat_history])
        input_tokens = count_tokens(full_conversation_text)

        print("AI: ", end="")

        # 2. Stream the Response
        stream = client.chat.completions.create(
            model=MODEL,
            messages=chat_history,
            stream=True
        )

        full_response_text = ""
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
                full_response_text += content
        print()  # Newline after AI finishes

        # 3. Update History
        chat_history.append({"role": "assistant", "content": full_response_text})

        # 4. Calculate Output Tokens & Cost
        output_tokens = count_tokens(full_response_text)
        cost = calculate_cost(input_tokens, output_tokens)
        total_session_cost += cost

        # 5. Display Metrics
        print(f"\n[Metrics] In: {input_tokens} toks | Out: {output_tokens} toks | Cost: ${cost:.6f}")
        print(f"[Session Total]: ${total_session_cost:.6f}")

if __name__ == "__main__":
    main()
```

    Understanding the Output

    When you run this, notice two things:

  • Speed: The AI starts talking immediately.
  • Input Tokens: Watch the "In" tokens count grow. Even if you type "Hello", the input count includes every previous message in the history. This is why long conversations get expensive (a quick way to see this is sketched below).
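If you want to see why that happens, here is a small standalone sketch with a made-up history (not part of the chatbot) that counts tokens per message and cumulatively; in the real app you would run the same loop over chat_history:

```python
import tiktoken

def count_tokens(text):
    # Sketch only: use the cl100k_base fallback encoding for simplicity
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

# A made-up history to illustrate how the input grows each turn
chat_history = [
    {"role": "system", "content": "You are a helpful, concise assistant."},
    {"role": "user", "content": "Explain streaming in one sentence."},
    {"role": "assistant", "content": "Streaming sends the reply in small chunks as it is generated."},
    {"role": "user", "content": "Hello"},
]

running_total = 0
for i, message in enumerate(chat_history):
    tokens = count_tokens(message["content"])
    running_total += tokens
    print(f"message {i} ({message['role']}): {tokens} toks | cumulative input: {running_total} toks")
```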
Now You Try

    You have a professional engine. Now let's tune it. Try these three extensions:

  • The "Shut Up" Button:
  • Currently, you have to wait for the AI to finish. Modify the loop to check the length of full_response_text. If the AI generates more than 200 characters (tokens), break the loop programmatically to stop it from rambling.

  • The Budget Cap:
  • Add a check at the start of the while loop. If total_session_cost exceeds $0.001 (yes, it's small, but good for testing), print "Budget Exceeded" and automatically exit the program.

  • Token Speedometer:
  • Import the time module. Record start_time before the stream and end_time after. Calculate "Tokens Per Second" (Total Output Tokens / Time Duration) and print it in the metrics section. This is a common benchmark for LLM performance.
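Here is one way the Token Speedometer could look as a standalone script; in the chatbot you would wrap the same timing around the existing streaming loop. The prompt text and variable names are just for illustration:

```python
import os
import time
import tiktoken
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
MODEL = "gpt-4o-mini"

def count_tokens(text):
    try:
        encoding = tiktoken.encoding_for_model(MODEL)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

start_time = time.time()  # start the clock just before the stream begins

stream = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Explain streaming in two sentences."}],
    stream=True,
)

full_response_text = ""
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
        full_response_text += content
print()

elapsed = time.time() - start_time
output_tokens = count_tokens(full_response_text)

# Guard against dividing by zero on an instant (or empty) response
tokens_per_second = output_tokens / elapsed if elapsed > 0 else 0.0
print(f"[Speed] {output_tokens} toks in {elapsed:.2f}s = {tokens_per_second:.1f} toks/sec")
```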

    Challenge Project: The Infinite Memory (Summarizer)

    Here is the problem with the code above: The chat_history list keeps growing. Eventually, you will hit the model's "Context Window" limit (the maximum amount of text it can hold at once), and the program will crash.

    Your Goal: Build a "Self-Summarizing Chatbot." Requirements:
  • Set a MAX_HISTORY_TOKENS limit (e.g., 500 tokens for testing).
  • Before sending a request to the API, check if the current history exceeds this limit.
  • If it does:
    * Take the oldest 3 messages (excluding the system prompt).

    * Send them to the LLM in a separate API call with the instruction: "Summarize this conversation into one sentence."

    * Delete the old messages from the list.

    * Insert a new message with role "system" that says: "Previous conversation summary: [INSERT SUMMARY HERE]".

  • Then proceed with the normal chat.

Example Logic:

  • Current History: [System, User1, AI1, User2, AI2, User3] (too long!)
  • Action: Summarize [User1, AI1, User2] -> "User asked about Python and AI explained loops."
  • New History: [System, System(Summary), AI2, User3]

Hints:

    * You cannot stream the summary generation; just use a standard call for that part.

* You will need to manipulate the chat_history list (Python list slicing, e.g. chat_history[1:4], is your friend).

* Print a notice like `[!] Compressing history...` so the user knows memory management is happening (a rough sketch of this compression step follows below).
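If you get stuck, here is a rough sketch of what the compression step could look like. compress_history is a hypothetical helper name, and the rest of the chatbot (the main loop and the MAX_HISTORY_TOKENS check) is up to you:

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
MODEL = "gpt-4o-mini"

def compress_history(chat_history):
    """Summarize the 3 oldest non-system messages and return a shorter history list."""
    print("[!] Compressing history...")

    # The 3 oldest messages after the system prompt (index 0)
    old_messages = chat_history[1:4]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)

    # A separate, non-streaming call just for the summary
    summary_response = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation into one sentence:\n{transcript}",
        }],
    )
    summary = summary_response.choices[0].message.content

    # Rebuild: original system prompt + summary message + everything that wasn't summarized
    return (
        [chat_history[0],
         {"role": "system", "content": f"Previous conversation summary: {summary}"}]
        + chat_history[4:]
    )
```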

    What You Learned

    Today you moved from "scripting" to "engineering."

    * Streaming allows you to build interfaces that feel responsive, even when the model is thinking hard.

    * Tiktoken showed you the reality of how LLMs process data—it's not words, it's tokens.

    * Cost Calculation gave you the power to predict and control your API spend.

* Context Management (if you attempted the challenge) is the fundamental pattern used to make AI assistants "remember" things over long periods without overflowing the model's context window.

    Why This Matters:

    In the real world, no one waits 20 seconds for a chatbot to answer. And no business runs an AI application without monitoring the cost per user. You now have the foundational code to build production-ready interfaces.

    Phase 3 Complete!

    You have mastered the API, the parameters, the memory, and the stream.

    Tomorrow: We start Phase 4. We stop focusing on the code around the AI and start focusing on the brain of the AI. It's time for Prompt Engineering—programming with words.