Production Streaming
What You'll Build Today
If you have used ChatGPT or Claude, you know the feeling. You ask a question, and the answer starts appearing immediately, word by word, as if a ghost is typing it out for you. It feels fast, responsive, and alive.
Now, compare that to the web apps we have built so far. You click a button, the browser loading spinner starts turning, and you wait. And wait. And wait. Ten seconds later, the entire paragraph slams onto the screen at once.
Today, we are going to fix that. We are going to build the "Ghost in the Machine" effect.
You will build a FastAPI server that connects to an LLM and streams the response to a web browser in real-time. You will create a simple frontend that catches these chunks of text and displays them instantly.
Here is what you will learn and why:
* Server-Sent Events (SSE): The standard web protocol for one-way communication from server to client. This is how we keep a connection open to push text.
* Python Generators: You have used return to send data back. Today you will use yield. This allows your function to hand over data piece by piece without finishing the whole job first.
* FastAPI StreamingResponse: The specific tool in our web framework that handles open connections and trickling data.
* Asynchronous Iteration: How to loop through data that doesn't exist yet (because the AI is still thinking of it).
The Problem
Let's look at how we have been building APIs up until now.
Imagine you are building a story generator. You ask the AI to write a 500-word story. Even for a fast model, generating 500 words might take 5 to 10 seconds.
Here is the standard "Request/Response" code we usually write:
from fastapi import FastAPI
from pydantic import BaseModel
import time

app = FastAPI()

class Prompt(BaseModel):
    text: str

# This simulates a slow LLM generation
def slow_llm_generation():
    result = []
    words = ["Once", "upon", "a", "time", "in", "a", "digital", "land..."]
    for word in words:
        time.sleep(1)  # Simulate thinking time
        result.append(word)
    return " ".join(result)

@app.post("/generate")
def generate_story(prompt: Prompt):
    # The server creates the WHOLE response first
    story = slow_llm_generation()
    # Then sends it all at once
    return {"story": story}
The Pain:
When a user calls /generate, their browser hangs. It looks like the website has crashed. They stare at a white screen for 8 seconds (1 second per word). Then, suddenly, the whole sentence appears.
If the generation takes 30 seconds, most users will close the tab before they see a single word. This is a terrible user experience. We need a way to send the word "Once" the moment it is ready, then "upon" a second later, and so on.
There has to be a way to keep the HTTP line open and trickle data down as it is created.
Let's Build It
We are going to switch from a "Request/Response" model to a "Streaming" model using Server-Sent Events (SSE).
Step 1: Understanding the Generator
To stream data, we cannot use return. When a function hits return, it is done. It closes up shop, and its local state is thrown away.
We need a function that can hand over a value, pause its execution, keep its state, and resume when asked for the next value. In Python, we do this with the yield keyword. A function that uses yield is called a Generator.
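Before we make it asynchronous, here is a tiny standalone sketch of that pause-and-resume behavior (plain Python, nothing FastAPI-specific, with illustrative names of my choosing):

def count_chunks():
    print("Starting up")
    yield "first"   # Hand over a value, then pause right here
    print("Resumed after the first yield")
    yield "second"  # Pause again
    yield "third"

gen = count_chunks()   # Nothing runs yet; we just get a generator object
print(next(gen))       # Prints "Starting up", then "first"
print(next(gen))       # Prints "Resumed after the first yield", then "second"
print(next(gen))       # Prints "third"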
Let's write a simple generator that simulates an AI.
import asyncio

# Notice we use 'async' because we want to play nice with FastAPI later
async def fake_ai_generator():
    words = ["This ", "is ", "coming ", "in ", "real ", "time."]
    for word in words:
        await asyncio.sleep(0.5)  # Simulate network lag/generation time
        print(f"Server is yielding: {word}")
        yield word

# You can't just call this function. You have to iterate over it.
# If you ran this in a script:
# async for chunk in fake_ai_generator():
#     print(chunk)
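If you want to watch the generator work before FastAPI gets involved, a small wrapper script can drive it with asyncio.run. This is just a quick sketch, assuming fake_ai_generator from above lives in the same file:

import asyncio

async def main():
    # Consume the generator chunk by chunk, exactly the way FastAPI will later
    async for chunk in fake_ai_generator():
        print(chunk, end="", flush=True)

asyncio.run(main())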
Step 2: The Streaming Endpoint
Now we need to tell FastAPI to use this generator. We cannot return a dictionary or a string anymore. We must return a StreamingResponse.
We also need to set the media_type to text/event-stream. This tells the browser: "Don't close the connection after the first packet. Keep listening."
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio

app = FastAPI()

async def fake_ai_generator():
    words = ["Hello ", "human. ", "I ", "am ", "streaming ", "now."]
    for word in words:
        await asyncio.sleep(0.5)
        # In SSE, it's safer to encode string data explicitly
        yield word.encode("utf-8")

@app.get("/stream")
async def stream_endpoint():
    # We pass the generator function (called) to StreamingResponse
    return StreamingResponse(fake_ai_generator(), media_type="text/event-stream")

# Save this as main.py and run with: uvicorn main:app --reload
If you visit http://localhost:8000/stream in your browser now, you will see the words appear one by one (depending on your browser, it might buffer slightly, but standard tools like curl show it instantly).
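One caveat before moving on: strictly speaking, the SSE protocol frames each message as a line starting with data: followed by a blank line, which is what the browser's built-in EventSource API expects. Yielding raw text works fine with the fetch-and-reader approach we use later, but if you ever want to consume the stream with EventSource, a small wrapper would handle the framing. The sse_format helper below is just an illustrative name of mine, not a library function:

def sse_format(chunk: str) -> str:
    # SSE wire format: each message is "data: <payload>" followed by a blank line
    return f"data: {chunk}\n\n"

async def sse_generator():
    # Reuses fake_ai_generator from the endpoint above
    async for word in fake_ai_generator():
        yield sse_format(word)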
Step 3: Integrating a Real LLM (OpenAI)
Now let's replace our fake list of words with real data from OpenAI. The OpenAI library allows us to set stream=True. This returns an iterator (a stream) instead of a finished object.
import os
from openai import AsyncOpenAI
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

async def openai_stream_generator(user_prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": user_prompt}],
        stream=True,  # <--- This is the magic switch
    )
    # We iterate through the stream as chunks arrive
    async for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            yield content

@app.get("/ask-ai")
async def ask_ai(prompt: str):
    return StreamingResponse(
        openai_stream_generator(prompt),
        media_type="text/event-stream",
    )
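You can sanity-check this endpoint from a terminal before touching any frontend code. Below is a small sketch using the httpx library (an extra dependency, installed with pip install httpx), which can iterate over the response body as it arrives instead of waiting for it to finish. You could also use curl -N, where -N turns off curl's output buffering.

import httpx

# Stream the response body instead of waiting for it to complete
with httpx.stream(
    "GET",
    "http://localhost:8000/ask-ai",
    params={"prompt": "Tell me a two-sentence story"},
    timeout=None,  # LLM responses can take a while
) as response:
    for chunk in response.iter_text():
        print(chunk, end="", flush=True)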
Step 4: The Frontend Client
A streaming API is useless if the frontend waits for the whole thing to finish. We need a simple HTML page that uses JavaScript's EventSource (or fetch with a reader) to handle the incoming data.
We will serve a simple HTML page directly from FastAPI for this demonstration.
Full Runnable Code (main.py):

import os
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse, HTMLResponse
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# 1. The HTML Client
html_content = """
<!DOCTYPE html>
<html>
<head>
    <title>GenAI Stream</title>
</head>
<body>
    <h1>Streaming AI Response</h1>
    <input type="text" id="prompt" size="50" placeholder="Ask me anything...">
    <button onclick="generate()">Generate</button>
    <div id="output"></div>
    <script>
        async function generate() {
            const prompt = document.getElementById("prompt").value;
            const output = document.getElementById("output");
            output.textContent = "";

            // Open the stream and read it chunk by chunk
            const response = await fetch("/ask-ai?prompt=" + encodeURIComponent(prompt));
            const reader = response.body.getReader();
            const decoder = new TextDecoder();

            while (true) {
                const { done, value } = await reader.read();
                if (done) break;
                output.textContent += decoder.decode(value);
            }
        }
    </script>
</body>
</html>
"""
# 2. The Generator
async def openai_stream_generator(user_prompt: str):
    try:
        stream = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_prompt}],
            stream=True,
        )
        async for chunk in stream:
            if chunk.choices[0].delta.content is not None:
                yield chunk.choices[0].delta.content
    except Exception as e:
        yield f"\n[Error: {str(e)}]"

# 3. The Endpoints
@app.get("/")
async def get_page():
    return HTMLResponse(content=html_content)

@app.get("/ask-ai")
async def ask_ai(prompt: str):
    return StreamingResponse(
        openai_stream_generator(prompt),
        media_type="text/event-stream",
    )

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Run this code. Go to http://localhost:8000. Type "Write a poem about Python code" and hit Generate. Watch it appear word for word!
Now You Try
You have the basics. Now let's make it robust.
* Modify openai_stream_generator to accept a system prompt. Hardcode the system prompt to be a "Pirate Captain." Verify that the stream comes back in pirate-speak.
* Update the frontend so that when the AI sends a newline (\n), it appends a <br> tag instead of just text content. Hint: You will need to change .textContent to .innerHTML.
* Watch what happens when the stream finishes, and confirm that the read loop (while (true)) breaks cleanly instead of spinning forever.

Challenge Project
Task: Build a "Cancellable Stream."

One of the biggest benefits of streaming is that if the user sees the answer is going in the wrong direction, they can stop it immediately to save your API costs.
Requirements:
* Add a "Stop" button to the page that cancels the in-flight fetch request.
* Add a print("Chunk sent") inside your server generator so you can watch in the terminal whether generation actually stops when the user cancels.
* In JavaScript, look up AbortController. You pass an abortController.signal to the fetch call.
* When you call controller.abort(), the fetch throws an error. You will need a try/catch block in your JavaScript.
* FastAPI/Starlette handles client disconnects automatically by raising an error in the generator when it tries to yield to a closed connection. You don't need complex server code, just ensure your generator doesn't swallow errors silently.
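If you want to see the disconnect on the server side explicitly, Starlette also exposes the request object, and its Request.is_disconnected() method can be polled inside the generator. Here is a minimal sketch of that idea; the guarded_stream name and /guarded-stream route are mine, and it reuses the fake_ai_generator from Step 1:

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

async def guarded_stream(request: Request):
    async for word in fake_ai_generator():
        # Stop generating (and stop paying for tokens) once the client has gone away
        if await request.is_disconnected():
            print("Client disconnected, stopping generation")
            break
        yield word

@app.get("/guarded-stream")
async def guarded_endpoint(request: Request):
    return StreamingResponse(guarded_stream(request), media_type="text/event-stream")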
What You Learned
Today you moved from "Batch Processing" (waiting for the whole job) to "Stream Processing" (handling data as it flows).
* Generators (yield): You learned that functions can pause and resume, enabling memory-efficient data processing.
* StreamingResponse: You learned how to keep an HTTP connection open to push data.
* Client Consumption: You built a frontend that decodes binary streams into text in real-time.
Why This Matters: In production GenAI apps, latency is the enemy. Streaming doesn't make the AI think faster, but it makes the perceived speed much faster. It engages the user immediately. Furthermore, for long generations (like writing code or essays), streaming prevents the HTTP timeouts that happen when a request takes too long.
Tomorrow: Security hardening. Now that your AI is on the web, people will try to break it. We will learn how to protect your prompts and your wallet.