Production Streaming
What You'll Build Today
If you have used ChatGPT or Claude, you know the feeling. You ask a question, and the answer starts appearing immediately, word by word, as if a ghost is typing it out for you. It feels fast, responsive, and alive.
Now, compare that to the web apps we have built so far. You click a button, the browser loading spinner starts turning, and you wait. And wait. And wait. Ten seconds later, the entire paragraph slams onto the screen at once.
Today, we are going to fix that. We are going to build the "Ghost in the Machine" effect.
You will build a FastAPI server that connects to an LLM and streams the response to a web browser in real-time. You will create a simple frontend that catches these chunks of text and displays them instantly.
Here is what you will learn and why:
* Server-Sent Events (SSE): The standard web protocol for one-way communication from server to client. This is how we keep a connection open to push text.
* Python Generators: You have used return to send data back. Today you will use yield. This allows your function to hand over data piece by piece without finishing the whole job first.
* FastAPI StreamingResponse: The specific tool in our web framework that handles open connections and trickling data.
* Asynchronous Iteration: How to loop through data that doesn't exist yet (because the AI is still thinking of it).
The Problem
Let's look at how we have been building APIs up until now.
Imagine you are building a story generator. You ask the AI to write a 500-word story. Even for a fast model, generating 500 words might take 5 to 10 seconds.
Here is the standard "Request/Response" code we usually write:
from fastapi import FastAPI
from pydantic import BaseModel
import time

app = FastAPI()

class Prompt(BaseModel):
    text: str

# This simulates a slow LLM generation
def slow_llm_generation():
    result = []
    words = ["Once", "upon", "a", "time", "in", "a", "digital", "land..."]
    for word in words:
        time.sleep(1)  # Simulate thinking time
        result.append(word)
    return " ".join(result)

@app.post("/generate")
def generate_story(prompt: Prompt):
    # The server creates the WHOLE response first
    story = slow_llm_generation()
    # Then sends it all at once
    return {"story": story}
The Pain:
When a user calls /generate, their browser hangs. It looks like the website has crashed. They stare at a white screen for 8 seconds (1 second per word). Then, suddenly, the whole sentence appears.
If the generation takes 30 seconds, most users will close the tab before they see a single word. This is a terrible user experience. We need a way to send the word "Once" the moment it is ready, then "upon" a second later, and so on.
There has to be a way to keep the HTTP line open and trickle data down as it is created.
Let's Build It
We are going to switch from a "Request/Response" model to a "Streaming" model using Server-Sent Events (SSE).
Step 1: Understanding the Generator
To stream data, we cannot use return. When a function hits return, it is done. It closes up shop, and its local state is thrown away.
We need a function that can hand over a value, pause its execution, keep its state, and resume when asked for the next value. In Python, we do this with the yield keyword. A function that uses yield is called a Generator.
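Before we make it asynchronous, here is a tiny standalone sketch of that pause-and-resume behavior (plain Python, nothing FastAPI-specific, with illustrative names of my choosing):

def count_chunks():
    print("Starting up")
    yield "first"   # Hand over a value, then pause right here
    print("Resumed after the first yield")
    yield "second"  # Pause again
    yield "third"

gen = count_chunks()   # Nothing runs yet; we just get a generator object
print(next(gen))       # Prints "Starting up", then "first"
print(next(gen))       # Prints "Resumed after the first yield", then "second"
print(next(gen))       # Prints "third"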
Let's write a simple generator that simulates an AI.
import asyncio

# Notice we use 'async' because we want to play nice with FastAPI later
async def fake_ai_generator():
    words = ["This ", "is ", "coming ", "in ", "real ", "time."]
    for word in words:
        await asyncio.sleep(0.5)  # Simulate network lag/generation time
        print(f"Server is yielding: {word}")
        yield word

# You can't just call this function. You have to iterate over it.
# If you ran this in a script:
# async for chunk in fake_ai_generator():
#     print(chunk)
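If you want to watch the generator work before FastAPI gets involved, a small wrapper script can drive it with asyncio.run. This is just a quick sketch, assuming fake_ai_generator from above lives in the same file:

import asyncio

async def main():
    # Consume the generator chunk by chunk, exactly the way FastAPI will later
    async for chunk in fake_ai_generator():
        print(chunk, end="", flush=True)

asyncio.run(main())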
Step 2: The Streaming Endpoint
Now we need to tell FastAPI to use this generator. We cannot return a dictionary or a string anymore. We must return a StreamingResponse.
We also need to set the media_type to text/event-stream. This tells the browser: "Don't close the connection after the first packet. Keep listening."
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio

app = FastAPI()

async def fake_ai_generator():
    words = ["Hello ", "human. ", "I ", "am ", "streaming ", "now."]
    for word in words:
        await asyncio.sleep(0.5)
        # In SSE, it's safer to encode string data explicitly
        yield word.encode("utf-8")

@app.get("/stream")
async def stream_endpoint():
    # We pass the generator function (called) to StreamingResponse
    return StreamingResponse(fake_ai_generator(), media_type="text/event-stream")

# Save this as main.py and run with: uvicorn main:app --reload
If you visit http://localhost:8000/stream in your browser now, you will see the words appear one by one (depending on your browser, it might buffer slightly, but standard tools like curl show it instantly).
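One caveat before moving on: strictly speaking, the SSE protocol frames each message as a line starting with data: followed by a blank line, which is what the browser's built-in EventSource API expects. Yielding raw text works fine with the fetch-and-reader approach we use later, but if you ever want to consume the stream with EventSource, a small wrapper would handle the framing. The sse_format helper below is just an illustrative name of mine, not a library function:

def sse_format(chunk: str) -> str:
    # SSE wire format: each message is "data: <payload>" followed by a blank line
    return f"data: {chunk}\n\n"

async def sse_generator():
    # Reuses fake_ai_generator from the endpoint above
    async for word in fake_ai_generator():
        yield sse_format(word)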
Step 3: Integrating a Real LLM (OpenAI)
Now let's replace our fake list of words with real data from OpenAI. The OpenAI library allows us to set stream=True. This returns an iterator (a stream) instead of a finished object.
import os
from openai import AsyncOpenAI
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

async def openai_stream_generator(user_prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": user_prompt}],
        stream=True,  # <--- This is the magic switch
    )
    # We iterate through the stream as chunks arrive
    async for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            yield content

@app.get("/ask-ai")
async def ask_ai(prompt: str):
    return StreamingResponse(
        openai_stream_generator(prompt),
        media_type="text/event-stream",
    )
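You can sanity-check this endpoint from a terminal before touching any frontend code. Below is a small sketch using the httpx library (an extra dependency, installed with pip install httpx), which can iterate over the response body as it arrives instead of waiting for it to finish. You could also use curl -N, where -N turns off curl's output buffering.

import httpx

# Stream the response body instead of waiting for it to complete
with httpx.stream(
    "GET",
    "http://localhost:8000/ask-ai",
    params={"prompt": "Tell me a two-sentence story"},
    timeout=None,  # LLM responses can take a while
) as response:
    for chunk in response.iter_text():
        print(chunk, end="", flush=True)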
Step 4: The Frontend Client
A streaming API is useless if the frontend waits for the whole thing to finish. We need a simple HTML page that uses JavaScript's EventSource (or fetch with a reader) to handle the incoming data.
We will serve a simple HTML page directly from FastAPI for this demonstration.
Full Runnable Code (main.py):

import os
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse, HTMLResponse
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# 1. The HTML Client
html_content = """
<!DOCTYPE html>
<html>
<head>
    <title>GenAI Stream</title>
</head>
<body>
    <h1>Streaming AI Response</h1>
    <input type="text" id="prompt" size="50" placeholder="Ask me anything...">
    <button onclick="generate()">Generate</button>
    <div id="output"></div>
    <script>
        async function generate() {
            const prompt = document.getElementById("prompt").value;
            const output = document.getElementById("output");
            output.textContent = "";

            // Open the stream and read it chunk by chunk
            const response = await fetch("/ask-ai?prompt=" + encodeURIComponent(prompt));
            const reader = response.body.getReader();
            const decoder = new TextDecoder();

            while (true) {
                const { done, value } = await reader.read();
                if (done) break;
                output.textContent += decoder.decode(value);
            }
        }
    </script>
</body>
</html>
"""
# 2. The Generator
async def openai_stream_generator(user_prompt: str):
    try:
        stream = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_prompt}],
            stream=True,
        )
        async for chunk in stream:
            if chunk.choices[0].delta.content is not None:
                yield chunk.choices[0].delta.content
    except Exception as e:
        yield f"\n[Error: {str(e)}]"

# 3. The Endpoints
@app.get("/")
async def get_page():
    return HTMLResponse(content=html_content)

@app.get("/ask-ai")
async def ask_ai(prompt: str):
    return StreamingResponse(
        openai_stream_generator(prompt),
        media_type="text/event-stream",
    )

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Run this code. Go to http://localhost:8000. Type "Write a poem about Python code" and hit Generate. Watch it appear word for word!
Now You Try
You have the basics. Now let's make it robust.
* Modify openai_stream_generator to accept a system prompt. Hardcode the system prompt to be a "Pirate Captain." Verify that the stream comes back in pirate-speak.
* Update the frontend so that when the AI sends a newline (\n), it appends a <br> tag instead of just text content. Hint: You will need to change .textContent to .innerHTML.
* Watch what happens when the stream finishes, and confirm that the read loop (while (true)) breaks cleanly instead of spinning forever.

Challenge Project
Task: Build a "Cancellable Stream."

One of the biggest benefits of streaming is that if the user sees the answer is going in the wrong direction, they can stop it immediately to save your API costs.
Requirements:
* Add a "Stop" button to the page that cancels the in-flight fetch request.
* Add a print("Chunk sent") inside your server generator so you can watch in the terminal whether generation actually stops when the user cancels.
* In JavaScript, look up AbortController. You pass an abortController.signal to the fetch call.
* When you call controller.abort(), the fetch throws an error. You will need a try/catch block in your JavaScript.
* FastAPI/Starlette handles client disconnects automatically by raising an error in the generator when it tries to yield to a closed connection. You don't need complex server code, just ensure your generator doesn't swallow errors silently.
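If you want to see the disconnect on the server side explicitly, Starlette also exposes the request object, and its Request.is_disconnected() method can be polled inside the generator. Here is a minimal sketch of that idea; the guarded_stream name and /guarded-stream route are mine, and it reuses the fake_ai_generator from Step 1:

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

async def guarded_stream(request: Request):
    async for word in fake_ai_generator():
        # Stop generating (and stop paying for tokens) once the client has gone away
        if await request.is_disconnected():
            print("Client disconnected, stopping generation")
            break
        yield word

@app.get("/guarded-stream")
async def guarded_endpoint(request: Request):
    return StreamingResponse(guarded_stream(request), media_type="text/event-stream")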
What You Learned
Today you moved from "Batch Processing" (waiting for the whole job) to "Stream Processing" (handling data as it flows).
* Generators (yield): You learned that functions can pause and resume, enabling memory-efficient data processing.
* StreamingResponse: You learned how to keep an HTTP connection open to push data.
* Client Consumption: You built a frontend that decodes binary streams into text in real-time.
Why This Matters: In production GenAI apps, latency is the enemy. Streaming doesn't make the AI think faster, but it makes the perceived speed much faster. It engages the user immediately. Furthermore, for long generations (like writing code or essays), streaming prevents the HTTP timeouts that happen when a request takes too long.
Tomorrow: Security hardening. Now that your AI is on the web, people will try to break it. We will learn how to protect your prompts and your wallet.