Voice Interfaces
What You'll Build Today
We are finally breaking the keyboard barrier. For the past 68 days, you have interacted with AI primarily through text. You type a prompt, you hit enter, you wait, you read.
But think about how you interact with a helpful friend. You don't pass notes back and forth. You speak, they listen, and they respond instantly. Today, we are going to build a full Voice-to-Voice AI Chatbot. It will listen to your microphone, understand your speech, think of a response, and speak it back to you.
Here is what we are covering:
* Speech-to-Text (STT): Because language models cannot hear. We need to convert sound waves into text strings using OpenAI's Whisper model.
* Text-to-Speech (TTS): Because language models cannot speak. We need to convert their text output back into high-quality audio.
* The Audio Loop: Because a conversation is a cycle. We need to chain these technologies together without losing context.
* Latency Management: Because silence is awkward. We will look at how long each step takes and why speed matters in voice interfaces.
By the end of this session, you will have a script running on your machine that you can talk to, just like Jarvis or the computer from Star Trek.
The Problem
Let's look at how we currently interact with our AI agents. Here is a standard loop we have written a dozen times:
# The "Old Way" - Text Interaction
def chat_loop():
print("AI: Hello! Type your message.")
while True:
# PAIN POINT 1: You have to stop what you are doing to type
user_input = input("You: ")
if user_input.lower() == "quit":
break
# PAIN POINT 2: It feels like a transaction, not a conversation
response = get_ai_response(user_input)
# PAIN POINT 3: You have to read the output
print(f"AI: {response}")
# It works, but it's friction.
# You can't do this while cooking, driving, or walking around the room.
The friction here is high. If you are building an AI assistant to help you debug hardware while your hands are full, or a sous-chef to help you cook, input() is useless.
Furthermore, if we try to solve this naively without modern AI tools, we run into the nightmare of raw audio processing. Before GenAI, if you wanted to detect voice, you had to deal with code that looked like this:
# The "Hard Way" - Attempting manual audio processing
# This is what life was like before modern APIs
import wave
import struct
def process_audio_legacy(file_path):
# Opening a wave file and trying to "understand" it mathematically
with wave.open(file_path, 'rb') as wf:
n_frames = wf.getnframes()
data = wf.readframes(n_frames)
# Unpacking raw bytes into integers
# This gives us numbers, but not meaning
samples = struct.unpack(f"{n_frames}h", data)
# How do we turn a list of 40,000 integers into the word "Hello"?
# Answer: You practically can't without massive deep learning models.
# We used to try Fourier Transforms and frequency matching. It was painful.
pass
We have two problems: typing is unnatural for dynamic tasks, and processing raw audio data manually is incredibly difficult. We need a way to abstract the "hearing" and "speaking" just like we abstracted the "thinking" with LLMs.
Let's Build It
We are going to build this pipeline in four distinct steps:
Prerequisites
You will need a few libraries to handle audio hardware. In your terminal, run:
```bash
pip install sounddevice scipy openai numpy soundfile
```
Note: You may need to install portaudio on your system if sounddevice gives you trouble (e.g., brew install portaudio on Mac or sudo apt-get install libportaudio2 on Ubuntu).
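Before moving on, it is worth confirming that Python can actually see your microphone. A quick sanity check (this only lists devices; the exact names and indices will differ on your machine):
```python
import sounddevice as sd

# List every audio device sounddevice can see.
# Your microphone should show up with more than 0 input channels.
print(sd.query_devices())

# The pair of default (input, output) device indices sounddevice will use
print("Default devices:", sd.default.device)
```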
Step 1: The Ears (Recording Audio)
First, we need to capture sound. We will use sounddevice to grab raw data from the microphone and scipy to save it as a .wav file. We need a file because the OpenAI API expects a file upload.
```python
import sounddevice as sd
import numpy as np
from scipy.io.wavfile import write
import os

# Configuration
FS = 44100        # Sample rate in Hz (standard for audio)
DURATION = 5      # Duration of recording in seconds (fixed for now)
OUTPUT_FILENAME = "input_audio.wav"

def record_audio(duration=DURATION, fs=FS):
    print("Listening... (Speak now)")
    # sd.rec records an array of audio samples
    # channels=1 means mono sound (sufficient for voice)
    # dtype='int16' keeps the file in standard 16-bit PCM, the most widely supported WAV format
    recording = sd.rec(int(duration * fs), samplerate=fs, channels=1, dtype='int16')
    # Wait until recording is finished
    sd.wait()
    print("Recording finished.")
    # Save as WAV file so we can send it to the API
    # write() converts the numpy array into a .wav file on disk
    write(OUTPUT_FILENAME, fs, recording)
    return OUTPUT_FILENAME

# Test it
if __name__ == "__main__":
    record_audio()
    print(f"Audio saved to {OUTPUT_FILENAME}. Play it to test your mic!")
```
Run this. Speak for 5 seconds. Find the input_audio.wav file in your folder and play it. If you hear yourself, your ears are working.
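If you would rather verify the recording without leaving Python, here is a small playback check using the libraries we just installed (it assumes the file from record_audio is in the current directory):
```python
import sounddevice as sd
from scipy.io.wavfile import read

# read() returns the sample rate and the raw sample array
fs, data = read("input_audio.wav")

# Play through the default output device and block until playback finishes
sd.play(data, fs)
sd.wait()
```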
Step 2: The Transcriber (Whisper)
Now we send that file to OpenAI's Whisper model. Whisper is incredible at handling accents, background noise, and technical jargon.
```python
from openai import OpenAI
import os

# Initialize client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def transcribe_audio(filename):
    print("Transcribing...")
    # Open the audio file in binary read mode
    with open(filename, "rb") as audio_file:
        # Call the Whisper model (whisper-1)
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )
    print(f"You said: {transcript.text}")
    return transcript.text

# Test it (assuming you ran Step 1 and have the file)
if __name__ == "__main__":
    transcribe_audio("input_audio.wav")
```
Run this. It should print out exactly what you said in the recording. Note that this takes a second or two—that is our first latency cost.
Step 3: The Brain (LLM)
This part you know well. We take the text from Step 2 and send it to GPT-4o.
```python
def get_chat_response(text_input):
    print("Thinking...")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful, conversational voice assistant. Keep answers concise (1-2 sentences)."},
            {"role": "user", "content": text_input}
        ]
    )
    answer = response.choices[0].message.content
    print(f"AI: {answer}")
    return answer
```
Why concise? Because listening to a 5-paragraph essay via text-to-speech is boring. We want a back-and-forth chat.
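The system prompt is the main lever for brevity, but if you want a hard ceiling as well, the Chat Completions API accepts a max_tokens cap. A sketch of tweaking the call inside get_chat_response (60 tokens is an arbitrary choice, roughly 40-45 English words):
```python
response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=60,  # hard cap on reply length; truncates rather than summarizes
    messages=[
        {"role": "system", "content": "You are a helpful, conversational voice assistant. Keep answers concise (1-2 sentences)."},
        {"role": "user", "content": text_input}
    ]
)
```
Keep the system prompt as the primary control; the cap is just a safety net, since a hard cutoff can clip a sentence mid-word.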
Step 4: The Mouth (TTS) and The Loop
Finally, we turn the AI's text back into audio and play it. We will use OpenAI's TTS API, then combine everything into a while loop.
We also need a way to play the audio back in Python. A simple approach is to use the sounddevice library again, together with soundfile to read the file we get back from OpenAI.
```python
import sounddevice as sd
import soundfile as sf  # Helper to read the audio file format
import time

def text_to_speech_and_play(text_input):
    print("Generating voice...")
    # 1. Generate the audio using OpenAI
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",  # Options: alloy, echo, fable, onyx, nova, shimmer
        input=text_input
    )
    # 2. Save to a temporary file
    speech_file = "output_speech.mp3"
    response.stream_to_file(speech_file)
    # 3. Play the file
    # (Decoding MP3 needs a recent soundfile/libsndfile; if sf.read() complains,
    #  request response_format="wav" from speech.create() and save as .wav instead.)
    print("Speaking...")
    data, fs = sf.read(speech_file)
    sd.play(data, fs)
    sd.wait()  # Wait for audio to finish playing

def main_loop():
    print("--- Voice Assistant Started ---")
    print("Press Ctrl+C to stop.")
    while True:
        try:
            # 1. Record
            # We'll use a fixed 4 seconds for simplicity in this loop
            audio_file = record_audio(duration=4)
            # 2. Transcribe
            user_text = transcribe_audio(audio_file)
            # Skip if we heard silence (Whisper sometimes hallucinates on silence)
            if not user_text or len(user_text) < 2:
                print("...heard silence...")
                continue
            # 3. Think
            ai_text = get_chat_response(user_text)
            # 4. Speak
            text_to_speech_and_play(ai_text)
            # Small pause before listening again
            time.sleep(1)
        except KeyboardInterrupt:
            print("\nGoodbye!")
            break

if __name__ == "__main__":
    main_loop()
```
Run this code. You now have a full conversation loop.
It records for 4 seconds.
It processes.
It replies.
It starts listening again.
Does it feel magical? Yes. Does it feel a bit slow? Also yes. That is the reality of cloud-based AI.
Now You Try
You have the skeleton. Now let's add some flesh to it. Try these three extensions:
The Personality Shift:
Modify the system prompt in get_chat_response and the voice parameter in text_to_speech_and_play. Create an assistant that is a grumpy old man (try voice "onyx" and a rude prompt) or a hyper-energetic cheerleader (try voice "nova"). How does the voice match the text?
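For example, the grumpy version might look something like this (the prompt wording is only a suggestion):
```python
# In get_chat_response: swap in a personality-heavy system prompt
GRUMPY_PROMPT = (
    "You are a grumpy old man who reluctantly answers questions. "
    "Grumble a little, but stay accurate. Keep answers to 1-2 sentences."
)

# In text_to_speech_and_play: pick a voice that matches the persona
response = client.audio.speech.create(
    model="tts-1",
    voice="onyx",   # deeper voice that suits the grumpy persona
    input=text_input
)
```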
The "Wake Word" Filter:
Currently, the bot responds to everything. Modify the loop so that if the transcription doesn't start with a specific word (e.g., "Jarvis" or "Computer"), the AI ignores it and loops back to recording.
Hint: Use if "jarvis" in user_text.lower(): before calling the LLM.
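A minimal version of that filter inside main_loop could look like this (the wake word and the messages are up to you):
```python
WAKE_WORD = "jarvis"

# ...inside main_loop, right after transcription...
if WAKE_WORD not in user_text.lower():
    print("(No wake word heard, ignoring.)")
    continue  # loop back to recording without calling the LLM

# Optional: remove the wake word so the LLM only sees the actual request
user_text = user_text.lower().replace(WAKE_WORD, "", 1).strip()
```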
Conversation Logger:
Voice is ephemeral. Once it's spoken, it's gone. Add a simple file writer that appends the transcript of the conversation (User text and AI text) to a conversation_history.txt file so you can read what was said later.
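One simple way to do this (the file name comes from the exercise; the timestamp format is a suggestion):
```python
from datetime import datetime

def log_turn(user_text, ai_text, path="conversation_history.txt"):
    """Append one conversational turn to a plain-text log file."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"[{timestamp}] You: {user_text}\n")
        f.write(f"[{timestamp}] AI:  {ai_text}\n")

# In main_loop, call it right after the AI responds:
# log_turn(user_text, ai_text)
```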
Challenge Project: The "2-Second" Latency Challenge
The biggest barrier to voice AI is latency (the delay between you stopping speaking and the AI starting to speak).
In our current code, the delay is likely 4-6 seconds. This feels like a walkie-talkie, not a phone call. Your challenge is to optimize this loop to get the response time under 2 seconds (or as close as possible).
Requirements:
* Add timers to measure how long each step takes (Transcription vs. LLM vs. TTS).
* Switch the LLM model to gpt-4o-mini (it is faster and cheaper).
* Change the TTS model to tts-1 (standard) instead of tts-1-hd, if you aren't already using it.
* Optimize the System Prompt: Tell the AI to be "Extremely brief. Max 10 words." Shorter text generates faster audio.
* Output: Print a "Latency Report" after every turn showing exactly how many milliseconds each step took.
Example Output:
```
Listening...
Transcribing... (0.8s)
Thinking... (0.5s)
Generating Voice... (0.6s)
Total Latency: 1.9s
Speaking: "Hello there!"
```
Hint: The biggest bottleneck is often the network. You can't fix your internet speed, but you can request less data (shorter text) to make the transfer faster.
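Here is one way to structure the measurement, a sketch built on time.perf_counter and the functions we wrote above. Note that timing text_to_speech_and_play as written also includes playback time; for a purer latency number, split generation and playback into separate functions.
```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}... ({elapsed:.1f}s)")
    return result, elapsed

# Inside the loop, wrap each stage:
user_text, t_stt = timed("Transcribing", transcribe_audio, audio_file)
ai_text, t_llm = timed("Thinking", get_chat_response, user_text)
_, t_tts = timed("Generating Voice", text_to_speech_and_play, ai_text)

print(f"Total Latency: {t_stt + t_llm + t_tts:.1f}s")
```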
What You Learned
Today you moved from text-based coding to audio-based interaction.
* Audio Input: You learned to capture raw waves using sounddevice and save them as files.
* Whisper API: You learned that speech-to-text is now a solved problem with high-accuracy models.
* TTS API: You learned to give your AI a voice.
* The Latency Bottleneck: You experienced the delay inherent in chaining three cloud APIs (STT -> LLM -> TTS) together.
Why This Matters: The future of AI isn't just chatbots in a browser. It is ambient computing—AI in your car, your kitchen, and your glasses. These interfaces rely entirely on voice. Understanding the pipeline of "Listen, Think, Speak" is the foundation of building accessible, futuristic hardware.
Tomorrow: We are going to tackle the privacy angle. Sending your voice to the cloud is convenient, but what if you want to run this entirely on your own laptop, offline? Tomorrow, we build a Private Local Assistant.