OpenAI API: Vision & JSON Mode
What You'll Build Today
Up until now, we have been treating AI like a very smart pen pal. We send it text, and it sends text back. But the real world isn't just text—it is visual. And modern applications don't just want chatty responses; they need structured data that code can understand.
Today, we are going to break out of the text-only box. You are going to build a Smart Food Analyzer. You will give your Python script a photo of a meal, and instead of just chatting about how tasty it looks, the AI will analyze the image and return a strict data structure containing the estimated calories, a list of ingredients, and the cuisine type.
Here are the concepts you will master:
* GPT-4o Vision Capabilities: Because describing a photo to an AI via text is tedious and inaccurate. We will let the AI "see" for itself.
* JSON Mode: Because your Python code cannot easily understand conversational paragraphs. We need the AI to speak the language of data (JSON).
* Base64 Encoding: Because computers need a way to turn an image file (binary data) into a text string that can be sent over the internet.
* The Seed Parameter: Because science requires reproducibility. We will learn how to force the AI to be more consistent for testing.
Let's give your AI eyes.
The Problem
Imagine you are building a fitness app. You want users to snap a photo of their lunch, and the app should automatically log the calories.
Without Vision capabilities, you would have to ask the user to type out everything they are eating. That is a bad user experience.
But even if you could send the image to an AI, you run into the "Chatty AI" problem.
Let's look at a scenario where you ask an older model (or a model without specific instructions) to analyze a salad.
# The "Old Way" - Frustrating and fragile
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Here is a list of ingredients in my salad: lettuce, croutons, parmesan, caesar dressing. Extract the ingredients and calories."}
    ]
)

print(response.choices[0].message.content)
Output:
> "That sounds like a delicious Caesar salad! Based on typical serving sizes, here is what you are looking at:
> 1. Lettuce
> 2. Croutons
> 3. Parmesan Cheese
>
> Total calories are likely around 400. Enjoy your meal!"
This output is great for a human, but it is a nightmare for your code.
If you want to save the calorie count to a database, you have to write Python code to find the number "400" in that sentence. What if the AI says "It's about four hundred calories"? Now your number finder fails. What if it adds a polite intro?
You end up writing code like this:
# PAIN: Trying to parse conversational text
text = response.choices[0].message.content

if "calories" in text:
    # Hope the number is right before the word calories?
    # This is fragile and will break often.
    calories = text.split("calories")[0].split()[-1]
This is the "Pain." You are fighting the AI's tendency to be conversational when what you really need is raw data.
There has to be a way to tell the AI: "Look at this image, don't chat with me, just give me the data in a format my code can actually use."
Let's Build It
We are going to solve this using OpenAI's gpt-4o model, which handles images natively, and JSON Mode, which forces the output into a clean data format.
Step 1: Setup and Basic Vision (URL)
First, let's see how to send an image to the API. The easiest way is to provide a URL to an image hosted online.
We will use a standard structure for the message. Instead of just content: "string", the content becomes a list of dictionaries, allowing us to mix text and images.
import os
from openai import OpenAI

# Initialize the client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# An image of a burger found online
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/RedDot_Burger_Music_City.jpg/640px-RedDot_Burger_Music_City.jpg"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_url,
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
Why this matters: You just bypassed the need for complex computer vision software. You sent a picture, and the AI understood it immediately.
Step 2: The Data Extraction (JSON Mode)
Now, let's solve the "Chatty AI" problem. We want to extract specific details about that burger: the ingredients list, the estimated calories, and a boolean (True/False) checking if it looks vegetarian.
To do this, we use the response_format parameter and set it to {"type": "json_object"}.
import json  # We need this standard library to parse the response

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are a nutritionist AI. Output your analysis in JSON format."
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze this image. Return a JSON object with keys: 'ingredients' (list of strings), 'calories' (integer), and 'is_vegetarian' (boolean)."},
                {
                    "type": "image_url",
                    "image_url": {"url": image_url},
                },
            ],
        }
    ],
    response_format={"type": "json_object"},  # This forces structured output
    temperature=0.0,  # Keep it factual
)

# The response is still a string, but it is a string formatted as JSON
json_string = response.choices[0].message.content
print("Raw String:\n", json_string)

# Convert the string into a real Python dictionary
data = json.loads(json_string)

print("\nParsed Data:")
print(f"Calories: {data['calories']}")
print(f"Vegetarian: {data['is_vegetarian']}")
print(f"First Ingredient: {data['ingredients'][0]}")
Why this matters: Look at how clean that Python code is at the end. No string splitting, no guessing. We treat the AI response just like a database query result.
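Once the response is a plain dictionary, it plugs straight into normal Python tooling. As a rough sketch of that idea, here is how you might log one result to a local SQLite database. (The food_log.db file, the meals table, and the sample values are invented for this example; adjust the schema to your own app.)

```python
import json
import sqlite3

# Example of what the parsed response might look like (values are illustrative).
data = {"ingredients": ["beef patty", "bun", "cheese"], "calories": 650, "is_vegetarian": False}

conn = sqlite3.connect("food_log.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS meals (ingredients TEXT, calories INTEGER, is_vegetarian INTEGER)"
)

# Store the ingredient list as a JSON string so it fits in a single text column.
conn.execute(
    "INSERT INTO meals (ingredients, calories, is_vegetarian) VALUES (?, ?, ?)",
    (json.dumps(data["ingredients"]), data["calories"], int(data["is_vegetarian"])),
)
conn.commit()
conn.close()
```

No string surgery anywhere: the dictionary keys map directly onto database columns.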
Step 3: Handling Local Images (Base64)
In a real app, users upload photos from their phones; they don't provide URLs. To send a local file to the API, we need to encode it.
Computers store images as binary data (0s and 1s). APIs expect text. Base64 is a way to represent that binary data using a specific set of text characters. It looks like a massive string of random letters.
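To see the idea on a tiny scale before dealing with whole image files, here is a quick demo using Python's built-in base64 module:

```python
import base64

# Five bytes of ordinary text become a short base64 string, and back again.
encoded = base64.b64encode(b"Hello")
print(encoded)                    # b'SGVsbG8='
print(base64.b64decode(encoded))  # b'Hello'
```

An image works exactly the same way, just with millions of bytes instead of five.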
Here is a helper function to encode a local image, and how to use it.
Note: For this step, you will need a file named my_food.jpg in your project folder. You can download any food image and rename it.
import base64

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your local image
local_image_path = "my_food.jpg"

# Get the base64 string
base64_image = encode_image(local_image_path)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What food is this? Return JSON with key 'food_name'."},
                {
                    "type": "image_url",
                    "image_url": {
                        # We must specify the format for base64
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    },
                },
            ],
        }
    ],
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)
Why this matters: You now have the power to process any image on your hard drive or any image a user uploads to your future web application.
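One caveat: the code above hardcodes data:image/jpeg, which is only correct for JPEG files. If users might upload PNGs or other formats, you could guess the MIME type from the file extension instead. Here is a minimal sketch using the standard-library mimetypes module; the helper name build_image_data_url is made up for this example.

```python
import base64
import mimetypes

def build_image_data_url(image_path):
    # Guess the MIME type from the file extension (e.g., image/png for .png).
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {image_path}")
    with open(image_path, "rb") as image_file:
        b64 = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{b64}"

# Example: the returned string can be dropped into the 'url' field shown above.
# data_url = build_image_data_url("my_food.png")
```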
Step 4: Reproducibility with Seeds
LLMs are non-deterministic. If you send the same image twice, you might get "500 calories" once and "550 calories" the next time. This makes testing hard.
OpenAI allows you to set a seed parameter (an integer). If you use the same seed, the model will try its best to return the exact same result every time.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant designed to output JSON."
        },
        {
            "role": "user",
            "content": "Generate a JSON list of 3 random colors."
        }
    ],
    response_format={"type": "json_object"},
    seed=12345  # Using a constant seed
)

print(response.choices[0].message.content)
If you run this code 5 times, you should get the same 3 colors every time. Remove the seed, and the results will likely change between runs.
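You can check this for yourself by making the same request twice and comparing the raw strings in code. Here is a small sketch that reuses the client object from Step 1; the helper name ask_for_colors exists only for this demo.

```python
def ask_for_colors(seed_value):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant designed to output JSON."},
            {"role": "user", "content": "Generate a JSON list of 3 random colors."},
        ],
        response_format={"type": "json_object"},
        seed=seed_value,
    )
    return response.choices[0].message.content

first = ask_for_colors(12345)
second = ask_for_colors(12345)

# With a fixed seed, the two raw strings should usually be identical.
print("Identical outputs:", first == second)
```

At the time of writing, the response object also exposes a system_fingerprint field; if that value changes between runs, OpenAI's backend configuration changed, and outputs may differ even with the same seed.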
Why this matters: When you are debugging your code, you want to know if a bug is caused by your logic or by the AI changing its mind. The seed helps you isolate variables.
Now You Try
You have the tools. Now extend the functionality.
1. Find an image of a receipt online. Write a script that takes the image and outputs a JSON object containing {"total_amount": float, "merchant_name": string, "date": string}. This is the core technology behind expense tracking apps.
2. Write a script that takes an image and returns JSON: {"is_safe_for_work": boolean, "description": string}. If the image contains something violent or inappropriate (you can test this with a picture of a scary movie poster), the boolean should be false.
3. Modify the Food Analyzer. What happens if you upload a picture of a shoe instead of food? Update your system prompt to say: "If the image is not food, return JSON {'error': 'Not food', 'is_food': false}." Test it with a non-food image.
Challenge Project: The "Common Thread" Detector
Your challenge is to build a script that mimics a sophisticated cognitive task: finding patterns across multiple visual inputs.
The Task: Create a script that accepts three different image URLs (e.g., a picture of a tent, a picture of a campfire, and a picture of a hiking boot).
Requirements:
* Send all three images in a single API call (the content list can have multiple image_url items).
* Use JSON Mode to force the output into this structure:

```json
{
    "common_theme": "Camping",
    "confidence_score": 0.98,
    "items_identified": ["Tent", "Fire", "Boot"]
}
```

* Parse the JSON string into a Python dictionary and print the "common_theme" in all caps.
Hints:
* Your content list will look like: [{text}, {image_url}, {image_url}, {image_url}] (see the sketch below).
* Don't forget to set detail: "low" in the image parameters if you want to save tokens (optional, but good practice).
* Remember to import json to parse the result.
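If you get stuck on the request structure, here is a rough sketch of just the content list. The URLs are placeholders, so swap in your own three images, and the rest of the API call is up to you.

```python
# Placeholder URLs -- replace these with your own three related images.
image_urls = [
    "https://example.com/tent.jpg",
    "https://example.com/campfire.jpg",
    "https://example.com/hiking_boot.jpg",
]

# Start with the text instruction, then append one image_url entry per picture.
content = [
    {
        "type": "text",
        "text": "What single theme connects these three images? Return JSON with keys: 'common_theme', 'confidence_score', 'items_identified'.",
    }
]
for url in image_urls:
    # detail: "low" asks the API to process a smaller version of each image to save tokens.
    content.append({"type": "image_url", "image_url": {"url": url, "detail": "low"}})

# 'content' can now be used as the user message in client.chat.completions.create(...)
```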
What You Learned
Today you bridged the gap between "chatting with a bot" and "building a software application."
* Vision: You learned that GPT-4o can process visual data just as easily as text.
* JSON Mode: You learned to tame the LLM's output so your Python code doesn't crash trying to read sentences.
* Base64: You learned the standard way to move binary files into text-based APIs.
* Seeds: You learned how to stabilize the AI for testing.
Why This Matters: Most real-world data is messy. It comes in PDFs, screenshots, and photos. Being able to ingest that visual data and convert it immediately into structured JSON is a superpower. It allows you to build apps that "do" things (calculate taxes, log food, sort inventory) rather than just talk about things.
Tomorrow: We switch gears to a different model family. You will meet Anthropic Claude, the "writer's AI," and see how it handles large amounts of text differently than GPT.