Google Gemini API
What You'll Build Today
Welcome to Day 25! Today, we are stepping into the ecosystem of Google Gemini. If you have been impressed by what text-based models can do, you are about to see something entirely different.
Today, we are building a Multimodal Video Analyst.
Instead of just pasting text into a prompt, you will write a program that can "watch" a video file and "listen" to the audio track, then answer questions about it. We will take a video clip, feed it into Google's Gemini Flash model, and ask it to generate a summary of the visual events and the spoken dialogue.
Here is what you will learn and why it matters:
* Google Generative AI SDK: You will learn how to communicate with Google's models using Python. This is necessary because every provider (OpenAI, Anthropic, Google) speaks a slightly different language.
* Native Multimodality: Most older models required you to turn images into text (using OCR) or audio into text (transcription) before the AI could understand them. Gemini is "multimodal native," meaning it understands pixels and sound waves directly.
* The File API: You will learn how to upload large assets (like videos or PDFs) to the cloud so the AI can process them.
* The Context Window: We will discuss "context," or how much information an AI can hold in its memory at once. Gemini offers a massive context window (over 1 million tokens), allowing you to analyze entire books or long videos in a single prompt.
Let's get started.
The Problem
Imagine you want to build an app that summarizes lecture videos for students. You have a 10-minute video file of a physics experiment. You want the AI to tell you what happened.
Before models like Gemini, this was an engineering nightmare. You had to stitch together three or four different technologies to make it work.
Here is what that "old way" code looks like. It is painful, brittle, and slow.
# THE OLD, PAINFUL WAY (Do not run this)
import cv2 # Computer Vision library
import speech_recognition as sr # Audio library
import openai # Text AI
def analyze_video_the_hard_way(video_path):
    # 1. Extract audio from video
    # (Requires installing ffmpeg, subprocess calls... messy)
    audio_file = extract_audio(video_path)

    # 2. Transcribe audio to text
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_file) as source:
        audio_data = recognizer.record(source)
        # This is slow and often inaccurate
        transcript = recognizer.recognize_google(audio_data)

    # 3. Extract frames (images) from video to see what's happening
    video = cv2.VideoCapture(video_path)
    success, image = video.read()
    descriptions = []
    # Loop through thousands of frames...
    while success:
        # We can't send every frame, it's too expensive.
        # So we create complex logic to pick "important" frames.
        # Then we send those images to an image-captioning model...
        descriptions.append(get_image_caption(image))
        success, image = video.read()

    # 4. Finally, stitch it all together
    prompt = f"Audio said: {transcript}. Visuals showed: {descriptions}"
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response
# Result:
# - You need to manage 5 different libraries.
# - If the audio is background noise, transcription fails.
# - You lose the connection between what was said and what was shown at that exact second.
# - It is incredibly slow.
Look at that complexity. You are acting as the glue between an audio tool, a vision tool, and a text tool. If any one of them fails, your program crashes.
Wouldn't it be better if the AI could just watch the video like a human does?
Let's Build It
We are going to use the Google Gemini API. Gemini comes in different sizes, but we will focus on Gemini 1.5 Flash. It is incredibly fast, very cheap, and has a massive context window suitable for video.
Step 1: Get Your API Key
To use Gemini, you need an API key from Google AI Studio.
Visit Google AI Studio, sign in with your Google account, and click "Get API key." Copy the generated key (it starts with AIza...).
Step 2: Installation and Setup
We need the google-generativeai library.
In your terminal:
```bash
pip install google-generativeai
```
Now, let's write a "Hello World" to ensure our connection works.
import os
import google.generativeai as genai
# Replace with your actual key
os.environ["GOOGLE_API_KEY"] = "YOUR_PASTED_KEY_HERE"
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
# Initialize the model
# We use 'gemini-1.5-flash' because it is fast and efficient
model = genai.GenerativeModel('gemini-1.5-flash')
# Basic text test
response = model.generate_content("Explain quantum physics to a 5 year old in one sentence.")
print(response.text)
Why this matters: This confirms your environment is configured correctly. If you see a sentence about quantum physics, you are ready to proceed.
Step 3: The File API (Uploading Media)
This is the most important concept of the day. When we want Gemini to analyze a video, we don't send the raw bytes directly in the prompt (that would crash your network). Instead, we upload the file to Google's temporary storage, get a handle (ID) for it, and give that ID to the model.
Note: For this step, please download a short .mp4 video file to your project folder. You can use a short clip from a royalty-free site like Pexels, or record a 10-second video of yourself talking.
Name the file
sample_video.mp4.
import google.generativeai as genai
import time
import os
# Configuration (ensure your key is set)
# os.environ["GOOGLE_API_KEY"] = "YOUR_KEY"
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
video_path = "sample_video.mp4"
print(f"Uploading {video_path}...")
# 1. Upload the file to Google's server
video_file = genai.upload_file(path=video_path)
print(f"Completed upload: {video_file.uri}")
# 2. Check processing state
# Videos need time to be processed by Google before the AI can see them.
# We create a loop to check the status.
while video_file.state.name == "PROCESSING":
    print("Processing video...")
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise ValueError(f"Video processing failed: {video_file.state.name}")

print(f"Video is ready! State: {video_file.state.name}")
Why this matters: Large files take time to process. If you try to ask the AI a question immediately after uploading, it will error out because the video isn't "watched" yet. This while loop ensures the asset is ready.
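Since you will repeat this upload-then-poll pattern for any large asset (video, audio, PDF), it is worth wrapping it in a small helper. Here is a minimal sketch; the function name upload_and_wait is my own, not part of the SDK:
```python
# A reusable helper sketch for the upload-then-poll pattern above.
# (upload_and_wait is a hypothetical name, not part of the SDK.)
import time
import google.generativeai as genai

def upload_and_wait(path, poll_seconds=5):
    """Upload a file and block until Google finishes processing it."""
    uploaded = genai.upload_file(path=path)
    while uploaded.state.name == "PROCESSING":
        time.sleep(poll_seconds)
        uploaded = genai.get_file(uploaded.name)
    if uploaded.state.name == "FAILED":
        raise ValueError(f"Processing failed for {path}")
    return uploaded

# Usage:
# video_file = upload_and_wait("sample_video.mp4")
```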
Step 4: Generating the Summary
Now that the file is uploaded and processed, we can pass the
video_file object directly to the model alongside our text prompt. This is the power of native multimodality.
# Initialize the model again
model = genai.GenerativeModel('gemini-1.5-flash')
print("Analyzing video...")
# 3. Generate content
# Notice we pass a list: [video_file, "The Prompt"]
response = model.generate_content([
    video_file,
    "Watch this video. Summarize what happens visually, and transcribe any spoken audio. Tell me the mood of the video."
])
print("\n=== AI ANALYSIS ===")
print(response.text)
# 4. Clean up
# It is good practice to delete the file from the cloud when done
genai.delete_file(video_file.name)
print("\n(Cloud file deleted)")
Why this matters: Notice how simple the prompt is. We didn't have to mention OCR or transcription. We just said "Watch this." The model handled the fusion of audio and visual data automatically.
Step 5: Handling Safety Settings
Sometimes, the model might refuse to answer if it thinks the video contains unsafe content (even if it's harmless). We can adjust the safety thresholds.
from google.generativeai.types import HarmCategory, HarmBlockThreshold
# Define safety settings to be more permissive (BLOCK_ONLY_HIGH)
safety_settings = {
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
}

response = model.generate_content(
    [video_file, "Describe the action in this video."],
    safety_settings=safety_settings
)
print(response.text)
Why this matters: Google's models are conservative by default. If your video contains mild swearing or action scenes, the default settings might block the response. Knowing how to adjust these dials gives you more control.
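If a response does come back blocked, the SDK raises an error when you read response.text. Here is a minimal diagnostic sketch, assuming the response object from the call above:
```python
# A minimal sketch (assumes `response` from the safety-settings call above).
# If the output was blocked, reading response.text raises a ValueError;
# prompt_feedback and finish_reason explain what triggered the block.
try:
    print(response.text)
except ValueError:
    print("Response was blocked by the safety filters.")
    print(response.prompt_feedback)        # reports if the prompt itself was blocked
    for candidate in response.candidates:
        print(candidate.finish_reason)      # e.g. SAFETY, if the output was filtered
```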
Now You Try
You have the basics. Now, let's push the boundaries.
The Image Comparator:
Instead of a video, find two different image files (e.g.,
cat.jpg and dog.jpg). Upload both using the File API (or pass them as PIL images if you look up the documentation). Pass both files in the list to generate_content and ask: "What are the differences between these two images?"
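If you want a starting point, here is a minimal sketch, assuming two local files named cat.jpg and dog.jpg and that your API key is already configured as in Step 2:
```python
# A minimal sketch for the image comparator (assumes cat.jpg and dog.jpg
# exist locally and genai.configure() has already been called).
# Small images can be passed as PIL objects without the File API.
import google.generativeai as genai
from PIL import Image

model = genai.GenerativeModel('gemini-1.5-flash')

response = model.generate_content([
    Image.open("cat.jpg"),
    Image.open("dog.jpg"),
    "What are the differences between these two images?"
])
print(response.text)
```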
The Podcast Summarizer:
Download a short MP3 audio clip (or record one). Upload it just like the video file. Ask Gemini to "Extract the key bullet points from this conversation and identify the speakers."
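One possible sketch, assuming a local file named podcast_clip.mp3 (the filename is just an example) and reusing the upload-and-poll pattern from Step 3:
```python
# A minimal sketch for the podcast summarizer (assumes podcast_clip.mp3
# exists locally; the same PROCESSING/FAILED polling from Step 3 applies).
import time
import google.generativeai as genai

audio_file = genai.upload_file(path="podcast_clip.mp3")
while audio_file.state.name == "PROCESSING":
    time.sleep(5)
    audio_file = genai.get_file(audio_file.name)

model = genai.GenerativeModel('gemini-1.5-flash')
response = model.generate_content([
    audio_file,
    "Extract the key bullet points from this conversation and identify the speakers."
])
print(response.text)
```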
Structured Data Extraction:
Take a photo of a receipt or an invoice. Upload it. Ask Gemini to output the data in JSON format:
{"store": "name", "total": 0.00, "date": "YYYY-MM-DD"}. This turns unstructured pixels into structured database data.
Challenge Project: The "Infinite" Librarian
One of Gemini's superpowers is its massive "Context Window." It can read hundreds of thousands of words in a single prompt.
The Goal: Feed an entire book into Gemini and ask specific questions about plot details that appear in different chapters.
Requirements:
Download a plain text book from Project Gutenberg (e.g., "Alice in Wonderland" or "Sherlock Holmes"). Save it as book.txt.
Read the file in Python using standard file I/O (with open...).
Upload the text file using the File API (just like the video) OR pass the text string directly if it's under the limit (Gemini Flash handles ~1 million tokens, which is roughly 700,000 words). Hint: The File API is safer for massive files.
Ask the model a complex question that requires understanding the beginning and the end of the book.
Example: "Trace the character development of [Character Name] from Chapter 1 to the final chapter. Give 3 specific examples."
Print the result.
Hints:
* Text files process much faster than video files.
* If you paste the text directly into the prompt string, a huge book can make the request unwieldy or hit size limits. Using genai.upload_file works for text files (text/plain) too!
* Gemini Flash is the model to use here. It is optimized for high-volume context.
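Here is a minimal starter sketch following those hints, assuming book.txt sits in your project folder (the question text is just an example):
```python
# A minimal starter sketch for the Infinite Librarian (assumes book.txt
# was downloaded from Project Gutenberg into the project folder).
import google.generativeai as genai

model = genai.GenerativeModel('gemini-1.5-flash')

# Option A: read the book and pass the text straight into the prompt.
with open("book.txt", "r", encoding="utf-8") as f:
    book_text = f.read()

response = model.generate_content([
    book_text,
    "Trace the character development of the protagonist from Chapter 1 "
    "to the final chapter. Give 3 specific examples."
])
print(response.text)

# Option B (safer for massive files): upload via the File API instead.
# book_file = genai.upload_file(path="book.txt", mime_type="text/plain")
# response = model.generate_content([book_file, "..."])
```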
What You Learned
Today you moved beyond simple text-in/text-out interactions.
* Google Generative AI SDK: You can now control Google's models programmatically.
* Multimodality: You learned that modern AI can "see" and "hear" without needing separate transcription tools.
* The File API: You learned the workflow of Upload -> Process -> Generate.
* Context Windows: You explored the idea of feeding massive amounts of data (entire books) into a prompt.
Why This Matters: In the real world, data rarely comes as clean text snippets. It comes as hour-long Zoom recordings, 50-page PDF contracts, and folders full of images. You now have the tool (Gemini) that can digest that raw information natively.
Tomorrow: We are leaving the walled gardens of Google and OpenAI. We are going to explore Open Source Models. We will look at Llama, Mistral, and how to run AI that you own and control.