Embeddings & Semantic Search
What You'll Build Today
Welcome to Day 28! Today marks a massive shift in how you will interact with language models. Up until now, we have been treating LLMs like very smart chatbots: we send them text, and they send text back.
But there is a problem. What if you have a company handbook with 5,000 pages of PDF documents? You cannot paste 5,000 pages into ChatGPT's prompt box. It is too expensive, and it exceeds the context limit.
You need a way to find only the relevant paragraphs and send those to the LLM. But standard search (Ctrl+F) is terrible. If you search for "vacation policy" but the handbook calls it "time off guidelines," Ctrl+F fails.
Today, we are building a Semantic Search Tool. You will write code that converts text into lists of numbers (vectors) that capture its meaning.
Here is what you will learn:
* Embeddings: Why converting text into numbers allows computers to understand concepts, not just keywords.
* Vector Math (Simplified): How to mathematically calculate how similar two sentences are.
* Semantic Search: Building a search engine that knows "canine" and "dog" are related, even if they don't share any letters.
* The "Apple" Problem: Proving that the computer can tell the difference between Apple (the fruit) and Apple (the tech company) based on context.
Let's turn words into math.
The Problem
Let's look at why the "Ctrl+F" approach (keyword search) is broken for AI applications.
Imagine you are building a support bot for an internet provider. You have a list of common issues in your database. A user asks: "My internet is dead."
If you use standard Python string matching, look what happens:
# A simple database of support tickets / FAQ titles
database = [
"Wi-Fi signal is weak in the basement",
"Billing: How to update credit card",
"Router lights are not blinking",
"Connection speed is very slow"
]
user_query = "My internet is dead"
# The "Old Way": Keyword Search
found_match = False
print(f"Searching for: '{user_query}'...")
for entry in database:
# We check if words from the query appear in the database entry
# This is a naive keyword search
if "internet" in entry.lower() or "dead" in entry.lower():
print(f"MATCH FOUND: {entry}")
found_match = True
if not found_match:
print("No relevant results found.")
The Output:
Searching for: 'My internet is dead'...
No relevant results found.
The Pain:
This is incredibly frustrating. As humans, we know that "Router lights are not blinking" is very relevant to "My internet is dead." But the computer doesn't know that. It looks for the string "internet" or "dead" and finds neither.
To fix this using the old way, you would have to write hundreds of if statements: if "dead" or "broken" or "down" or "offline".... It is impossible to maintain.
We need a way to search by meaning, not by spelling.
Let's Build It
We are going to use OpenAI's Embedding API. This API doesn't generate text (like GPT-4); it generates a list of floating-point numbers called a vector.
When two pieces of text have similar meanings, their lists of numbers look mathematically similar.
Step 1: Setup and Libraries
We need openai for the API and a library called numpy to handle the math. If you haven't installed numpy yet, you will need to do so.
```bash
pip install openai numpy
```
Now, create your file semantic_search.py.
import os
import numpy as np
from openai import OpenAI
# Initialize the client
# Make sure your OPENAI_API_KEY is set in your environment variables
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
print("Libraries loaded and client ready.")
Step 2: Get Your First Embedding
Let's see what an embedding actually looks like. We will send the word "Apple" to the model text-embedding-3-small.
def get_embedding(text):
"""
Takes a string of text and returns a list of floats (the vector).
"""
response = client.embeddings.create(
model="text-embedding-3-small",
input=text
)
# Extract the vector from the response
return response.data[0].embedding
# Let's test it
word = "Apple"
vector = get_embedding(word)
print(f"Word: {word}")
print(f"Vector length: {len(vector)}")
print(f"First 10 numbers: {vector[:10]}")
Run this code. You will see a list of numbers like [-0.012, 0.045, ...].
* Why this matters: The model text-embedding-3-small turns any text into a list of 1,536 numbers. These numbers represent the "coordinates" of that word in a massive multi-dimensional concept space.
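A quick tip before we move on: the embeddings endpoint also accepts a list of strings in a single call, which saves round trips when you have many texts to embed. Here is a small sketch; the embed_batch helper name is our own invention for illustration, not part of the API:

```python
def embed_batch(texts):
    """
    Embeds a list of strings in one API call.
    Returns one vector per input, in the same order.
    """
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts  # a list of strings is accepted here, not just a single string
    )
    return [item.embedding for item in response.data]

# Example: three words, one request
vectors = embed_batch(["Apple", "Fruit", "iPhone"])
print(f"Got {len(vectors)} vectors, each {len(vectors[0])} numbers long")
```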
Step 3: The Math of Similarity (Cosine Similarity)
How do we compare two lists of 1,536 numbers? We calculate the "angle" between them.
* If the angle is 0, they are identical (Similarity = 1.0).
* If they point in opposite directions, they are opposites (Similarity = -1.0).
* If they are unrelated (90 degrees), the similarity is 0.
This is called Cosine Similarity. We will write a helper function using numpy to calculate this.
Add this function to your code:
def cosine_similarity(a, b):
"""
Calculates the cosine similarity between two vectors.
Returns a score between -1 (opposite) and 1 (identical).
"""
# Convert lists to numpy arrays for faster math
vec_a = np.array(a)
vec_b = np.array(b)
# The formula: (A . B) / (|A| * |B|)
dot_product = np.dot(vec_a, vec_b)
norm_a = np.linalg.norm(vec_a)
norm_b = np.linalg.norm(vec_b)
return dot_product / (norm_a * norm_b)
# Test the math
vec1 = [1, 1, 1]
vec2 = [1, 1, 1] # Identical
vec3 = [-1, -1, -1] # Opposite
print(f"Identical similarity: {cosine_similarity(vec1, vec2)}")
print(f"Opposite similarity: {cosine_similarity(vec1, vec3)}")
Step 4: The "Apple" Test
Now for the magic. We will compare the word "Apple" against "Fruit" and "iPhone".
Wait—"Apple" is ambiguous. Does the computer know which Apple we mean? Let's see if context changes the math.
print("\n--- The Apple Context Test ---")
# 1. Embed the word "Apple" alone
vec_apple = get_embedding("Apple")
# 2. Embed related concepts
vec_fruit = get_embedding("Fruit")
vec_iphone = get_embedding("iPhone")
vec_dog = get_embedding("Dog")
# 3. Compare
score_fruit = cosine_similarity(vec_apple, vec_fruit)
score_iphone = cosine_similarity(vec_apple, vec_iphone)
score_dog = cosine_similarity(vec_apple, vec_dog)
print(f"Similarity 'Apple' vs 'Fruit': {score_fruit:.4f}")
print(f"Similarity 'Apple' vs 'iPhone': {score_iphone:.4f}")
print(f"Similarity 'Apple' vs 'Dog': {score_dog:.4f}")
Run this. You will likely see that Apple is somewhat close to both Fruit and iPhone, but very far from Dog.
But now, let's try Contextual Sentences.
print("\n--- Contextual Sentence Test ---")
sentence_1 = "I like to eat apples"
sentence_2 = "I like to use my iPhone"
sentence_3 = "The company Apple just released a new product"
sentence_4 = "The fruit apple is red and tasty"
# Let's see which sentence is closer to "The company Apple..." (Sentence 3)
vec_1 = get_embedding(sentence_1)
vec_2 = get_embedding(sentence_2)
vec_3 = get_embedding(sentence_3) # This is our anchor
vec_4 = get_embedding(sentence_4)
print(f"Query: '{sentence_3}'")
print(f"Score vs '{sentence_1}' (Fruit context): {cosine_similarity(vec_3, vec_1):.4f}")
print(f"Score vs '{sentence_2}' (Tech context): {cosine_similarity(vec_3, vec_2):.4f}")
The Result: You should see that "The company Apple..." has a higher similarity score with "I like to use my iPhone" than with "I like to eat apples," even though the word "Apple" appears in the fruit sentence! The model understands the intent.
Step 5: Fixing the Support Bot
Let's solve the problem from the beginning of the lesson. We will build a semantic search loop.
print("\n--- Semantic Search Engine ---")
database = [
"Wi-Fi signal is weak in the basement",
"Billing: How to update credit card",
"Router lights are not blinking",
"Connection speed is very slow"
]
user_query = "My internet is dead"
# 1. Embed the query
query_vec = get_embedding(user_query)
# 2. Search the database
best_score = -1
best_match = ""
print(f"Query: {user_query}")
for entry in database:
# Embed the database entry
# (In production, you would do this once and save it, not every time!)
entry_vec = get_embedding(entry)
score = cosine_similarity(query_vec, entry_vec)
print(f" - '{entry}': {score:.4f}")
if score > best_score:
best_score = score
best_match = entry
print(f"\nWINNER: '{best_match}' with score {best_score:.4f}")
Run this.
Even though "My internet is dead" shares zero keywords with "Router lights are not blinking," the similarity score will be high (likely > 0.4 or 0.5), while "Billing" will be low.
You have just built a search engine that thinks like a human.
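One practical note before you experiment: as the comment in the loop points out, re-embedding every database entry on each search is slow and wastes API calls. A common improvement (sketched here, with helper names of our own choosing) is to embed the database once, keep the vectors, and only embed the query at search time:

```python
# Embed every database entry ONCE and reuse the vectors for all future queries.
# In a real app you would save these to disk or a vector database.
database_vectors = [get_embedding(entry) for entry in database]

def search_cached(query):
    """Embeds only the query, then compares it against the precomputed vectors."""
    query_vec = get_embedding(query)
    best_score, best_match = -1, ""
    for entry, entry_vec in zip(database, database_vectors):
        score = cosine_similarity(query_vec, entry_vec)
        if score > best_score:
            best_score, best_match = score, entry
    return best_match, best_score

match, score = search_cached("My internet is dead")
print(f"Cached search winner: '{match}' ({score:.4f})")
```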
Now You Try
You have the basics. Now, experiment with these three extensions to solidify your knowledge.
The Threshold Filter:
Modify the search loop. Instead of just finding the best match, print all matches that have a similarity score above 0.4. Test it with a query that might apply to two items (e.g., "internet money" might match both billing and connection).
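If you get stuck, here is one possible shape for that loop (reusing query_vec and database from Step 5; the 0.4 threshold is just a starting point to tweak):

```python
THRESHOLD = 0.4
matches = []
for entry in database:
    score = cosine_similarity(query_vec, get_embedding(entry))
    if score >= THRESHOLD:
        matches.append((score, entry))

# Print everything that cleared the bar, not just the single best match
for score, entry in matches:
    print(f"{score:.4f} - {entry}")
if not matches:
    print("No results above the threshold.")
```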
Cross-Lingual Search:
Embed the English sentence "Where is the bathroom?"
Then, create a database with Spanish ("Dónde está el baño") and French ("Où sont les toilettes").
Run the search. Does the English query match the Spanish text? (Spoiler: Yes, embeddings capture meaning across languages!)
The "Opposite" Game:
Create a list of emotions: "Happy", "Sad", "Angry", "Excited".
Embed "Tragedy".
Write code to find the word with the lowest cosine similarity score (the most unrelated or opposite concept).
Challenge Project: The Relevance Sorter
Your challenge is to create a reusable function that takes a query and a list of documents, and returns the documents sorted from most relevant to least relevant.
Requirements:
* Define a function sort_by_relevance(query, documents).
* documents should be a list of strings.
* The function must embed the query and every document.
* It must calculate similarity scores for all of them.
* It must return a list of tuples: (score, document_text), sorted descending by score.
* Print the results nicely formatted.
Example Input:
query = "What should I feed my cat?"
docs = [
"Toyota Camrys are reliable cars.",
"Felines enjoy eating tuna and chicken.",
"Dogs are loyal companions.",
"It is raining outside."
]
Expected Output:
Query: What should I feed my cat?
--------------------------------------------------
0.6521 - Felines enjoy eating tuna and chicken.
0.3412 - Dogs are loyal companions.
0.1205 - Toyota Camrys are reliable cars.
0.0912 - It is raining outside.
Hint: In Python, you can sort a list of tuples using `results.sort(key=lambda x: x[0], reverse=True)`.
What You Learned
Today you unlocked the ability to make applications that "understand" meaning.
* Embeddings turn text into vectors (lists of numbers).
* Cosine Similarity measures how close two meanings are.
* Semantic Search is superior to keyword search because it handles synonyms ("canine" vs "dog") and context ("Apple inc" vs "Apple pie").
Why This Matters: In the real world, you cannot feed a 500-page manual into GPT-4. It is too big. Instead, you use the techniques from today to:
* Split the manual into small chunks and embed each chunk once.
* Embed the user's question and find the chunks with the highest similarity scores.
* Send only those few relevant chunks to the LLM, along with the question.
This pattern is called RAG (Retrieval Augmented Generation), and it is the backbone of modern enterprise AI.
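To make that concrete, here is a rough sketch of the full loop, reusing today's get_embedding and cosine_similarity helpers. The model name, prompt wording, and top_k value are placeholders to adapt, not a fixed recipe:

```python
def answer_with_rag(question, documents, top_k=2):
    """
    Retrieve: find the documents most similar to the question.
    Augment:  paste those documents into the prompt as context.
    Generate: let a chat model answer using only that context.
    """
    question_vec = get_embedding(question)
    scored = [(cosine_similarity(question_vec, get_embedding(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    context = "\n".join(doc for _, doc in scored[:top_k])

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any chat model works here
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```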
Tomorrow: We will give our AI hands. You will learn Function Calling—how to let an LLM actually execute Python code, check the weather, or send emails, rather than just talking about it.