Day 28 of 80

Embeddings & Semantic Search

Phase 3: LLM Landscape & APIs

What You'll Build Today

Welcome to Day 28! Today marks a massive shift in how you will interact with language models. Up until now, we have been treating LLMs like very smart chatbots: we send them text, and they send text back.

But there is a problem. What if you have a company handbook with 5,000 pages of PDF documents? You cannot paste 5,000 pages into ChatGPT's prompt box. It is too expensive, and it exceeds the context limit.

You need a way to find only the relevant paragraphs and send those to the LLM. But standard search (Ctrl+F) is terrible. If you search for "vacation policy" but the handbook calls it "time off guidelines," Ctrl+F fails.

Today, we are building a Semantic Search Tool. You will write code that converts text into lists of numbers (vectors) that represent its meaning.

Here is what you will learn:

* Embeddings: Why converting text into numbers allows computers to understand concepts, not just keywords.

* Vector Math (Simplified): How to mathematically calculate how similar two sentences are.

* Semantic Search: Building a search engine that knows "canine" and "dog" are related, even if they don't share any letters.

* The "Apple" Problem: Proving that the computer can tell the difference between Apple (the fruit) and Apple (the tech company) based on context.

Let's turn words into math.

The Problem

Let's look at why the "Ctrl+F" approach (keyword search) is broken for AI applications.

Imagine you are building a support bot for an internet provider. You have a list of common issues in your database. A user asks: "My internet is dead."

If you use standard Python string matching, look what happens:

```python
# A simple database of support tickets / FAQ titles
database = [
    "Wi-Fi signal is weak in the basement",
    "Billing: How to update credit card",
    "Router lights are not blinking",
    "Connection speed is very slow"
]

user_query = "My internet is dead"

# The "Old Way": Keyword Search
found_match = False

print(f"Searching for: '{user_query}'...")

for entry in database:
    # We check if words from the query appear in the database entry.
    # This is a naive keyword search.
    if "internet" in entry.lower() or "dead" in entry.lower():
        print(f"MATCH FOUND: {entry}")
        found_match = True

if not found_match:
    print("No relevant results found.")
```

The Output:

```
Searching for: 'My internet is dead'...
No relevant results found.
```

The Pain:

This is incredibly frustrating. As humans, we know that "Router lights are not blinking" is very relevant to "My internet is dead." But the computer doesn't know that. It looks for the string "internet" or "dead" and finds neither.

To fix this using the old way, you would have to write hundreds of if statements: if "dead" or "broken" or "down" or "offline", and so on. It is impossible to maintain.

We need a way to search by meaning, not by spelling.

Let's Build It

We are going to use OpenAI's Embedding API. This API doesn't generate text (like GPT-4); it generates a list of floating-point numbers called a vector.

When two pieces of text have similar meanings, their lists of numbers look mathematically similar.
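
To make that concrete before we call the API, here is a toy illustration with made-up three-number vectors (the numbers are invented for this example; real embeddings have far more dimensions):

```python
# Hypothetical 3-number "embeddings" -- invented values for illustration only
vec_dog    = [0.9, 0.1, 0.3]   # "dog"
vec_canine = [0.8, 0.2, 0.3]   # "canine": numbers land close to "dog"
vec_tax    = [0.1, 0.9, 0.7]   # "tax form": numbers land far from both
```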

Step 1: Setup and Libraries

We need openai for the API and a library called numpy to handle the math. If you haven't installed numpy yet, you will need to do so.

In your terminal:

```bash
pip install openai numpy
```

Now, create your file semantic_search.py.

```python
import os

import numpy as np
from openai import OpenAI

# Initialize the client.
# Make sure your OPENAI_API_KEY is set in your environment variables.
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

print("Libraries loaded and client ready.")
```

Step 2: Get Your First Embedding

Let's see what an embedding actually looks like. We will send the word "Apple" to the model text-embedding-3-small.

```python
def get_embedding(text):
    """
    Takes a string of text and returns a list of floats (the vector).
    """
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    # Extract the vector from the response
    return response.data[0].embedding

# Let's test it
word = "Apple"
vector = get_embedding(word)

print(f"Word: {word}")
print(f"Vector length: {len(vector)}")
print(f"First 10 numbers: {vector[:10]}")
```

Run this code. You will see a list of numbers like [-0.012, 0.045, ...].

* Why this matters: The model text-embedding-3-small turns any text into a list of 1,536 numbers. These numbers represent the "coordinates" of that word in a massive multi-dimensional concept space.
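
One practical aside: the embeddings endpoint also accepts a list of strings, so you can embed many texts in a single request instead of looping. A minimal sketch (the variable names are mine):

```python
# Embed several texts in one API call; response.data preserves input order
texts = ["Apple", "Fruit", "iPhone"]
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts,  # a list of strings instead of a single string
)
vectors = [item.embedding for item in response.data]
print(f"Got {len(vectors)} vectors of length {len(vectors[0])}")
```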

Step 3: The Math of Similarity (Cosine Similarity)

How do we compare two lists of 1,536 numbers? We calculate the "angle" between them.

* If the angle is 0, they are identical (Similarity = 1.0).

* If they point in opposite directions, they are opposites (Similarity = -1.0).

* If they are unrelated (90 degrees), the similarity is 0.

This is called Cosine Similarity. We will write a helper function using numpy to calculate this.

Add this function to your code:

```python
def cosine_similarity(a, b):
    """
    Calculates the cosine similarity between two vectors.
    Returns a score between -1 (opposite) and 1 (identical).
    """
    # Convert lists to numpy arrays for faster math
    vec_a = np.array(a)
    vec_b = np.array(b)

    # The formula: (A . B) / (|A| * |B|)
    dot_product = np.dot(vec_a, vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)

    return dot_product / (norm_a * norm_b)

# Test the math
vec1 = [1, 1, 1]
vec2 = [1, 1, 1]    # Identical
vec3 = [-1, -1, -1] # Opposite

print(f"Identical similarity: {cosine_similarity(vec1, vec2)}")
print(f"Opposite similarity: {cosine_similarity(vec1, vec3)}")
```

Step 4: The "Apple" Test

Now for the magic. We will compare the word "Apple" against "Fruit" and "iPhone".

Wait—"Apple" is ambiguous. Does the computer know which Apple we mean? Let's see if context changes the math.

print("\n--- The Apple Context Test ---")

# 1. Embed the word "Apple" alone

vec_apple = get_embedding("Apple")

# 2. Embed related concepts

vec_fruit = get_embedding("Fruit")

vec_iphone = get_embedding("iPhone")

vec_dog = get_embedding("Dog")

# 3. Compare

score_fruit = cosine_similarity(vec_apple, vec_fruit)

score_iphone = cosine_similarity(vec_apple, vec_iphone)

score_dog = cosine_similarity(vec_apple, vec_dog)

print(f"Similarity 'Apple' vs 'Fruit': {score_fruit:.4f}")

print(f"Similarity 'Apple' vs 'iPhone': {score_iphone:.4f}")

print(f"Similarity 'Apple' vs 'Dog': {score_dog:.4f}")

Run this. You will likely see that Apple is somewhat close to both Fruit and iPhone, but very far from Dog.

But now, let's try Contextual Sentences.

print("\n--- Contextual Sentence Test ---")

sentence_1 = "I like to eat apples"

sentence_2 = "I like to use my iPhone"

sentence_3 = "The company Apple just released a new product"

sentence_4 = "The fruit apple is red and tasty"

# Let's see which sentence is closer to "The company Apple..." (Sentence 3)

vec_1 = get_embedding(sentence_1)

vec_2 = get_embedding(sentence_2)

vec_3 = get_embedding(sentence_3) # This is our anchor

vec_4 = get_embedding(sentence_4)

print(f"Query: '{sentence_3}'")

print(f"Score vs '{sentence_1}' (Fruit context): {cosine_similarity(vec_3, vec_1):.4f}")

print(f"Score vs '{sentence_2}' (Tech context): {cosine_similarity(vec_3, vec_2):.4f}")

The Result: You should see that "The company Apple..." has a higher similarity score with "I like to use my iPhone" than with "I like to eat apples," even though the word "Apple" appears in the fruit sentence! The model understands the intent.
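
Since sentence_4 is already embedded, you can extend the test one more step (this extra line is my addition): compare the anchor against the fruit sentence that literally contains the word "apple".

```python
# Extra check: the company sentence vs. the fruit-apple sentence
print(f"Score vs '{sentence_4}' (Fruit context): {cosine_similarity(vec_3, vec_4):.4f}")
```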

Step 5: Fixing the Support Bot

Let's solve the problem from the beginning of the lesson. We will build a semantic search loop.

print("\n--- Semantic Search Engine ---")

database = [

"Wi-Fi signal is weak in the basement",

"Billing: How to update credit card",

"Router lights are not blinking",

"Connection speed is very slow"

]

user_query = "My internet is dead"

# 1. Embed the query

query_vec = get_embedding(user_query)

# 2. Search the database

best_score = -1

best_match = ""

print(f"Query: {user_query}")

for entry in database:

# Embed the database entry # (In production, you would do this once and save it, not every time!)

entry_vec = get_embedding(entry)

score = cosine_similarity(query_vec, entry_vec)

print(f" - '{entry}': {score:.4f}")

if score > best_score:

best_score = score

best_match = entry

print(f"\nWINNER: '{best_match}' with score {best_score:.4f}")

Run this.

Even though "My internet is dead" shares zero keywords with "Router lights are not blinking," the similarity score will be high (likely > 0.4 or 0.5), while "Billing" will be low.

You have just built a search engine that thinks like a human.
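
About the comment inside that loop: re-embedding the database on every query wastes time and money. A minimal sketch of the precompute-once pattern, assuming the database fits in memory (the helper name search is mine):

```python
# Precompute the database embeddings a single time...
database_vectors = [get_embedding(entry) for entry in database]

def search(query):
    # ...then each search only needs to embed the query itself
    query_vec = get_embedding(query)
    scores = [cosine_similarity(query_vec, vec) for vec in database_vectors]
    best_index = int(np.argmax(scores))
    return database[best_index], scores[best_index]

match, score = search("My internet is dead")
print(f"WINNER: '{match}' with score {score:.4f}")
```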

Now You Try

You have the basics. Now, experiment with these three extensions to solidify your knowledge.

• The Threshold Filter: Modify the search loop. Instead of just finding the best match, print all matches that have a similarity score above 0.4. Test it with a query that might apply to two items (e.g., "internet money" might match both billing and connection).

• Cross-Lingual Search: Embed the English sentence "Where is the bathroom?" Then create a database with Spanish ("Dónde está el baño") and French ("Où sont les toilettes"). Run the search. Does the English query match the Spanish text? (Spoiler: yes, embeddings capture meaning across languages!)

• The "Opposite" Game: Create a list of emotions: "Happy", "Sad", "Angry", "Excited". Embed "Tragedy". Write code to find the word with the lowest cosine similarity score (the most unrelated or opposite concept).

Challenge Project: The Relevance Sorter

Your challenge is to create a reusable function that takes a query and a list of documents, and returns the documents sorted from most relevant to least relevant.

Requirements:

• Define a function sort_by_relevance(query, documents).
• documents should be a list of strings.
• The function must embed the query and every document.
• It must calculate similarity scores for all of them.
• It must return a list of tuples: (score, document_text), sorted descending by score.
• Print the results nicely formatted.

Example Input:

```python
query = "What should I feed my cat?"

docs = [
    "Toyota Camrys are reliable cars.",
    "Felines enjoy eating tuna and chicken.",
    "Dogs are loyal companions.",
    "It is raining outside."
]
```

Expected Output:

```
Query: What should I feed my cat?
--------------------------------------------------
0.6521 - Felines enjoy eating tuna and chicken.
0.3412 - Dogs are loyal companions.
0.1205 - Toyota Camrys are reliable cars.
0.0912 - It is raining outside.
```

Hint: In Python, you can sort a list of tuples using `results.sort(key=lambda x: x[0], reverse=True)`.

What You Learned

Today you unlocked the ability to make applications that "understand" meaning.

* Embeddings turn text into vectors (lists of numbers).

* Cosine Similarity measures how close two meanings are.

* Semantic Search is superior to keyword search because it handles synonyms ("canine" vs "dog") and context ("Apple Inc." vs "apple pie").

Why This Matters:

In the real world, you cannot feed a 500-page manual into GPT-4. It is too big. Instead, you use the techniques from today to:

• Chunk the manual into paragraphs.
• Embed all paragraphs.
• When a user asks a question, search for the top 3 most similar paragraphs.
• Send only those 3 paragraphs to GPT-4 to generate the answer (see the sketch below).

This pattern is called RAG (Retrieval Augmented Generation), and it is the backbone of modern enterprise AI.
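
Here is a minimal sketch of that flow, reusing get_embedding and cosine_similarity from this lesson. The function name, the prompt wording, and the model choice are my own illustrative assumptions, not a canonical implementation:

```python
def answer_from_manual(question, paragraphs):
    # Steps 1-2: assume `paragraphs` is the manual already chunked; embed each chunk
    # (in practice you would precompute and store these vectors)
    paragraph_vecs = [get_embedding(p) for p in paragraphs]

    # Step 3: rank chunks by similarity to the question, keep the top 3
    q_vec = get_embedding(question)
    ranked = sorted(
        zip(paragraphs, paragraph_vecs),
        key=lambda pair: cosine_similarity(q_vec, pair[1]),
        reverse=True,
    )
    top_chunks = [p for p, _ in ranked[:3]]

    # Step 4: send only those chunks to the chat model as context
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable chat model works here
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {
                "role": "user",
                "content": "Context:\n" + "\n\n".join(top_chunks)
                + "\n\nQuestion: " + question,
            },
        ],
    )
    return response.choices[0].message.content
```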

Tomorrow: We will give our AI hands. You will learn Function Calling: how to let an LLM actually execute Python code, check the weather, or send emails, rather than just talking about it.