Caching & Cost Optimization
What You'll Build Today
Imagine you are running a customer support chatbot. A user asks, "What is your return policy?" Your AI calls OpenAI, processes the request, costs you a fraction of a cent, and takes 2 seconds to generate the answer.
Ten seconds later, another user asks, "Can I return an item?"
Without caching, your AI treats this as a brand new mystery. It calls OpenAI again, costs you money again, and makes the user wait again.
Today, you are going to build a Semantic Caching System. You will give your AI a "memory" so that if it has answered a similar question before, it instantly returns the saved answer for free.
Here is what you will learn:
* Semantic Caching: Why checking for exact text matches isn't enough (because "return policy" and "how to return" mean the same thing but look different).
* Vector Similarity: How to mathematically prove two questions are similar enough to share an answer.
* Cost & Latency Optimization: How to calculate exactly how much money and time you are saving.
* Cache Invalidation: Knowing when a saved answer is too old and needs to be refreshed.
Let's stop burning money on repeated questions.
The Problem
First, let's look at the naive approach. We are going to simulate a scenario where users ask similar questions, and we blindly send every single one to the LLM.
For this code, we will need the openai and numpy packages (pip install openai numpy).
import time
import os
from openai import OpenAI
# Ensure your API key is set in your environment variables
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# A list of questions users might ask
user_queries = [
    "What is the capital of France?",
    "Tell me the capital of France.",  # Same intent, different wording
    "What is the capital of France?",  # Exact duplicate
    "France capital city name?",       # Same intent, very different wording
]
def ask_llm(question):
    print(f"--> Calling OpenAI API for: '{question}'...")
    start_time = time.time()

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        temperature=0
    )

    end_time = time.time()
    latency = end_time - start_time
    answer = response.choices[0].message.content

    # Approximate cost calculation (simplified for gpt-4o-mini):
    # a flat per-call estimate; a short Q&A like this costs a tiny fraction of a cent.
    cost = 0.00005

    return answer, latency, cost
print("--- STARTING NAIVE APPROACH ---")
total_latency = 0
total_cost = 0
for query in user_queries:
    answer, latency, cost = ask_llm(query)
    total_latency += latency
    total_cost += cost
    print(f"  Time: {latency:.2f}s | Cost: ${cost:.5f}")
print(f"\nTotal Time Wasted: {total_latency:.2f}s")
print(f"Total Money Spent: ${total_cost:.5f}")
The Pain Points
Run that code and watch the console: four separate API calls, four separate charges, and four waits of a second or two for what is essentially the same answer.
There has to be a way to intercept the question, check if we know the answer, and skip the API call.
Let's Build It
We will build a Semantic Cache.
A normal cache (like a dictionary) only works if the keys are identical.
* Query A: "Hello"
* Query B: "Hello" -> Match!
* Query C: "Hi" -> No Match.
A Semantic Cache uses embeddings. It converts the question into numbers (vectors) and checks if the new question is mathematically close to a stored question.
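To make the gap concrete, here is a tiny throwaway sketch (not part of today's project code) showing an exact-match dictionary cache failing on a paraphrase:

# An exact-match cache: a plain dictionary keyed by the raw question string.
exact_cache = {"What is the capital of France?": "Paris is the capital of France."}

print(exact_cache.get("What is the capital of France?"))  # Hit: returns the stored answer
print(exact_cache.get("Tell me the capital of France."))  # Miss: returns None, even though the intent is identical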
Step 1: Setting up the Vector Tool
We need a way to calculate similarity. In a large production app, you would use a vector-capable database like Redis for this, because it can store and search these vectors extremely quickly.
To keep our code runnable today without installing a database server, we will build a lightweight, in-memory version of what Redis does using Python's numpy library.
import numpy as np
def get_embedding(text):
    """Generates a vector embedding for a string."""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

def cosine_similarity(a, b):
    """Calculates how similar two vectors are (higher = more similar)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
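As a quick sanity check (assuming your OPENAI_API_KEY is set; the exact scores will vary), you can compare a paraphrase against an unrelated question using the two functions above:

# Two phrasings of the same question should score much higher than an unrelated one.
vec_a = get_embedding("What is the capital of France?")
vec_b = get_embedding("Tell me the capital of France.")
vec_c = get_embedding("How do I make a cake?")

print(f"Paraphrase similarity: {cosine_similarity(vec_a, vec_b):.4f}")  # Typically well above 0.8
print(f"Unrelated similarity:  {cosine_similarity(vec_a, vec_c):.4f}")  # Typically much lower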
Step 2: Creating the Cache Class
This class will act as our middleman. Before calling the LLM, we ask the Cache: "Have we seen anything like this before?"
class SemanticCache:
    def __init__(self, threshold=0.85):
        # This list will act as our database.
        # Each item is a dict: {'embedding': vector, 'answer': str, 'original_query': str}
        self.cache = []
        self.threshold = threshold  # How similar must it be? (0.0 to 1.0)

    def search(self, query_text):
        """
        1. Embed the new query.
        2. Compare with all stored queries.
        3. If similarity > threshold, return cached answer.
        """
        query_vector = get_embedding(query_text)

        best_score = -1
        best_entry = None
        for entry in self.cache:
            score = cosine_similarity(query_vector, entry['embedding'])
            if score > best_score:
                best_score = score
                best_entry = entry

        # Check if the best match is good enough
        if best_score >= self.threshold:
            print(f"  [CACHE HIT] Found similar query: '{best_entry['original_query']}' (Score: {best_score:.4f})")
            return best_entry['answer']

        print(f"  [CACHE MISS] Best match was only {best_score:.4f}. Calling API...")
        return None

    def add(self, query_text, answer_text):
        """Save a new query and its answer to the cache."""
        vector = get_embedding(query_text)
        self.cache.append({
            'embedding': vector,
            'answer': answer_text,
            'original_query': query_text
        })
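Before wiring it into the full workflow, you can smoke-test the class on its own. This is just a sanity check, not part of the main pipeline, and the threshold of 0.7 is an assumption; whether the second search hits depends on the actual similarity score:

# Quick smoke test using the example from the intro.
test_cache = SemanticCache(threshold=0.7)
print(test_cache.search("What is your return policy?"))   # Cache is empty, so this is a MISS (returns None)
test_cache.add("What is your return policy?", "You can return any item within 30 days.")
print(test_cache.search("Can I return an item?"))          # Should be a HIT if the similarity clears 0.7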
Step 3: The Optimized Workflow
Now, let's write the smart function that combines the Cache and the LLM.
# Initialize our cache system
my_cache = SemanticCache(threshold=0.9) # 0.9 is strict, 0.7 is loose
def smart_ask(question):
    start_time = time.time()

    # 1. Check Cache
    cached_answer = my_cache.search(question)
    if cached_answer:
        # We found it! Return immediately.
        end_time = time.time()
        return cached_answer, end_time - start_time, 0.0  # $0 LLM cost!

    # 2. If not in cache, call LLM
    answer, latency, cost = ask_llm(question)

    # 3. Save to cache for next time
    my_cache.add(question, answer)

    # Elapsed time since start_time already includes the embedding look-ups plus the LLM call
    total_time = time.time() - start_time
    return answer, total_time, cost
Step 4: Testing the Solution
Let's run the exact same queries from "The Problem" section and see the difference.
print("\n--- STARTING SMART CACHING APPROACH ---")
smart_latency = 0
smart_cost = 0
queries_to_test = [
    "What is the capital of France?",   # Should be MISS (first time)
    "Tell me the capital of France.",   # Should be HIT (semantic match)
    "What is the capital of France?",   # Should be HIT (exact match)
    "France capital city name?",        # Should be HIT (semantic match)
    "How do I make a cake?"             # Should be MISS (totally new topic)
]

for query in queries_to_test:
    print(f"\nUser asks: '{query}'")
    answer, latency, cost = smart_ask(query)
    smart_latency += latency
    smart_cost += cost
    print(f"  Response: {answer[:50]}...")  # Print first 50 chars
    print(f"  Time: {latency:.2f}s | Cost: ${cost:.5f}")
print(f"\nTotal Time: {smart_latency:.2f}s")
print(f"Total Money: ${smart_cost:.5f}")
The Result
You should see a dramatic difference in your output.
Of the four France questions, only the first one reached the API; the other three were cache hits, returned almost instantly, and cost nothing beyond a cheap embedding call. That is a 75% cost reduction on that cluster. The cake question was a legitimate miss, because it is a genuinely new topic.
Now You Try
You have a working semantic cache. Now, let's make it production-ready.
1. Tune the Threshold
In the code above, we set threshold=0.9.
* Change it to 0.99 and run the queries again. Does "France capital city name?" still hit the cache? (Likely not; 0.99 is too strict.)
* Change it to 0.6 and ask "What is the capital of Spain?" after asking about France. Does it incorrectly return Paris? (Likely yes; 0.6 is too loose.)
* Find the "sweet spot" for your specific questions. A small tuning harness is sketched below.
2. Implement "Cache Eviction" (Limit Size)
Our current self.cache list grows forever. In a real app, that will eventually exhaust your server's memory.
Modify the add method in SemanticCache: if len(self.cache) > 5, remove the oldest item (index 0) before adding a new one. This is a basic first-in, first-out queue.
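One minimal way the new add method could look (a sketch; the cap of 5 comes from the exercise and is otherwise arbitrary):

    def add(self, query_text, answer_text):
        """Save a new query and its answer, evicting the oldest entry when the cache is full."""
        if len(self.cache) > 5:      # Cap from the exercise; pick whatever limit fits your memory budget
            self.cache.pop(0)        # Drop the oldest entry (index 0): first in, first out
        vector = get_embedding(query_text)
        self.cache.append({
            'embedding': vector,
            'answer': answer_text,
            'original_query': query_text
        })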
3. Return Cache Stats
Modify the SemanticCache class to keep track of self.hits and self.misses. Add a method get_stats() that prints the following (one possible sketch appears after this list):
* Total Queries
* Hit Rate % (Hits / Total)
* Estimated Savings (Hits × $0.00005)
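One possible shape for get_stats (a sketch; it assumes you also initialize self.hits = 0 and self.misses = 0 in __init__ and increment them inside search):

    def get_stats(self):
        """Print hit/miss statistics and a rough savings estimate."""
        total = self.hits + self.misses
        hit_rate = (self.hits / total * 100) if total else 0.0
        print(f"Total Queries: {total}")
        print(f"Hit Rate: {hit_rate:.1f}%")
        print(f"Estimated Savings: ${self.hits * 0.00005:.5f}")  # Reuses the flat per-call cost estimate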
Challenge Project
Objective: Simulate a high-volume traffic scenario and generate a "Savings Report."
Requirements: send 20 queries through your system:
* 5 asking about Python (varied wording).
* 5 asking about SQL (varied wording).
* 5 asking about JavaScript (varied wording).
* 5 completely random questions.
Route every query through your smart_ask system, letting the SemanticCache decide which ones count as hits. At the end, print a report in this format:
--- FINAL REPORT ---
Total Queries: 20
Cache Hits: 14
Cache Misses: 6
Hit Rate: 70%
Cost without Cache: $0.00100
Actual Cost: $0.00030
Money Saved: $0.00070
Time Saved: ~21.5 seconds
Hint: You don't need to actually call the OpenAI LLM for the "Answer" part if you want to save your own credits while testing. You can mock the ask_llm function to return a dummy string like "Here is the answer from OpenAI..." and a fake cost, but you must use the real get_embedding function to test the caching logic.
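If you do mock it, something like this sketch keeps the rest of the pipeline unchanged (the dummy answer, the sleep, and the flat cost are all made up for testing):

import random

def ask_llm(question):
    """Stand-in for the real LLM call: returns a dummy answer, a fake latency, and a flat cost."""
    time.sleep(0.1)                           # Small real pause so the timings are not all zero
    fake_latency = random.uniform(1.0, 2.0)   # Pretend the API took 1-2 seconds
    answer = f"Here is the answer from OpenAI about: {question}"
    return answer, fake_latency, 0.00005      # Same flat per-call estimate used earlier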
Common Mistakes
1. The "Everything Looks Similar" Bug* Mistake: Setting the similarity threshold too low (e.g., 0.5).
* Result: The user asks "What is a cat?" and gets an answer about a "Car" because the words share letters or general context.
* Fix: For semantic search, thresholds usually need to be high (0.80 - 0.95). Always test your "false positives."
2. Caching Personal Data
* Mistake: Caching a query like "What is my balance?" -> Answer: "$500".
* Result: The next user asks "What is my balance?" and gets the previous user's balance, because the cache matched the question.
* Fix: Never cache user-specific data globally. Cache general facts only, or include the User ID in the cache key.
3. Ignoring "Freshness" (Stale Data)* Mistake: Caching "What is the stock price of Apple?"
* Result: The user gets yesterday's price instantly. Fast, but wrong.
* Fix: Implement a "Time To Live" (TTL). If the cache entry is older than 5 minutes, ignore it and fetch fresh data.
4. Not Using a Real Database in Production
* Mistake: Using a Python list (like we did today) for 1 million users.
* Result: Every lookup scans every stored vector, your server slows down, and eventually it runs out of RAM.
* Fix: Today we used a list for learning. In the real world, use a vector store such as Redis, Pinecone, or ChromaDB; they are built to handle millions of vectors efficiently.
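To connect this back to mistake #3, here is a rough sketch of a TTL check (the created_at field and the 5-minute window are assumptions, not part of today's class):

import time

CACHE_TTL_SECONDS = 300  # Arbitrary: treat anything older than 5 minutes as stale

def is_fresh(entry):
    """Return True if a cache entry (carrying a 'created_at' timestamp) is still within its time-to-live."""
    return (time.time() - entry.get('created_at', 0)) < CACHE_TTL_SECONDS

# To use it: store 'created_at': time.time() in SemanticCache.add,
# and have SemanticCache.search skip any entry where not is_fresh(entry).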
Quick Quiz
Q1: Why is "Semantic Caching" better than a standard dictionary look-up for LLM queries?a) It is faster to execute.
b) It uses less memory.
c) It can identify questions that have the same meaning but different wording.
d) It doesn't require an API key.
Answer: c
Q2: What happens if your similarity threshold is set too high (e.g., 0.999)?
a) You will get too many wrong answers.
b) You will almost never hit the cache, saving no money.
c) The system will crash.
d) The embeddings become invalid.
Answer: b
Q3: Which of these queries is SAFE to cache globally for all users?
a) "What is the current time?"
b) "Who is the President of the US?"
c) "What is my credit card number?"
d) "What is the weather at my current GPS location?"
Answer: b
What You Learned
Today you moved from a "naive" API consumer to a "smart" system architect.
* Semantic Caching: You learned that caching isn't just about exact text matches; it's about matching intent using vectors.
* Cost Optimization: You saw firsthand how a 70-80% hit rate translates directly to 70-80% lower bills.
* Latency: You realized that the best way to speed up an AI application is to avoid calling the AI at all when possible.
Why This Matters: In a real enterprise application, you pay per token. If 10,000 employees ask "How do I reset my password?" every month, semantic caching can turn a $500 monthly bill into a $5 one. It transforms your application from a cool demo into a viable business product.
Tomorrow: We tackle the wait time. Even with caching, sometimes you must call the LLM, and waiting 5 seconds for a response feels like an eternity. Tomorrow, we learn Streaming, so your users see the answer typing out in real-time.