Technical Interview Prep
What You'll Build Today
You have spent the last 78 days building incredible things. You have built a semantic search engine, an autonomous agent, and a full-stack RAG application. You have the skills.
But today, we are building something different: your narrative.
In a technical interview, the best code in the world doesn't matter if you cannot explain how it works or why you made specific design choices. Today, we will construct the "source code" for your interview answers.
Here is what you will master today:
* The "Whiteboard" RAG Explanation: You will write a minimal, clean Python implementation of RAG that serves as the perfect mental model for answering "How does RAG work?"
* System Design Logic: You will build a Semantic Cache system. Why? Because interviewers love asking how you plan to reduce costs and latency for 1 million users.
* Evaluation Metrics: You will code a "hallucination checker." Why? Because the most common follow-up question is "How do you know your AI isn't lying?"
The Problem
Let's imagine you are in an interview for a Junior AI Engineer role. The interviewer leans back and asks a classic question:
"So, I see you built a document chat bot. Can you walk me through how RAG (Retrieval-Augmented Generation) actually works under the hood?"You get nervous. You know how it works—you built one! But your explanation comes out like this:
"Um, well, first you take the PDF and you, like, put it in the database.
But not the whole PDF, you have to cut it up.
And then you use OpenAI to turn words into numbers.
And then when the user asks a question, you find the numbers that match
the other numbers, and then you give that to the AI and it answers."
The interviewer nods politely but looks unimpressed.
Why did this fail?
This answer suggests you followed a tutorial but don't understand the engineering. It feels painful because you do know the engineering, but your words are betraying you.
Let's fix this. We are going to write code that represents the perfect interview answer, then use that code to structure our verbal explanation.
Let's Build It
We will build three code snippets. In an interview, you won't necessarily run this code, but you will describe exactly this logic. If you can write this from scratch, you can explain it to anyone.
Step 1: The "Perfect" RAG Explanation
When asked "How does RAG work?", you should mentally visualize two pipelines: Ingestion (preparing data) and Retrieval (answering questions).
Let's write a minimal, dependency-free simulation of this to ground your explanation.
The Code (Mental Model):

```python
import numpy as np

# A simple mock embedding function to simulate the concept
# In reality, this would be OpenAI or HuggingFace
def mock_embed(text):
    # Returns a random vector just for demonstration
    np.random.seed(len(text))
    return np.random.rand(5)

class SimpleRAG:
    def __init__(self):
        self.vector_db = []  # The knowledge base (vectors)
        self.chunks = []     # The actual text storage

    # PART 1: INGESTION PIPELINE
    # Explain this: "First, we ingest the data by chunking and embedding it."
    def ingest(self, text, chunk_size=20):
        # 1. Chunking
        words = text.split()
        current_chunks = [
            " ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)
        ]

        # 2. Embedding & Indexing
        for chunk in current_chunks:
            vector = mock_embed(chunk)
            self.vector_db.append(vector)
            self.chunks.append(chunk)
        print(f"Ingested {len(current_chunks)} chunks.")

    # PART 2: RETRIEVAL PIPELINE
    # Explain this: "Then, we retrieve relevant context using cosine similarity."
    def retrieve(self, query):
        # 1. Embed the query
        query_vector = mock_embed(query)

        # 2. Vector search (cosine similarity)
        scores = []
        for doc_vector in self.vector_db:
            dot_product = np.dot(query_vector, doc_vector)
            norm_a = np.linalg.norm(query_vector)
            norm_b = np.linalg.norm(doc_vector)
            similarity = dot_product / (norm_a * norm_b)
            scores.append(similarity)

        # 3. Get the top result
        best_idx = np.argmax(scores)
        return self.chunks[best_idx]

    # PART 3: GENERATION
    # Explain this: "Finally, we inject context into the prompt."
    def generate(self, query):
        context = self.retrieve(query)
        prompt = f"""
        System: You are a helpful assistant.
        Context: {context}
        User Question: {query}
        """
        return prompt


# Let's run the mental model
rag = SimpleRAG()
text_data = "RAG stands for Retrieval Augmented Generation. It combines a retriever with a generator. This helps reduce hallucinations."
rag.ingest(text_data, chunk_size=5)

final_prompt = rag.generate("What is RAG?")
print("--- Final Prompt Sent to LLM ---")
print(final_prompt)
```
Why this matters:
This code is your script. When answering, you literally describe these functions: "My system has an ingestion pipeline that chunks and embeds text, and a retrieval pipeline that performs cosine similarity search before prompting the LLM."
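When you describe this out loud, it also helps to know what replaces mock_embed in a real build. Here is a minimal sketch, assuming the sentence-transformers package and its all-MiniLM-L6-v2 model (any embedding provider, such as OpenAI, slots in the same way):

```python
# Sketch only: assumes `pip install sentence-transformers`.
# "all-MiniLM-L6-v2" is one common model choice, not a requirement.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def real_embed(text):
    # Returns a dense numpy vector representing the text's meaning
    return model.encode(text)
```

Swapping real_embed in for mock_embed leaves the rest of SimpleRAG untouched, which is the interview point worth making: the pipeline stays the same, only the embedding model changes.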
Step 2: System Design (Scaling to 1M Users)
The interviewer asks: "Your chatbot works great for one user. How would you design it to handle 1 million users without bankrupting us?"
The wrong answer is: "I would get a bigger server."
The right answer involves Caching. LLM calls are expensive and slow. If 1,000 people ask "What is the refund policy?", you should only call the LLM once.
Let's build a Semantic Cache.
```python
import numpy as np

class SemanticCache:
    def __init__(self):
        self.cache = {}        # Maps query text -> response
        self.threshold = 0.95  # How similar queries must be to count as a cache hit

    def get_embedding(self, text):
        # Simulating an embedding model again
        np.random.seed(len(text))
        return np.random.rand(5)

    def check_cache(self, user_query):
        query_vec = self.get_embedding(user_query)

        # Check against all cached queries (in production, use a vector DB for this)
        best_score = -1
        best_response = None
        for cached_query_text, response in self.cache.items():
            cached_vec = self.get_embedding(cached_query_text)

            # Calculate cosine similarity
            similarity = np.dot(query_vec, cached_vec) / (
                np.linalg.norm(query_vec) * np.linalg.norm(cached_vec)
            )
            if similarity > best_score:
                best_score = similarity
                best_response = response

        if best_score > self.threshold:
            return best_response, best_score
        return None, best_score

    def add_to_cache(self, user_query, llm_response):
        self.cache[user_query] = llm_response


# Simulation
system = SemanticCache()

# User 1 asks a question (expensive LLM call)
q1 = "How do I reset my password?"
print(f"User 1: {q1}")

# Simulate the LLM call
llm_response = "Go to settings and click 'Reset Password'."
system.add_to_cache(q1, llm_response)
print("-> Added to cache.\n")

# User 2 asks a SLIGHTLY different question
q2 = "How can I change my password?"
print(f"User 2: {q2}")

cached_resp, score = system.check_cache(q2)
if cached_resp:
    print(f"-> CACHE HIT! (Similarity: {score:.4f})")
    print(f"-> Returning: {cached_resp}")
else:
    print("-> Cache Miss. Calling LLM...")
```
Why this matters:
This demonstrates architectural maturity. You aren't just calling APIs; you are thinking about latency and cost. You can explain: "To scale to 1M users, I implemented semantic caching. If a user asks a question semantically similar to a previous one, we serve the cached response instantly, skipping the expensive LLM call."
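If the interviewer pushes further, note that check_cache above re-embeds every cached query on every lookup, which itself gets slow at scale. Here is a sketch of storing the embedding alongside the response so each cached query is embedded exactly once; the class and variable names are illustrative, not a fixed API:

```python
import numpy as np

class PrecomputedSemanticCache:
    """Stores (vector, response) pairs so cached queries are embedded only once."""

    def __init__(self, embed_fn, threshold=0.95):
        self.embed_fn = embed_fn   # any function: text -> vector
        self.entries = []          # list of (vector, response) pairs
        self.threshold = threshold

    def add(self, query, response):
        self.entries.append((self.embed_fn(query), response))

    def lookup(self, query):
        query_vec = self.embed_fn(query)
        best_score, best_response = -1.0, None
        for vec, response in self.entries:
            score = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
            if score > best_score:
                best_score, best_response = score, response
        return (best_response, best_score) if best_score > self.threshold else (None, best_score)
```

In production, the entries list would live in a vector database with an approximate-nearest-neighbour index, typically with a TTL so stale answers expire.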
Step 3: Handling Hallucinations
The interviewer asks: "How do you prevent the AI from making things up?"
You cannot prevent it 100%, but you can detect it. Let's build a simple "Grounding Check."
```python
def check_grounding(response, context):
    """
    A simple heuristic check. In production, use an LLM-as-a-Judge.
    Here, we check if key terms from the response exist in the context.
    """
    response_words = set(response.lower().replace('.', '').split())
    context_words = set(context.lower().replace('.', '').split())

    # Calculate overlap
    overlap = response_words.intersection(context_words)
    score = len(overlap) / len(response_words)
    return score


# Example
context = "The store is open from 9 AM to 5 PM on weekdays."
good_response = "We are open 9 AM to 5 PM."
bad_response = "We are open 24/7 on weekends."

score_good = check_grounding(good_response, context)
score_bad = check_grounding(bad_response, context)

print(f"Good Response Grounding Score: {score_good:.2f}")
print(f"Bad Response Grounding Score: {score_bad:.2f}")
```
Why this matters:
This gives you a concrete answer: "I implement a post-generation verification step. I compare the generated response against the retrieved context to calculate a 'fact overlap' score. If the score is too low, I don't show the answer to the user."
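The docstring above mentions LLM-as-a-Judge. If asked what that looks like, the idea is to have a second model verify that the answer is supported by the retrieved context. A rough sketch, assuming the official openai Python client; the model name and prompt wording are placeholders, not the only way to do this:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_judge_grounding(response, context):
    """Ask a second model to verify the answer against the retrieved context."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{
            "role": "user",
            "content": (
                "Context:\n" + context + "\n\n"
                "Answer:\n" + response + "\n\n"
                "Is every claim in the Answer supported by the Context? Reply YES or NO."
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")
```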
Now You Try
Take the code concepts above and extend them to cover three more interview "gotcha" questions.
1. The "Re-ranking" Question
Interviewers often ask: "Vector search sometimes retrieves irrelevant documents. How do you fix that?"
* Task: Modify the SimpleRAG class. After getting the top 5 results from vector search, implement a dummy rerank method that re-sorts them based on a secondary logic (e.g., length of text or presence of a specific keyword); a sketch follows below.
* Goal: Be able to explain "Cross-Encoder Re-ranking" (retrieving many documents cheaply, then sorting the best ones using a smarter model).
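One possible shape for that dummy re-ranker, sketched under the assumption that you boost chunks containing a keyword and prefer longer text; your secondary logic can differ:

```python
def rerank(candidates, query, boost_keyword=None):
    """Dummy re-ranker: prefer chunks containing a keyword, then longer chunks.
    In a real system, a cross-encoder model scoring (query, chunk) pairs together
    would replace this scoring function."""
    def score(chunk):
        keyword_bonus = 1 if boost_keyword and boost_keyword.lower() in chunk.lower() else 0
        return (keyword_bonus, len(chunk))
    return sorted(candidates, key=score, reverse=True)

# Usage: retrieve several chunks cheaply first, then re-sort only that short list
# top_chunks = rag.retrieve_top_k("What is RAG?", k=5)   # hypothetical top-k variant of retrieve()
# best = rerank(top_chunks, "What is RAG?", boost_keyword="retrieval")[0]
```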
2. The "Streaming" Question
Interviewers ask: "The bot takes 10 seconds to answer. Users are leaving. What do you do?"
* Task: Write a Python generator function (using yield) that simulates streaming a response word-by-word with a small time delay (time.sleep(0.1)); a sketch follows below.
* Goal: Demonstrate you understand how to improve User Experience (UX) by showing progress immediately, rather than waiting for the full generation.
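A minimal sketch of that generator; the word-by-word split stands in for real token streaming from an LLM API:

```python
import time

def stream_response(full_text, delay=0.1):
    """Yield a response word by word, simulating token streaming from an LLM."""
    for word in full_text.split():
        yield word + " "
        time.sleep(delay)  # simulate per-token generation latency

# Usage: the user sees output immediately instead of waiting for the full answer
for token in stream_response("RAG combines a retriever with a generator."):
    print(token, end="", flush=True)
print()
```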
3. The "Hybrid Search" Question
Interviewers ask: "Vector search fails on specific product names (like 'Model X-99'). Why?"
* Task: Write a search function that takes a query and checks for *exact keyword matches* first. If it finds an exact match, return that. If not, proceed to vector search. A sketch follows below.
* Goal: Explain that embeddings capture meaning, but sometimes we need *exact keywords* (BM25/Keyword Search).
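A rough sketch of that keyword-first fallback; the exact-match rule here is deliberately naive, and production systems usually blend BM25 and vector scores rather than choosing one or the other:

```python
def hybrid_search(query, rag, keywords_index):
    """Try exact keyword matches first; fall back to vector search otherwise.

    keywords_index is assumed to map exact terms (e.g. "Model X-99") to chunks.
    """
    for term, chunk in keywords_index.items():
        if term.lower() in query.lower():
            return chunk            # exact keyword hit, no embedding needed
    return rag.retrieve(query)      # semantic fallback via SimpleRAG

# Usage sketch (the dict stands in for a real BM25/keyword engine)
# keywords_index = {"Model X-99": "The Model X-99 supports fast charging..."}
# context = hybrid_search("Does the Model X-99 support fast charging?", rag, keywords_index)
```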
Challenge Project: The Mock Interview Simulation
Your challenge today is not just to write code, but to perform. You will simulate a technical interview.
Requirements:
* The Problem: What pain point did you solve?
* The Solution: What did you build (high-level)?
* The Tech Stack: Specifically mention Python, LangChain/LlamaIndex, Vector DB, etc.
* The Hardest Part: One specific technical challenge you overcame.
* If you have a study partner, present to them.
* If solo, record a voice memo on your phone.
* Write a small code snippet (pseudo-code is fine) demonstrating a "fallback mechanism" or a "router" that detects unrelated queries (a starting sketch follows below).
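A starting point for that router, sketched as an embedding-similarity check against known on-topic questions; the example questions and threshold are placeholders you would tune:

```python
import numpy as np

def route_query(query, on_topic_examples, embed_fn, threshold=0.5):
    """Return 'answer' if the query resembles known on-topic questions,
    otherwise 'fallback' (e.g. a polite refusal or a human handoff)."""
    query_vec = embed_fn(query)
    best = -1.0
    for example in on_topic_examples:
        vec = embed_fn(example)
        best = max(best, np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
    return "answer" if best >= threshold else "fallback"

# Usage sketch: reuse any embedding function from earlier (mock_embed works for a demo)
# decision = route_query("What's the weather?", ["How do I reset my password?"], mock_embed)
```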
Example "Capstone Pitch" Structure:> "I built an autonomous research agent because I found manual Googling inefficient. My solution uses Python and OpenAI to recursively search the web. The tech stack includes ChromaDB for memory and Streamlit for the UI. The hardest challenge was the agent getting stuck in loops; I solved this by implementing a 'thought history' check that penalizes repetitive actions."
What You Learned
Today, you translated your coding skills into communication skills.
* RAG Deep Dive: You learned to explain RAG not as magic, but as a pipeline of Chunking, Embedding, Retrieval, and Generation.
* System Design: You learned that scaling AI isn't about bigger models, but about architecture (Caching, Latency, Hybrid Search).
* Evaluation: You learned that "it looks good" isn't a metric—Grounding and Hallucination detection are.
Why This Matters:
You can be the best coder in the room, but if you can't articulate why you wrote the code, you won't get the job. These mental models act as your anchor during high-pressure interviews.
Tomorrow: It all comes down to this. We take your resume, your portfolio, and your new interview skills, and we launch your career.