Day 79 of 80

Technical Interview Prep

Phase 9: Capstone & Career

What You'll Build Today

You have spent the last 78 days building incredible things. You have built a semantic search engine, an autonomous agent, and a full-stack RAG application. You have the skills.

But today, we build something different: We are going to build your narrative.

In a technical interview, the best code in the world doesn't matter if you cannot explain how it works or why you made specific design choices. Today, we will construct the "source code" for your interview answers.

Here is what you will master today:

* The "Whiteboard" RAG Explanation: You will write a minimal, clean Python implementation of RAG that serves as the perfect mental model for answering "How does RAG work?"

* System Design Logic: You will build a Semantic Cache system. Why? Because interviewers love asking how you plan to reduce costs and latency for 1 million users.

* Evaluation Metrics: You will code a "hallucination checker." Why? Because the most common follow-up question is "How do you know your AI isn't lying?"

The Problem

Let's imagine you are in an interview for a Junior AI Engineer role. The interviewer leans back and asks a classic question:

"So, I see you built a document chat bot. Can you walk me through how RAG (Retrieval-Augmented Generation) actually works under the hood?"

You get nervous. You know how it works—you built one! But your explanation comes out like this:

"Um, well, first you take the PDF and you, like, put it in the database. 

But not the whole PDF, you have to cut it up.

And then you use OpenAI to turn words into numbers.

And then when the user asks a question, you find the numbers that match

the other numbers, and then you give that to the AI and it answers."

The interviewer nods politely but looks unimpressed.

Why did this fail?

  • Imprecise Terminology: "Cut it up" instead of "Chunking." "Numbers" instead of "Embeddings."
  • No Architecture: It sounds linear and messy, missing the distinction between Ingestion and Inference.
  • Missing the "Why": You didn't explain why we compare numbers (Cosine Similarity) or why we give it to the AI (Context Injection).
This answer suggests you followed a tutorial but don't understand the engineering. It feels painful because you do know the engineering, but your words are betraying you.

    Let's fix this. We are going to write code that represents the perfect interview answer, then use that code to structure our verbal explanation.

    Let's Build It

    We will build three code snippets. In an interview, you won't necessarily run this code, but you will describe exactly this logic. If you can write this from scratch, you can explain it to anyone.

    Step 1: The "Perfect" RAG Explanation

    When asked "How does RAG work?", you should mentally visualize two pipelines: Ingestion (preparing data) and Retrieval (answering questions).

    Let's write a minimal, dependency-free simulation of this to ground your explanation.

    The Code (Mental Model):
    import numpy as np

    # A simple mock embedding function to simulate the concept.
    # In reality, this would be OpenAI or HuggingFace.
    def mock_embed(text):
        # Returns a random vector just for demonstration (seeded by text length)
        np.random.seed(len(text))
        return np.random.rand(5)

    class SimpleRAG:
        def __init__(self):
            self.vector_db = []  # The knowledge base (one vector per chunk)
            self.chunks = []     # The actual text storage

        # PART 1: INGESTION PIPELINE
        # Explain this: "First, we ingest the data by chunking and embedding it."
        def ingest(self, text, chunk_size=20):
            # 1. Chunking
            words = text.split()
            current_chunks = [
                " ".join(words[i:i + chunk_size])
                for i in range(0, len(words), chunk_size)
            ]

            # 2. Embedding & Indexing
            for chunk in current_chunks:
                vector = mock_embed(chunk)
                self.vector_db.append(vector)
                self.chunks.append(chunk)

            print(f"Ingested {len(current_chunks)} chunks.")

        # PART 2: RETRIEVAL PIPELINE
        # Explain this: "Then, we retrieve relevant context using cosine similarity."
        def retrieve(self, query):
            # 1. Embed the query
            query_vector = mock_embed(query)

            # 2. Vector search (cosine similarity)
            scores = []
            for doc_vector in self.vector_db:
                dot_product = np.dot(query_vector, doc_vector)
                norm_a = np.linalg.norm(query_vector)
                norm_b = np.linalg.norm(doc_vector)
                similarity = dot_product / (norm_a * norm_b)
                scores.append(similarity)

            # 3. Get the top result
            best_idx = np.argmax(scores)
            return self.chunks[best_idx]

        # PART 3: GENERATION
        # Explain this: "Finally, we inject context into the prompt."
        def generate(self, query):
            context = self.retrieve(query)
            prompt = f"""
            System: You are a helpful assistant.
            Context: {context}
            User Question: {query}
            """
            return prompt

    # Let's run the mental model
    rag = SimpleRAG()
    text_data = "RAG stands for Retrieval Augmented Generation. It combines a retriever with a generator. This helps reduce hallucinations."
    rag.ingest(text_data, chunk_size=5)

    final_prompt = rag.generate("What is RAG?")
    print("--- Final Prompt Sent to LLM ---")
    print(final_prompt)

    Why this matters:

    This code is your script. When answering, you literally describe these functions: "My system has an ingestion pipeline that chunks and embeds text, and a retrieval pipeline that performs cosine similarity search before prompting the LLM."

    Step 2: System Design (Scaling to 1M Users)

    The interviewer asks: "Your chatbot works great for one user. How would you design it to handle 1 million users without bankrupting us?"

    The wrong answer is: "I would get a bigger server."

    The right answer involves Caching. LLM calls are expensive and slow. If 1,000 people ask "What is the refund policy?", you should only call the LLM once.

    Let's build a Semantic Cache.

    import numpy as np

    class SemanticCache:
        def __init__(self):
            self.cache = {}        # Maps cached query text -> response
            self.threshold = 0.95  # How similar queries must be to hit the cache

        def get_embedding(self, text):
            # Simulating embedding again
            np.random.seed(len(text))
            return np.random.rand(5)

        def check_cache(self, user_query):
            query_vec = self.get_embedding(user_query)

            # Check against all cached queries (in production, use a Vector DB for this)
            best_score = -1
            best_response = None
            for cached_query_text, response in self.cache.items():
                cached_vec = self.get_embedding(cached_query_text)

                # Calculate similarity
                similarity = np.dot(query_vec, cached_vec) / (
                    np.linalg.norm(query_vec) * np.linalg.norm(cached_vec)
                )
                if similarity > best_score:
                    best_score = similarity
                    best_response = response

            if best_score > self.threshold:
                return best_response, best_score
            return None, best_score

        def add_to_cache(self, user_query, llm_response):
            self.cache[user_query] = llm_response

    # Simulation
    system = SemanticCache()

    # User 1 asks a question (expensive LLM call)
    q1 = "How do I reset my password?"
    print(f"User 1: {q1}")

    # Simulate the LLM call
    llm_response = "Go to settings and click 'Reset Password'."
    system.add_to_cache(q1, llm_response)
    print("-> Added to cache.\n")

    # User 2 asks a SLIGHTLY different question
    q2 = "How can I change my password?"
    print(f"User 2: {q2}")

    cached_resp, score = system.check_cache(q2)
    if cached_resp:
        print(f"-> CACHE HIT! (Similarity: {score:.4f})")
        print(f"-> Returning: {cached_resp}")
    else:
        print("-> Cache Miss. Calling LLM...")

    Why this matters:

    This demonstrates architectural maturity. You aren't just calling APIs; you are thinking about latency and cost. You can explain: "To scale to 1M users, I implemented semantic caching. If a user asks a question semantically similar to a previous one, we serve the cached response instantly, skipping the expensive LLM call."

    Step 3: Handling Hallucinations

    The interviewer asks: "How do you prevent the AI from making things up?"

    You cannot prevent it 100%, but you can detect it. Let's build a simple "Grounding Check."

    def check_grounding(response, context):
        """
        A simple heuristic check. In production, use an LLM-as-a-Judge.
        Here, we check if key terms from the response exist in the context.
        """
        response_words = set(response.lower().replace('.', '').split())
        context_words = set(context.lower().replace('.', '').split())

        # Calculate overlap
        overlap = response_words.intersection(context_words)
        score = len(overlap) / len(response_words)
        return score

    # Example
    context = "The store is open from 9 AM to 5 PM on weekdays."
    good_response = "We are open 9 AM to 5 PM."
    bad_response = "We are open 24/7 on weekends."

    score_good = check_grounding(good_response, context)
    score_bad = check_grounding(bad_response, context)

    print(f"Good Response Grounding Score: {score_good:.2f}")
    print(f"Bad Response Grounding Score: {score_bad:.2f}")

    Why this matters:

    This gives you a concrete answer: "I implement a post-generation verification step. I compare the generated response against the retrieved context to calculate a 'fact overlap' score. If the score is too low, I don't show the answer to the user."
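    If the interviewer asks what happens when the score is too low, you can sketch the gate itself. Here is a minimal sketch that builds on check_grounding above; the 0.5 threshold and the fallback message are illustrative choices, not standards:

    def answer_with_verification(response, context, threshold=0.5):
        # Hypothetical gate: compute the grounding score and refuse to show
        # answers that are poorly supported by the retrieved context.
        score = check_grounding(response, context)
        if score < threshold:
            return "I can't verify that answer against the documents I have."
        return response

    print(answer_with_verification(good_response, context))  # passes the gate
    print(answer_with_verification(bad_response, context))   # falls back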

    Now You Try

    Take the code concepts above and extend them to cover three more interview "gotcha" questions.

    1. The "Re-ranking" Question

    Interviewers often ask: "Vector search sometimes retrieves irrelevant documents. How do you fix that?"

    * Task: Modify the SimpleRAG class. After getting the top 5 results from vector search, implement a dummy rerank method that re-sorts them based on a secondary logic (e.g., length of text or presence of a specific keyword).

    * Goal: Be able to explain "Cross-Encoder Re-ranking" (retrieving many documents cheaply, then sorting the best ones using a smarter model).
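    If you want a starting point before trying it yourself, one possible sketch (not the full solution) is a standalone rerank helper; here, keyword presence and chunk length stand in for a real cross-encoder score:

    def rerank(query, candidate_chunks, keyword=None):
        # Dummy re-ranker: a production system would score each (query, chunk)
        # pair with a cross-encoder model instead of this heuristic.
        def score(chunk):
            keyword_hit = 1 if keyword and keyword.lower() in chunk.lower() else 0
            return (keyword_hit, -len(chunk))  # keyword matches first, then shorter chunks
        return sorted(candidate_chunks, key=score, reverse=True)

    # Example: pretend rag.chunks (from the SimpleRAG instance in Step 1)
    # are the top-5 results returned by vector search.
    print(rerank("What is RAG?", rag.chunks, keyword="RAG"))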

    2. The "Streaming" Question

    Interviewers ask: "The bot takes 10 seconds to answer. Users are leaving. What do you do?"

    * Task: Write a Python generator function (using yield) that simulates streaming a response word-by-word with a small time delay (time.sleep(0.1)).

    * Goal: Demonstrate you understand how to improve User Experience (UX) by showing progress immediately, rather than waiting for the full generation.
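    As a hint, a minimal sketch of such a generator might look like this; the word-by-word split and the 0.1-second delay are stand-ins for real token streaming from an LLM API:

    import time

    def stream_response(full_text, delay=0.1):
        # Simulate token streaming: yield one word at a time with a small pause.
        for word in full_text.split():
            yield word + " "
            time.sleep(delay)

    # The user sees output immediately instead of waiting for the full answer.
    for token in stream_response("RAG combines a retriever with a generator."):
        print(token, end="", flush=True)
    print()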

    3. The "Hybrid Search" Question

    Interviewers ask: "Vector search fails on specific product names (like 'Model X-99'). Why?"

    * Task: Write a search function that takes a query and checks for exact keyword matches first. If it finds an exact match, return that. If not, proceed to vector search.

    * Goal: Explain that embeddings capture meaning, but sometimes we need exact keywords (BM25/Keyword Search).
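    One possible sketch (a hint, not the full answer) puts a plain keyword pass in front of the vector search from the SimpleRAG class in Step 1:

    def hybrid_search(rag, query):
        # 1. Exact keyword pass: catches literal strings like "Model X-99"
        #    that embeddings tend to blur together.
        for chunk in rag.chunks:
            if query.lower() in chunk.lower():
                return chunk
        # 2. Otherwise fall back to semantic (vector) search.
        return rag.retrieve(query)

    print(hybrid_search(rag, "retriever"))  # exact keyword hit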

    Challenge Project: The Mock Interview Simulation

    Your challenge today is not just to write code, but to perform. You will simulate a technical interview.

    Requirements:
  • Prepare your "Capstone Pitch": Write down a 2-minute summary of your capstone project. It must follow this structure:
    * The Problem: What pain point did you solve?

    * The Solution: What did you build (high-level)?

    * The Tech Stack: Specifically mention Python, LangChain/LlamaIndex, Vector DB, etc.

    * The Hardest Part: One specific technical challenge you overcame.

  • Record or Present:
    * If you have a study partner, present to them.

    * If solo, record a voice memo on your phone.

  • The Technical Question: Answer the question: "How would you handle a situation where the user asks a question that is not in your document database?"
    * Write a small code snippet (pseudo-code is fine) demonstrating a "fallback mechanism" or a "router" that detects unrelated queries (see the sketch after the example pitch below).

    Example "Capstone Pitch" Structure:

    > "I built an autonomous research agent because I found manual Googling inefficient. My solution uses Python and OpenAI to recursively search the web. The tech stack includes ChromaDB for memory and Streamlit for the UI. The hardest challenge was the agent getting stuck in loops; I solved this by implementing a 'thought history' check that penalizes repetitive actions."

    What You Learned

    Today, you translated your coding skills into communication skills.

    * RAG Deep Dive: You learned to explain RAG not as magic, but as a pipeline of Chunking, Embedding, Retrieval, and Generation.

    * System Design: You learned that scaling AI isn't about bigger models, but about architecture (Caching, Latency, Hybrid Search).

    * Evaluation: You learned that "it looks good" isn't a metric—Grounding and Hallucination detection are.

    Why This Matters:

    You can be the best coder in the room, but if you can't articulate why you wrote the code, you won't get the job. These mental models act as your anchor during high-pressure interviews.

    Tomorrow: It all comes down to this. We take your resume, your portfolio, and your new interview skills, and we launch your career.