Day 16 of 80

Database Overview & Architecture

Phase 2: Software Foundations

What You'll Build Today

Welcome to Day 16. We have spent the last two weeks learning Python logic, functions, and how to call API endpoints. You can now build a script that sends a prompt to an AI and gets an answer.

But you have a major problem: your AI has amnesia. Every time you restart your script, it forgets everything the user just said. It forgets who the user is. It forgets the documents you gave it to read.

Today, we aren't writing a full application. Instead, we are designing the Architecture of an AI application. We will build a "Database Simulator" in Python to understand the three distinct types of memory an AI system needs.

Here is what you will learn:

* Relational Databases (SQL): Why we need rigid tables for user data (like Excel sheets).

* NoSQL Databases (Document Stores): Why we need flexible storage for chat history (like a file cabinet).

* Vector Databases: The secret sauce of AI—how to store "meaning" so the AI can find relevant information later.

* System Design: How to combine all three to build a ChatGPT clone.

Let's fix the amnesia.

The Problem

Imagine you have written a Python script for a mental health chatbot. A user, "Sarah," talks to it for 30 minutes about her anxiety. She closes the program. The next day, she opens it again.

Here is the code representing that situation:

# The "Stateless" Chatbot

class ChatBot:

def __init__(self):

self.memory = []

def chat(self, user_input):

self.memory.append(user_input)

return f"I heard you say: {user_input}"

# Simulation of Day 1

bot = ChatBot()

print("--- Day 1 ---")

print(bot.chat("Hi, I'm Sarah and I feel anxious."))

print(f"Current Memory: {bot.memory}")

# Simulation of Day 2 (Restarting the program) # When the program ends, the 'bot' variable is destroyed. # We have to create a new instance.

bot_new = ChatBot()

print("\n--- Day 2 ---")

print(bot_new.chat("Do you remember my name?"))

print(f"Current Memory: {bot_new.memory}")

The Output:

--- Day 1 ---
I heard you say: Hi, I'm Sarah and I feel anxious.
Current Memory: ["Hi, I'm Sarah and I feel anxious."]

--- Day 2 ---
I heard you say: Do you remember my name?
Current Memory: ['Do you remember my name?']

The Pain:

Sarah is frustrated. The bot has no idea who she is. To fix this, you might try saving the conversation to a text file history.txt.
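A sketch of that quick fix, using nothing but plain file I/O (history.txt is the file name mentioned above):

# The naive fix: append every message to a text file
def save_message(user_input):
    with open("history.txt", "a") as f:
        f.write(user_input + "\n")

def load_history():
    with open("history.txt", "r") as f:
        return f.read().splitlines()

save_message("Hi, I'm Sarah and I feel anxious.")
print(load_history())  # Survives a restart... but only barely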

But then you hit new problems:

  • Concurrency: What if Sarah and Mike talk to the bot at the same time? Writing to the same text file simultaneously will corrupt the data.
  • Search: If the history gets to be 10,000 lines long, reading the whole file to find "anxiety triggers" becomes incredibly slow.
  • Structure: How do you separate Sarah's password from her chat logs?
We need specialized tools for this. We need databases.

Let's Build It

We are going to write Python code that simulates how actual production databases work. We will build a "User Store" (SQL), a "Chat Log" (NoSQL), and a "Knowledge Base" (Vector DB).

Step 1: The Relational (SQL) Simulation

Relational databases (like PostgreSQL or SQLite) are strict. They are like Excel spreadsheets where you define the columns upfront. If you try to put text into an "Age" column, the database yells at you. This is perfect for User Accounts, where consistency is critical.

Let's simulate a strict User Table.

# Simulating a SQL Table
# In a real DB, this is defined by a "Schema"

users_table = []

def add_user(id, username, email, is_paid):
    # 1. Enforce Structure (The "Pain" of SQL, but also its safety)
    if not isinstance(id, int):
        print(f"Error: User {username} rejected. ID must be an integer.")
        return
    if not isinstance(is_paid, bool):
        print(f"Error: User {username} rejected. is_paid must be True/False.")
        return

    # 2. Create the row
    row = {
        "id": id,
        "username": username,
        "email": email,
        "is_paid": is_paid
    }
    users_table.append(row)
    print(f"Success: User {username} added to SQL Table.")

# Let's try to add data
add_user(1, "sarah_j", "sarah@email.com", True)
add_user(2, "mike_k", "mike@email.com", False)

# This will fail because the structure is wrong (ID is a string)
add_user("three", "bad_entry", "bad@email.com", True)

print("\n--- Current SQL Table ---")
for row in users_table:
    print(row)

Why this matters:

You need this strictness for login systems. You cannot have a user without a password or an ID. SQL databases ensure your critical business data is clean.
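For a preview of tomorrow's topic, here is roughly what that same strictness looks like in a real relational database, using Python's built-in sqlite3 module. This is a minimal sketch; the table and column names simply mirror our simulation.

import sqlite3

# An in-memory SQLite database (a file path would persist the data to disk)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        username TEXT NOT NULL,
        email TEXT NOT NULL,
        is_paid INTEGER NOT NULL
    )
""")

# Constraints like NOT NULL and INTEGER PRIMARY KEY are enforced by the
# database itself, not by hand-written Python checks.
conn.execute("INSERT INTO users VALUES (?, ?, ?, ?)",
             (1, "sarah_j", "sarah@email.com", 1))
conn.commit()

for row in conn.execute("SELECT * FROM users"):
    print(row)  # (1, 'sarah_j', 'sarah@email.com', 1)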

Step 2: The NoSQL (Document) Simulation

Now, think about the chat history. One conversation might be two lines long. Another might be 500 lines with images and code snippets. A rigid table is bad here. We need flexibility.

NoSQL databases (like MongoDB) store data as "Documents" (similar to Python Dictionaries or JSON).

import time

# Simulating a NoSQL Collection
chat_logs_collection = []

def save_chat_log(user_id, conversation_data):
    # NoSQL doesn't care about strict columns.
    # We just wrap the data in a document with a timestamp.
    document = {
        "timestamp": time.time(),
        "user_id": user_id,
        "data": conversation_data  # This can be ANY shape or size
    }
    chat_logs_collection.append(document)
    print(f"Saved chat log for User ID {user_id}")

# Scenario 1: Sarah has a simple text chat
sarah_chat = [
    {"role": "user", "text": "I feel anxious"},
    {"role": "bot", "text": "Tell me more."}
]
save_chat_log(1, sarah_chat)

# Scenario 2: Mike has a complex chat with metadata
mike_chat = {
    "session_name": "Coding Help",
    "messages": [
        {"role": "user", "text": "Debug this code"},
        {"role": "bot", "text": "Here is the fix..."}
    ],
    "language": "Python",
    "tokens_used": 150
}
save_chat_log(2, mike_chat)

print("\n--- Current NoSQL Collection ---")
for doc in chat_logs_collection:
    print(doc)

Why this matters:

Notice that Sarah's data is a List, but Mike's data is a Dictionary with extra keys like "tokens_used". The database didn't crash. NoSQL allows your data structure to evolve over time, which is perfect for messy chat logs.
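For comparison, here is a rough sketch of the same idea against a real document store. It assumes a MongoDB server running locally and the pymongo package installed; the database and collection names ("chatapp", "chat_logs") are just placeholders.

from pymongo import MongoClient

# Assumes MongoDB is running on the default local port
client = MongoClient("mongodb://localhost:27017")
collection = client["chatapp"]["chat_logs"]

# No schema declaration needed; the document can be any shape
collection.insert_one({
    "user_id": 2,
    "session_name": "Coding Help",
    "messages": [
        {"role": "user", "text": "Debug this code"},
        {"role": "bot", "text": "Here is the fix..."}
    ],
    "tokens_used": 150
})

# Later, pull back every conversation for that user
for doc in collection.find({"user_id": 2}):
    print(doc)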

Step 3: The Vector Database Simulation

This is the most important concept for modern AI.

Standard databases search for keywords. If you search for "dog", they find "dog".

Vector databases search for meaning. If you search for "dog", they might find "puppy" or "canine".

They do this by turning text into lists of numbers (vectors). Text with similar meanings has numbers that are "close" to each other mathematically.

import math

# A mock "Knowledge Base"
# In reality, an AI model creates these numbers.
# Here, we will manually assign coordinates to represent meaning.
# Imagine a 2D map:
# [0, 1] is "Animal concepts"
# [1, 0] is "Tech concepts"

vector_db = {
    "puppy": [0.1, 0.9],   # High on animal axis
    "kitten": [0.2, 0.8],  # High on animal axis
    "laptop": [0.9, 0.1],  # High on tech axis
    "server": [0.8, 0.2]   # High on tech axis
}

def get_similarity(query_vector, stored_vector):
    # A simplified distance check (Euclidean distance)
    # The smaller the distance, the more similar they are.
    x_diff = query_vector[0] - stored_vector[0]
    y_diff = query_vector[1] - stored_vector[1]
    distance = math.sqrt(x_diff**2 + y_diff**2)
    return distance

def vector_search(query_vector):
    print(f"Searching for vector closest to: {query_vector}")
    best_match = None
    lowest_distance = float('inf')  # Start with infinity

    for word, vector in vector_db.items():
        dist = get_similarity(query_vector, vector)
        print(f"Distance to '{word}': {dist:.2f}")
        if dist < lowest_distance:
            lowest_distance = dist
            best_match = word

    return best_match

# Let's search!
# We want to find something related to "computer".
# Let's say "computer" would be at [0.85, 0.15].
query = [0.85, 0.15]
result = vector_search(query)
print(f"\nWinner: The concept most similar to our query is '{result}'")

Why this matters:

The search query [0.85, 0.15] (representing "computer") was mathematically closest to "laptop" and "server". It was far away from "puppy". This allows AI to find relevant documents even if the user doesn't use the exact keywords.
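One caveat: our simulation ranks matches by Euclidean distance because it is easy to picture. Real vector databases often rank by cosine similarity instead, which compares the direction of two vectors rather than the gap between them. A small sketch of the same "computer" search with cosine similarity (higher score means more similar):

import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.85, 0.15]  # "computer"
for word, vec in {"puppy": [0.1, 0.9], "laptop": [0.9, 0.1]}.items():
    # laptop scores near 1.0, puppy much lower
    print(f"Cosine similarity to '{word}': {cosine_similarity(query, vec):.2f}")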

Step 4: The Architecture Diagram (Code Version)

Now we put them together. This is the architecture of a standard AI application (like a ChatGPT clone).

class AIChatApp:
    def __init__(self):
        self.sql_users = {}    # Structured User Data
        self.nosql_chats = []  # Flexible History
        self.vector_data = {}  # Knowledge Base

    def register_user(self, username):
        # SQL Logic
        self.sql_users[username] = {"active": True, "plan": "free"}
        print(f"[SQL] Registered {username}")

    def save_conversation(self, username, message):
        # NoSQL Logic
        record = {"user": username, "msg": message, "time": "now"}
        self.nosql_chats.append(record)
        print(f"[NoSQL] Saved chat for {username}")

    def find_context(self, query_concept):
        # Vector Logic (Simplified)
        print(f"[Vector] Searching knowledge base for context on '{query_concept}'...")
        return "Relevant Context Found"

# Running the Architecture
app = AIChatApp()
app.register_user("Sarah")
app.find_context("Anxiety help")
app.save_conversation("Sarah", "I need help with anxiety")
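To see how the three stores cooperate on a single request, here is a hypothetical handle_message helper. It is not part of the class above, just an illustration of the typical flow: check the user (SQL), pull relevant context (Vector), and persist the exchange (NoSQL). It reuses the app instance created above.

def handle_message(app, username, message):
    # 1. SQL: make sure the user exists before doing anything else
    user = app.sql_users.get(username)
    if user is None:
        return f"Unknown user: {username}"

    # 2. Vector: fetch knowledge relevant to this message
    context = app.find_context(message)

    # 3. (Here a real app would call the LLM with the context)
    reply = f"Using '{context}' to answer: {message}"

    # 4. NoSQL: persist both sides of the exchange for next time
    app.save_conversation(username, message)
    app.save_conversation(username, reply)
    return reply

print(handle_message(app, "Sarah", "I need help with anxiety"))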

Now You Try

It is time to extend our simulator.

  • Strict SQL Update: Modify the add_user function in Step 1. Add a check to ensure the email string contains an "@" symbol. If it doesn't, reject the data. (A starting point is sketched after this list.)
  • NoSQL Metadata: In Step 2, modify save_chat_log to automatically count how many items are in the list conversation_data and save that number as a new key, message_count, inside the document.
  • Vector Expansion: In Step 3, add a new word to the vector_db: "cheetah". Give it coordinates that place it close to "kitten" but slightly different (maybe [0.3, 0.9]). Run the search again with a query of [0.25, 0.85] and see if it picks cheetah or kitten.
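For the first exercise, the email check can follow the same pattern as the existing isinstance checks. Here is one possible starting point, written as a small standalone helper (is_valid_email is a hypothetical name; you could also put the check directly inside add_user):

def is_valid_email(email):
    # A very loose check: just require an "@" somewhere in the string
    return "@" in email

print(is_valid_email("sarah@email.com"))  # True
print(is_valid_email("bad-entry"))        # False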
Challenge Project: The AI Legal Assistant

You have been hired to design the data architecture for a Law Firm's AI. This AI helps lawyers search through millions of case files.

The Challenge:

Write a Python script that defines the Data Schema (the structure) for this application. You don't need to build the full database, just the Python dictionaries that represent how the data would look.

Requirements:

  • Users (SQL): Needs to store Lawyer Name, Bar Association ID (must be int), and Access Level (Junior/Senior).
  • Case Files (NoSQL): Needs to store messy notes. Some cases have judge's rulings, some have witness transcripts, some have jury notes. The structure varies.
  • Legal Precedents (Vector): Needs to store summaries of old laws. This needs an embedding field (a list of floats) so lawyers can search for "cases similar to theft" without knowing the exact case name.

Example Input/Output:

Your code should output a printed report of your schema design, looking something like this:

--- AI Legal Assistant Schema Design ---

[SQL Structure Example - Users]
{'name': 'Harvey Specter', 'bar_id': 992211, 'level': 'Senior'}

[NoSQL Structure Example - Case Notes]
{'case_id': 'A-100', 'notes': 'Witness was unreliable...', 'evidence_list': ['photo_a.jpg', 'email.txt']}

[Vector Structure Example - Precedents]
{'law_name': 'Smith v. Jones', 'summary_embedding': [0.12, 0.004, -0.9...]}

Hints:

* Use a dictionary to represent a single row/document for each type.

* Print these dictionaries to show your design.

* Think about why "Case Notes" fits NoSQL better than SQL (hint: do all cases have the exact same amount of evidence?).

What You Learned

Today you moved from "Coder" to "Architect." You learned that data isn't just variables in memory; it's the long-term memory of your application.

* SQL (Relational): Best for strict, structured data (Logins, Billing).

* NoSQL (Document): Best for messy, variable data (Chat logs, JSON blobs).

* Vector DB: Best for semantic search (Finding "meaning" rather than keywords).

Why This Matters:

In the coming days, when we build a real chatbot, we will use these concepts. We will use a Vector Store to let the AI "read" a PDF, and we will use a database to give the AI "memory" of previous conversations.

Tomorrow: We stop simulating and start using the real thing. We dive into SQL Basics and learn the language of structured data.