Day 26 of 80

Open Source Models via APIs

Phase 3: LLM Landscape & APIs

What You'll Build Today

Welcome to Day 26. Today marks a significant shift in your journey. Up until now, we have relied almost exclusively on OpenAI's GPT models. They are powerful, smart, and easy to use. But they are also a "walled garden." You don't know exactly how they work, your data leaves your control, and the costs can scale up quickly.

Today, we break out of the walled garden. We are entering the world of Open Source models—specifically Meta's Llama 3 and Mistral.

You are going to build a "Speed Chat" application. While GPT-4 is smart, it can be slow. Today, you will use a provider called Groq that runs open-source models on specialized hardware, delivering results so fast it feels like the AI is reading your mind before you finish typing.

Here is what you will learn and why it matters:

* The Open Source Landscape: You will learn the difference between proprietary models (GPT-4, Claude) and open weights models (Llama 3, Mistral), giving you the freedom to choose the right tool for the job.

* Hosted Inference Providers: You will learn how to use platforms like Groq, Together AI, or Fireworks. These services host massive open-source models for you, so you don't need a $30,000 computer to run them.

* OpenAI Compatibility: You will learn the industry standard pattern for APIs. This is a superpower that allows you to swap out the "brain" of your application without rewriting your code.

* Latency vs. Intelligence: You will learn how to make tradeoffs. Sometimes you need a genius (GPT-4), but sometimes you just need a fast, capable assistant (Llama 3 on Groq).

Let's get started.

The Problem

Imagine you have built a customer service bot for a client using OpenAI's GPT-4. It works great during testing. But then, two things happen.

First, the client looks at the bill. Every time a user says "Hello," it costs a fraction of a cent. With 10,000 users a day, that bill is now hundreds of dollars a month.

Second, the client's legal team calls you. They say, "We cannot send our private customer data to OpenAI to be used for model training. We need a private model."

You panic. You look at the open-source world and find a different provider. You look at their documentation, and it looks like this:

```python
# Hypothetical code for "BadProvider AI"
import bad_provider

# This looks nothing like the OpenAI code you spent weeks learning!
response = bad_provider.generate_text(
    prompt_text="Hello",
    settings={
        "temp": 0.7,
        "max_len": 100
    },
    auth_token="xyz"
)

print(response['data']['result']['text'])
```

This is the pain point.

  • Vendor Lock-in: If you write your entire application specifically for OpenAI's library, switching to a cheaper or more private model means rewriting your entire codebase.
  • Cost: You are paying premium prices for simple tasks. You don't need Einstein to summarize a 50-word email, but you are paying for Einstein anyway.
  • Speed: Standard APIs can be sluggish. Waiting 5 seconds for a simple response kills user engagement.
There has to be a way to switch models instantly, reduce costs by 90%, and increase speed by 10x, all without rewriting your application logic.

Let's Build It

The solution lies in OpenAI-Compatible Endpoints.

The industry has largely agreed that the way OpenAI structures its API (chat completions built from a list of messages) is the standard. Because of this, providers like Groq, Together AI, and Ollama (for running local models) have built their APIs to look exactly like OpenAI's.

We are going to use Groq. Groq is a hardware company that created the LPU (Language Processing Unit), which runs Large Language Models (LLMs) incredibly fast. We will access the Llama 3 model via Groq, but we will use the standard Python tools you already know.
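To make "compatible" concrete, here is a minimal sketch of the swap pattern: only the base_url, the API key, and the model name change between providers, while the client code stays the same. The URLs and model IDs below are illustrative examples (Together AI and Fireworks follow the same pattern); check each provider's documentation for current values.

```python
import os
from openai import OpenAI

# Example provider configurations (illustrative values; verify URLs and model IDs
# in each provider's docs, and make sure the relevant keys are set in your environment).
PROVIDERS = {
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "gpt-3.5-turbo",
    },
    "groq": {
        "base_url": "https://api.groq.com/openai/v1",
        "api_key": os.getenv("GROQ_API_KEY"),
        "model": "llama3-8b-8192",
    },
    "ollama": {  # a local server running on your own laptop
        "base_url": "http://localhost:11434/v1",
        "api_key": "ollama",  # Ollama ignores the key, but the client requires a non-empty string
        "model": "llama3",
    },
}

def get_client(provider):
    """Return an OpenAI-compatible client and a default model for the chosen provider."""
    cfg = PROVIDERS[provider]
    return OpenAI(api_key=cfg["api_key"], base_url=cfg["base_url"]), cfg["model"]
```

The rest of your application only ever talks to the client object, so swapping providers becomes a one-line change.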

Step 1: Setup and API Keys

First, you need a Groq API key.

  • Go to console.groq.com.
  • Sign up (it is currently free to start).
  • Create an API Key.
  • Save it in your .env file as GROQ_API_KEY.
Next, we need the library. Surprisingly, we don't strictly need a groq library; we can use the openai library we already have! However, to be safe and ensure we have the latest dependencies, let's install/upgrade the openai package.

```bash
pip install --upgrade openai python-dotenv
```

Now, create a file named speed_chat.py. We will start by importing our libraries and setting up the client.

Crucial Concept: We are initializing the OpenAI client, but we are tricking it. Instead of pointing it to OpenAI's servers, we are pointing it to Groq's servers using the base_url parameter.

```python
import os
import time

from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables
load_dotenv()

# Get the Groq API key
api_key = os.getenv("GROQ_API_KEY")
if not api_key:
    print("Error: GROQ_API_KEY not found in .env file")
    exit()

# THE MAGIC TRICK:
# We use the standard OpenAI client, but we change the base_url.
# This tells the library: "Use the OpenAI format, but send the data to Groq."
client = OpenAI(
    api_key=api_key,
    base_url="https://api.groq.com/openai/v1"
)

print("Client initialized and linked to Groq!")
```

Step 2: Your First Llama 3 Call

Now let's verify it works. We will ask a simple question. Note that when we select the model, we cannot ask for gpt-4, because Groq doesn't host GPT-4. We must ask for a model Groq actually has, like llama3-8b-8192 (Llama 3, 8 billion parameters).
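If you are ever unsure which model IDs a provider actually hosts, most OpenAI-compatible endpoints (Groq included, at the time of writing) also expose the model-listing route, so you can ask the server directly instead of guessing. A quick sketch using the client we just created, assuming the provider implements /models:

```python
# Ask the server which models it hosts (assumes the provider implements the /models route).
available = client.models.list()
for model in available.data:
    print(model.id)
```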

Add this to your code:

```python
try:
    print("\nSending request to Llama 3...")

    completion = client.chat.completions.create(
        model="llama3-8b-8192",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in one sentence."}
        ]
    )

    print("\nResponse:")
    print(completion.choices[0].message.content)

except Exception as e:
    print(f"An error occurred: {e}")
```

Run this code. You should see a response almost instantly.

Step 3: Measuring the Speed

The main reason we are using Groq today is speed. But "fast" is subjective. As an engineer, you need to measure it.

We will wrap our call in a timer. We will measure the "wall clock" time: how long it takes from the moment we send the request to the moment we get the full answer.

Replace the previous request code with this detailed measurement block:

    print("\n--- Speed Test ---")
    

    prompt = "Write a short poem about a robot running very fast."

    start_time = time.time()

    completion = client.chat.completions.create(

    model="llama3-8b-8192",

    messages=[

    {"role": "user", "content": prompt}

    ]

    )

    end_time = time.time()

    duration = end_time - start_time

    content = completion.choices[0].message.content

    token_count = completion.usage.completion_tokens

    print(f"Prompt: {prompt}")

    print(f"\nResponse:\n{content}\n")

    print("-" * 30)

    print(f"Time taken: {duration:.4f} seconds")

    print(f"Tokens generated: {token_count}")

    print(f"Speed: {token_count / duration:.2f} tokens per second")

Why this matters: OpenAI's GPT-4 usually runs at roughly 20-40 tokens per second. When you run this code with Groq, you might see speeds of 300, 500, or even 800 tokens per second. That is the difference between reading a book and having the book downloaded into your brain.
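Note that wall-clock time blends two different numbers: how long you wait before the first word appears (latency) and how quickly words arrive after that (throughput). If you want to see them separately, you can stream the response and timestamp the first chunk. Here is a rough sketch under the same setup:

```python
# Separate time-to-first-token (latency) from total generation time by streaming.
start = time.time()
first_token_time = None
pieces = []

stream = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[{"role": "user", "content": "Write a short poem about a robot running very fast."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue  # some providers send a final metadata-only chunk
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_time is None:
            first_token_time = time.time()
        pieces.append(delta)

print("".join(pieces))
print(f"Time to first token: {first_token_time - start:.3f} seconds")
print(f"Total time: {time.time() - start:.3f} seconds")
```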

Step 4: The Speed Chat Loop

Now, let's turn this into a proper chatbot. We want a continuous loop where we can chat with Llama 3. Because it is so fast, it feels much more conversational than older models.

We will also add a system prompt to give Llama a personality. Llama 3 is very obedient to system prompts.

```python
def start_speed_chat():
    # 1. Define the system personality
    system_prompt = {
        "role": "system",
        "content": "You are a concise, witty AI. You answer quickly and get straight to the point."
    }

    chat_history = [system_prompt]

    print("\n=== Llama 3 Speed Chat (Type 'quit' to exit) ===")

    while True:
        # 2. Get user input
        user_input = input("\nYou: ")
        if user_input.lower() in ['quit', 'exit']:
            print("Goodbye!")
            break

        # 3. Add user message to history
        chat_history.append({"role": "user", "content": user_input})

        try:
            start_ts = time.time()

            # 4. Make the API call
            response = client.chat.completions.create(
                model="llama3-8b-8192",
                messages=chat_history,
                temperature=0.7,
                max_tokens=1024
            )

            end_ts = time.time()

            # 5. Extract and print the response
            ai_message = response.choices[0].message.content
            print(f"Llama ({end_ts - start_ts:.3f}s): {ai_message}")

            # 6. Add the AI response to history so it remembers context
            chat_history.append({"role": "assistant", "content": ai_message})

        except Exception as e:
            print(f"Error: {e}")


# Run the function
if __name__ == "__main__":
    start_speed_chat()
```

Common Mistakes

  • Wrong Model Name: If you try to use gpt-3.5-turbo with the Groq URL, it will fail. The base_url determines where the request goes, but the model name must exist on that server.
  • Environment Variables: Forgetting to change the variable name from OPENAI_API_KEY to GROQ_API_KEY in your .env file (or pointing the client to the wrong one).
  • Context Window: Llama 3 has a limit on how much text it can remember. If you chat for hours, eventually the chat_history list will be too big for the model. In production, you would trim the oldest messages (see the sketch below).
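A minimal sketch of that trimming idea, keeping the system prompt plus only the most recent messages. The cutoff of 20 is an arbitrary example; a production system would usually count tokens instead.

```python
MAX_MESSAGES = 20  # arbitrary example cutoff; real systems usually count tokens

def trim_history(chat_history):
    """Keep system messages plus only the most recent non-system messages."""
    system_messages = [m for m in chat_history if m["role"] == "system"]
    recent_messages = [m for m in chat_history if m["role"] != "system"][-MAX_MESSAGES:]
    return system_messages + recent_messages
```

You would call chat_history = trim_history(chat_history) just before each API call.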
Now You Try

You have a working speed chat. Now let's push the boundaries.

  • The Model Swap: Groq hosts other models besides Llama 3, including Mixtral 8x7B (another famous open-source model, from the French company Mistral AI). A possible starting point is sketched after this list.

    * Find the model ID for Mixtral (Mixtral-8x7b) in the Groq documentation.

    * Modify your code to ask the user at the start: "Choose your fighter: (1) Llama 3 or (2) Mixtral".

    * Pass the chosen model string to the API call.

  • The Verbosity Slider: Llama 3 is very sensitive to system prompts.

    * Add a setup step where the user can choose a "personality mode": Concise, Verbose, or ELI5 (Explain Like I'm 5).

    * Change the system_prompt content based on this choice before the loop starts.

  • The Cost Estimator: Even though Groq is free right now (or very cheap), let's practice tracking usage.

    * Create a variable total_tokens = 0.

    * After every response, add response.usage.total_tokens to your counter.

    * When the user quits, print: "Total session tokens used: X".
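If you want a nudge, here is one possible starting point for the Model Swap and the Cost Estimator. It is a simplified single-turn loop (no chat history), it reuses the client from earlier in the file, and the Mixtral model ID is an example; confirm the current name in the Groq documentation.

```python
# A possible starting point for "The Model Swap" and "The Cost Estimator".
MODELS = {"1": "llama3-8b-8192", "2": "mixtral-8x7b-32768"}  # example IDs; check Groq's docs

choice = input("Choose your fighter: (1) Llama 3 or (2) Mixtral: ").strip()
model_name = MODELS.get(choice, "llama3-8b-8192")

total_tokens = 0
while True:
    user_input = input("\nYou: ")
    if user_input.lower() in ["quit", "exit"]:
        print(f"Total session tokens used: {total_tokens}")
        break

    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": user_input}],
    )
    print(f"AI: {response.choices[0].message.content}")
    total_tokens += response.usage.total_tokens
```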

Challenge Project: The "Drop-in" Test

The ultimate test of the "OpenAI Compatible" concept is to take code that was written specifically for OpenAI and make it run on Groq with minimal changes.

The Goal:

Take your chatbot script from Day 22 (The CLI Chatbot with History). Make it run on Groq's Llama 3 model.

Requirements:
  • Locate your Day 22 Python file.
  • You are allowed to change only the lines involving the client initialization and the model name.
  • You must not change the logic of the loop, the way messages are appended, or how the response is printed.
  • The bot must function exactly as it did before, but now running at lightning speed on open-source hardware.
Example of the only changes allowed:

Old Code:

```python
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# ... later ...
model="gpt-3.5-turbo"
```

New Code:

```python
client = OpenAI(
    api_key=os.getenv("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1"
)
# ... later ...
model="llama3-8b-8192"
```

Hints:

* If your Day 22 code didn't use .env files properly, you might need to fix that first.

* If your Day 22 code used specific OpenAI features like "Functions" or "JSON Mode," they might behave slightly differently, but standard chat will work perfectly.

What You Learned

Today you broke the chains of vendor lock-in. You learned:

  • Base URL Magic: By changing the base_url, you can point standard OpenAI client libraries to other providers like Groq, Together AI, or even a local server running on your laptop.
  • Open Source Power: You experienced Llama 3, a model that rivals proprietary models but is open for anyone to use.
  • Inference Speed: You saw the difference between "thinking time" (latency) and "generation speed" (throughput), and how specialized hardware (Groq LPU) impacts this.
Why This Matters:

In the real world, you rarely stick with one provider forever. Prices change. Models improve. Privacy requirements tighten. By building your applications using compatible patterns, you make your software "future-proof." You can switch from OpenAI to Llama to Mistral in five minutes, not five weeks.

Tomorrow: Now that you have access to multiple models, how do you know which one to use? Tomorrow we cover multi-provider patterns. We will build a "Router" that sends easy questions to the cheap, fast model (Llama 3) and hard questions to the smart, expensive model (GPT-4) automatically.