Day 63 of 80

Local LLMs with Ollama

Phase 7: Advanced Techniques

What You'll Build Today

Welcome to Day 63! Today marks a significant shift in your journey. Up until now, we have relied entirely on "Cloud AI." We send data to OpenAI or Anthropic, they process it on massive server farms, and send the answer back.

Today, you are going to bring the AI home. You will run a Large Language Model (LLM) entirely on your own computer.

By the end of today, you will have a custom AI assistant running on your local hardware, capable of working offline, costing you exactly $0, and keeping your data 100% private.

Here is what you will master:

* Local Inference: How to run powerful models like Llama 3.1 on consumer hardware so you are not dependent on internet access or API credits.

* Quantization: Understanding how we shrink massive AI models to fit into your laptop's memory without losing too much intelligence.

* Modelfiles: How to "bake" a system prompt into a model permanently, creating specialized tools (like a "Coding Tutor" or "Legal Analyst") that you can share with others.

* Hardware Reality: Learning what your computer can actually handle regarding RAM and GPU usage.

The Problem

Imagine you have been hired by a healthcare startup to process patient intake forms. The forms contain names, medical history, and insurance details. You need an AI to summarize these notes.

You write a script using the OpenAI API, just like we learned in Phase 2.

Here is the code you might write:

```python
import os
# Pretend this is the OpenAI client setup
# from openai import OpenAI
# client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

patient_data = """
Patient: John Doe
DOB: 01/01/1980
Condition: Reporting severe migraines and sensitivity to light.
History: Family history of hypertension.
"""

print("Sending data to external server...")

# The API call looks like this:
# response = client.chat.completions.create(
#     model="gpt-4o",
#     messages=[
#         {"role": "system", "content": "Summarize this medical record."},
#         {"role": "user", "content": patient_data},
#     ],
# )
# print(response.choices[0].message.content)
```
The Pain:

  • Privacy Nightmare: You just sent "John Doe's" medical info to a third-party server. In highly regulated industries (Healthcare/HIPAA, Finance, Legal), this is often illegal or strictly forbidden.
  • Cost at Scale: If you have 100,000 patient records and pay $0.01 per record, you just spent $1,000 to run this script once.
  • Latency & Reliability: If your internet cuts out, your application dies.

You might be thinking, "There has to be a way to have the intelligence of GPT without the privacy risk of the cloud."

    There is. It is called running a Local LLM.

    Let's Build It

    We are going to use a tool called Ollama. Ollama has revolutionized local AI because it packages complex setup processes into a simple experience, similar to how Docker handles software containers.

    Step 1: Installation and CLI Basics

    Before we write Python code, we need the engine.

  • Go to [ollama.com](https://ollama.com) and download the installer for your OS (Mac, Windows, or Linux).
  • Install it and run the application.
  • Open your terminal (Command Prompt or Terminal).
  • We need to pull a model. We will use llama3.1, which is Meta's open-source model. It is incredibly capable and free to use.

    Type this in your terminal:

```bash
ollama run llama3.1
```

Note: The first time you run this, it will download about 4.7GB of data. This is the "brain" of the AI.

Once it finishes, you will see a prompt. Type "Why is the sky blue?" to verify it works. Then type /bye to exit.
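A few other CLI commands are worth knowing as you experiment (these are standard Ollama commands in recent versions, though the exact output varies):

```bash
ollama list          # show every model you have downloaded, with sizes
ollama ps            # show which models are currently loaded in memory
ollama rm llama3.1   # delete a model to free up disk space
```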

    Step 2: Python Integration

    Now that the engine is running in the background, let's control it with Python. First, install the library:

```bash
pip install ollama
```

    Now, create a file named local_bot.py. We will replicate the "Hello World" of local AI.

```python
import ollama

print("Thinking locally... (this relies on your hardware, not the internet)")

response = ollama.chat(model='llama3.1', messages=[
    {
        'role': 'user',
        'content': 'Explain quantum computing in one sentence.',
    },
])

print("\nResponse:")
print(response['message']['content'])
```

    Why this matters: This code looks almost identical to the OpenAI code, but it requires no API key and sends no data outside your machine.
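In fact, because the two interfaces are so similar, Ollama also exposes an OpenAI-compatible HTTP endpoint (on recent versions, at http://localhost:11434/v1). As a rough sketch, and assuming you have the openai package installed, existing OpenAI-based code can often be pointed at the local server just by changing the base URL:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server.
# The api_key is required by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain quantum computing in one sentence."}],
)
print(response.choices[0].message.content)
```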

    Step 3: Streaming for Better UX

    When you run the code above, you might notice a long pause before the text appears. This is because your computer is generating the entire answer in memory before showing it to you. Cloud APIs are fast; your laptop might be slower.

    To fix the "frozen" feeling, we use streaming. This prints each word (token) as it is generated.

    Update local_bot.py:

```python
import ollama

print("Ask me anything (type 'quit' to exit):")

while True:
    user_input = input("\nYou: ")
    if user_input.lower() == 'quit':
        break

    print("AI: ", end="", flush=True)

    # We enable stream=True
    stream = ollama.chat(
        model='llama3.1',
        messages=[{'role': 'user', 'content': user_input}],
        stream=True,
    )

    # We iterate over the stream as chunks arrive
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)

    print()  # New line at the end
```

    Why this matters: This makes the application feel responsive, even if your hardware is older.

    Step 4: Understanding Quantization

    You might wonder: "GPT-4 is massive. How does Llama 3.1 fit on my laptop?"

    The answer is Quantization.

Standard models store their weights as 16-bit or 32-bit floating-point numbers, which takes up a lot of space. Quantization reduces the precision of those numbers, for example down to 4 bits.

    * FP16 (16-bit): High precision, massive size.

    * Q4 (4-bit): Lower precision, much smaller size, surprisingly similar intelligence.

    When you ran ollama run llama3.1, it defaulted to a 4-bit quantized version. This is why it is only 4.7GB instead of 15GB+.

    You don't need to write code for this yet, but you need to know that q4 (4-bit) is the standard for running on laptops.
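To make those sizes concrete, here is a rough back-of-the-envelope calculation for the 8-billion-parameter variant that ollama run llama3.1 pulls by default (real files differ slightly because some layers stay at higher precision and the file includes metadata):

```python
# Approximate memory footprint of an 8B-parameter model at different precisions.
params = 8_000_000_000

fp16_gb = params * 2 / 1e9    # FP16: 2 bytes per weight   -> ~16 GB
q4_gb = params * 0.5 / 1e9    # Q4:   0.5 bytes per weight -> ~4 GB

print(f"FP16: ~{fp16_gb:.0f} GB")
print(f"Q4:   ~{q4_gb:.0f} GB (close to the ~4.7GB download you saw earlier)")
```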

    Step 5: Custom Modelfiles

    This is the most powerful part of Ollama.

    In the cloud, you have to send a "System Prompt" every single time you call the API (e.g., "You are a helpful assistant...").

    With Ollama, you can save a System Prompt into a new model. This creates a dedicated tool.

1. Create a file named Modelfile (no extension) in your project folder. Put this content inside it:

```dockerfile
FROM llama3.1

# Set the temperature to be low (very factual)
PARAMETER temperature 0.1

# Bake in the system instruction
SYSTEM """
You are Mario from the Super Mario Bros games.
You must start every response with "It's a-me, Mario!"
You use Italian-American slang and refer to the user as 'Player One'.
Keep answers short and enthusiastic.
"""
```

    
    2. Build the model.
    
    

    Run this command in your terminal (not Python, your actual terminal):

```bash
ollama create mario-bot -f Modelfile
```

    Ollama takes the base Llama 3.1 model, attaches your instructions, and saves it as mario-bot.
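If you want a quick sanity check before writing any Python, you can chat with the new model directly from the terminal:

```bash
ollama run mario-bot
```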

    3. Run it in Python.

    Create mario_test.py:

```python
import ollama

response = ollama.chat(model='mario-bot', messages=[
    {
        'role': 'user',
        'content': 'How do I bake a cake?',
    },
])

print(response['message']['content'])
```

    Output:

    "It's a-me, Mario! To bake a cake, Player One, you gotta mix the flour, eggs, and sugar like you're mixing power-ups! Put it in the oven until it rises like a platform! Wahoo!"

    Why this matters: You have created a specialized software asset. You can distribute this Modelfile to your team, and everyone will have the exact same "Mario" personality without needing to copy-paste system prompts into their code.

    Now You Try

    Here are three extensions to solidify your skills:

  • The Code Reviewer: Create a new Modelfile that instructs the model to act as a "Senior Python Architect." It should only accept code as input and return a bulleted list of improvements. Build it as architect-bot and test it with some messy code.
  • Try a Different Base: Llama isn't the only model. Go to your terminal and run ollama pull mistral. Change your Python script to use model='mistral'. Observe if the speed or writing style is different.
  • JSON Mode: Local models can output structured data too. Modify your ollama.chat call by adding the argument format='json'. Ask the model to "Generate a JSON object for a user named Alice with age 30." Note: You must mention "JSON" in your prompt for this to work reliably. (See the sketch after this list.)
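For the JSON Mode extension, a minimal sketch looks like the following (assuming the same ollama Python library from earlier; the exact keys the model returns will vary from run to run):

```python
import json
import ollama

response = ollama.chat(
    model='llama3.1',
    # Mentioning "JSON" in the prompt helps the model comply with the format.
    messages=[{'role': 'user', 'content': 'Generate a JSON object for a user named Alice with age 30. Respond in JSON.'}],
    format='json',  # constrains the output to valid JSON
)

# The content arrives as a JSON string, so parse it like any other JSON.
user = json.loads(response['message']['content'])
print(user)
```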
Challenge Project: The Local vs. Cloud Benchmark

    You are going to build a tool to make business decisions. Your boss wants to know: "Is local AI good enough, or do we need to pay for GPT-4?"

    Create a script benchmark.py that runs the same prompt on both systems and compares them.

    Requirements:

    * Define a complex prompt (e.g., "Write a 200-word email apologizing to a client for a delay caused by a supply chain issue").

    * Record the start time and end time for the Local Llama generation.

    * Record the start time and end time for OpenAI gpt-4o-mini (or gpt-3.5-turbo) generation.

    * Print the outputs side-by-side.

    * Print the time taken for each.

Hint: Use the Python time module: start = time.time() ... duration = time.time() - start. (A starter skeleton appears after the example output below.)

Example Output:
```
--- Local Llama 3.1 ---
Time taken: 4.2 seconds
Output: Dear Client, I am writing to sincerely apologize...

--- OpenAI GPT-4o-Mini ---
Time taken: 1.1 seconds
Output: Dear Valued Client, We regret to inform you...

--- Conclusion ---
Local model was 3.1 seconds slower.
```

    (Note: You will likely find the Cloud is faster, but Local is free. This is the trade-off you need to understand.)
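Here is a minimal starter skeleton for the timing half of the benchmark (the OpenAI portion is commented out and assumes you have the openai package installed and an OPENAI_API_KEY set, as in Phase 2):

```python
import time
import ollama
# from openai import OpenAI  # uncomment to benchmark the cloud side

PROMPT = ("Write a 200-word email apologizing to a client for a delay "
          "caused by a supply chain issue.")

# --- Local Llama 3.1 ---
start = time.time()
local = ollama.chat(model='llama3.1', messages=[{'role': 'user', 'content': PROMPT}])
local_duration = time.time() - start

print("--- Local Llama 3.1 ---")
print(f"Time taken: {local_duration:.1f} seconds")
print("Output:", local['message']['content'])

# --- OpenAI GPT-4o-Mini ---
# client = OpenAI()
# start = time.time()
# cloud = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": PROMPT}],
# )
# cloud_duration = time.time() - start
# print("--- OpenAI GPT-4o-Mini ---")
# print(f"Time taken: {cloud_duration:.1f} seconds")
# print("Output:", cloud.choices[0].message.content)
```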

    What You Learned

    Today you broke the dependency on the cloud.

    * Local Inference: You used ollama to run AI on your own metal.

    * Privacy: You processed data without it ever leaving your laptop.

    * Modelfiles: You learned to package prompts and parameters into reusable model definitions using FROM and SYSTEM.

* Quantization: You learned that q4 (4-bit) quantization allows massive intelligence to fit in consumer RAM.

    Why This Matters:

    In the real world, you cannot always send data to the cloud. Whether it is because of internet connectivity (drones, field work), privacy laws (GDPR, HIPAA), or cost constraints, Local LLMs are rapidly becoming a requirement for enterprise AI development.

    Tomorrow: We combine everything. You will build a Fully Local RAG system. You will chat with your own PDF documents without an internet connection and without paying a cent. See you then!