Running Local LLMs with Ollama

Sep 10, 2025

Not everything needs to hit a cloud API. For development, testing, and privacy-sensitive tasks, running models locally makes sense.

Install Ollama

# Linux (the install script does not support macOS)
curl -fsSL https://ollama.com/install.sh | sh

# macOS (Homebrew)
brew install ollama

Pull and Run a Model

# Start the service (runs in the foreground; skip if the desktop app or a system service already runs it)
ollama serve

# Pull models (one-time download each)
ollama pull llama3.1:8b      # 4.7 GB
ollama pull codellama:13b    # 7.4 GB
ollama pull mistral:7b       # 4.1 GB
ollama pull qwen2.5:14b      # 9.0 GB

# Chat
ollama run llama3.1:8b
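
To confirm from a script that the service is up and which models have been pulled, the native REST API can be queried. The following is a minimal sketch, assuming the default port 11434 and the `/api/tags` endpoint returning a `models` list with `name` and `size` fields. `format_models` and `list_local_models` are illustrative helper names, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address (assumption: not overridden)


def format_models(payload: dict) -> list[str]:
    """Turn an /api/tags response into 'name (X.X GB)' strings."""
    lines = []
    for model in payload.get("models", []):
        size_gb = model.get("size", 0) / 1e9  # size is reported in bytes
        lines.append(f"{model['name']} ({size_gb:.1f} GB)")
    return lines


def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Fetch the pulled models; raises URLError if the service is not running."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return format_models(json.load(resp))
```

Calling `list_local_models()` while the service is running should return one entry per pulled model, roughly the information `ollama list` prints.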

Use as an API

Ollama exposes an OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed",  # the SDK requires a value; Ollama ignores it
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "user", "content": "Explain CORS in 3 sentences"}
    ]
)
print(response.choices[0].message.content)

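This means most code written against the OpenAI SDK works with local models after changing three values: the base URL, the API key (any placeholder string), and the model name. Features that Ollama's compatibility layer does not implement will still fail, so test the calls you depend on.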
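
One way to keep that switch in a single place is a small settings helper. This is a sketch, not a recommended layout: `client_settings` and `make_client` are hypothetical names, and `gpt-4o-mini` is only a placeholder for whichever cloud model you use.

```python
def client_settings(use_local: bool) -> dict:
    """Return the values that differ between a local and a cloud setup."""
    if use_local:
        return {
            "base_url": "http://localhost:11434/v1",
            "api_key": "not-needed",  # the SDK wants a value; Ollama ignores it
            "model": "llama3.1:8b",
        }
    return {
        "base_url": None,  # None lets the SDK use its default endpoint
        "api_key": None,   # None makes the SDK read OPENAI_API_KEY from the environment
        "model": "gpt-4o-mini",  # placeholder cloud model (assumption)
    }


def make_client(use_local: bool):
    """Build the client lazily so this module imports without the SDK installed."""
    from openai import OpenAI

    settings = client_settings(use_local)
    client = OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
    return client, settings["model"]
```

The rest of the code can then call `client.chat.completions.create(model=model, ...)` without knowing which backend it is talking to.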

Model Selection Guide

Model	Download Size	Best For	Approx. Speed (M2 Pro)
Llama 3.1 8B	4.7 GB	General chat, summarization	~40 tok/s
CodeLlama 13B	7.4 GB	Code generation, review	~25 tok/s
Mistral 7B	4.1 GB	Fast general tasks	~45 tok/s
Qwen 2.5 14B	9.0 GB	Chinese + English bilingual	~20 tok/s
Phi-3 Mini	2.2 GB	Quick tasks on limited RAM	~60 tok/s
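
Speeds vary with hardware, quantization, and prompt length, so it is worth measuring on your own machine. The sketch below assumes the native `/api/generate` endpoint with `"stream": false`, whose final response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). `tokens_per_second` and `measure` are illustrative names.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from a final /api/generate response.

    eval_count is the number of generated tokens and eval_duration is the
    time spent generating them, in nanoseconds.
    """
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        raise ValueError("response has no eval_duration")
    return response["eval_count"] / (duration_ns / 1e9)


def measure(model: str, prompt: str, base_url: str = "http://localhost:11434") -> float:
    """Send one non-streaming request and return the measured tok/s."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return tokens_per_second(json.load(resp))
```

For a quick check without any code, `ollama run llama3.1:8b --verbose` prints similar timing statistics after each reply.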

Practical Use Cases

Local RAG Development

Test your RAG pipeline without API costs:

ollama pull nomic-embed-text  # Local embeddings
ollama pull llama3.1:8b       # Local generation
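
The retrieval half of such a pipeline can be sketched in a few lines. This assumes a reasonably recent Ollama version that serves the `/api/embed` endpoint (request field `input`, response field `embeddings`). `embed`, `cosine`, and `top_k` are illustrative helpers, and a brute-force similarity scan stands in for a real vector store.

```python
import json
import math
import urllib.request


def embed(texts: list[str], model: str = "nomic-embed-text",
          base_url: str = "http://localhost:11434") -> list[list[float]]:
    """Embed texts with the local embedding model."""
    body = json.dumps({"model": model, "input": texts}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embeddings"]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Indices of the k documents most similar to the query, best first."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

In use, you would embed the document chunks once, embed each query, pass both to `top_k`, and paste the selected chunks into the prompt sent to `llama3.1:8b`.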

Commit Message Generation

git diff --cached | ollama run llama3.1:8b "Write a concise commit message for this diff:"
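
The one-liner sends the whole diff, which can overflow the model's context window on large commits. A Python variant that caps the diff size might look like the sketch below. The 8000-character budget is an arbitrary assumption to tune per model, and `build_prompt` and `suggest_commit_message` are hypothetical names.

```python
import subprocess

MAX_DIFF_CHARS = 8000  # hypothetical budget; tune to the model's context window


def build_prompt(diff: str, max_chars: int = MAX_DIFF_CHARS) -> str:
    """Prefix the instruction and cut the diff so the prompt stays bounded."""
    if len(diff) > max_chars:
        diff = diff[:max_chars] + "\n[diff truncated]"
    return "Write a concise commit message for this diff:\n\n" + diff


def suggest_commit_message(model: str = "llama3.1:8b") -> str:
    """Run the same pipeline as the shell one-liner, with a size cap."""
    diff = subprocess.run(
        ["git", "diff", "--cached"],
        capture_output=True, text=True, check=True,
    ).stdout
    if not diff.strip():
        return ""  # nothing staged
    result = subprocess.run(
        ["ollama", "run", model],
        input=build_prompt(diff),  # the prompt is passed on stdin
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Treat the result as a draft to edit, not something to commit unread.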

Code Review in CI

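Run a local model in CI to catch obvious issues without API keys. The step below assumes Ollama is already installed and running on the runner and that the model was pulled (or restored from a cache) in an earlier step: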

- name: AI Code Review
  run: |
    git diff origin/main...HEAD | ollama run codellama:13b \
      "Review this diff for bugs, security issues, and style problems. Be concise."

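When the service is started in the background inside a CI job, the first request can race the server startup. A readiness poll avoids that. The sketch below assumes the root URL answers with HTTP 200 once Ollama is listening. `server_is_up` and `wait_for_server` are illustrative names, and the timeout values are arbitrary.

```python
import time
import urllib.request


def server_is_up(base_url: str = "http://localhost:11434") -> bool:
    """True if the Ollama root endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # URLError and connection errors are OSError subclasses
        return False


def wait_for_server(probe=server_is_up, timeout_s: float = 30.0,
                    interval_s: float = 0.5, sleep=time.sleep) -> bool:
    """Poll until probe() succeeds or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A CI script could call `wait_for_server()` right after launching `ollama serve` in the background and fail the job early if it returns `False`.
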
Limitations

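The quality gap with cloud models (Claude, GPT-4) is real for complex reasoning.
8B models struggle with multi-step logic.
Tool use works only with models trained for it (Llama 3.1 is one), and small models tend to call tools less reliably than cloud models.
A GPU, or Apple Silicon with enough unified memory, is recommended for anything above 7B parameters.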

Use local models for development iteration and privacy-sensitive tasks. Use cloud models for production quality.