Dev Notes

Running Local LLMs with Ollama

Not everything needs to hit a cloud API. For development, testing, and privacy-sensitive tasks, running models locally makes sense.

Install Ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS (Homebrew)
brew install ollama

Pull and Run a Model

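# Start the server (runs in the foreground, so use a separate terminal;
# skip this if Ollama is already running as a background service)
ollama serve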

# Pull a model (one-time download)
ollama pull llama3.1:8b      # 4.7 GB
ollama pull codellama:13b    # 7.4 GB
ollama pull mistral:7b       # 4.1 GB
ollama pull qwen2.5:14b      # 9.0 GB

# Chat
ollama run llama3.1:8b
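
Before wiring a model into code, it can help to confirm that the server is up and the pull finished. The sketch below assumes Ollama is listening on its default port (11434) and uses the native GET /api/tags endpoint, which lists downloaded models. The helper names are made up for this example.

```python
import json
import urllib.request
from typing import List

OLLAMA_URL = "http://localhost:11434"  # assumption: default Ollama address


def model_names(tags_payload: dict) -> List[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> List[str]:
    """Ask a running Ollama server which models are already downloaded.

    Raises OSError (urllib.error.URLError) if the server is not reachable.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

Calling list_local_models() after the pulls above should return a list that includes "llama3.1:8b". A connection error usually means ollama serve is not running.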

Use as API

Ollama exposes an OpenAI-compatible API:

from openai import OpenAI

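# Point the OpenAI SDK at the local Ollama server instead of the cloud API
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed",  # the SDK requires a value; Ollama ignores it
)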

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "user", "content": "Explain CORS in 3 sentences"}
    ]
)
print(response.choices[0].message.content)

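This means most code written against the OpenAI SDK's chat completions interface works with local models after changing base_url and api_key on the client and swapping in a local model name.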
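
One way to keep that switch out of application code is to read it from the environment. This is a minimal sketch, not a prescribed pattern: LLM_BACKEND and CLOUD_MODEL are hypothetical variable names invented here, and the function only builds settings, so it runs without the openai package installed.

```python
import os
from typing import Mapping, Optional


def backend_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return client kwargs and a model name for the chosen backend.

    LLM_BACKEND and CLOUD_MODEL are hypothetical names used only in this sketch.
    """
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "local") == "local":
        return {
            "client": {
                "base_url": "http://localhost:11434/v1",
                "api_key": "not-needed",  # required by the SDK, ignored by Ollama
            },
            "model": "llama3.1:8b",
        }
    # Cloud: leave base_url unset so the SDK uses its default endpoint.
    return {
        "client": {"api_key": env["OPENAI_API_KEY"]},
        "model": env["CLOUD_MODEL"],
    }
```

Usage would look like settings = backend_settings(), then OpenAI(**settings["client"]) and model=settings["model"] in the create() call.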

Model Selection Guide

Model         | Size   | Best For                    | Speed (M2 Pro)
Llama 3.1 8B  | 4.7 GB | General chat, summarization | ~40 tok/s
CodeLlama 13B | 7.4 GB | Code generation, review     | ~25 tok/s
Mistral 7B    | 4.1 GB | Fast general tasks          | ~45 tok/s
Qwen 2.5 14B  | 9.0 GB | Chinese + English bilingual | ~20 tok/s
Phi-3 Mini    | 2.2 GB | Quick tasks on limited RAM  | ~60 tok/s
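
To turn the speed column into something more tangible, the rough arithmetic is output tokens divided by tokens per second. The sketch below uses only the approximate figures from the table above and ignores model load time and prompt processing, so treat the results as lower bounds.

```python
# Approximate generation speeds from the table above (tokens/second, M2 Pro).
SPEED_TOK_PER_S = {
    "Llama 3.1 8B": 40,
    "CodeLlama 13B": 25,
    "Mistral 7B": 45,
    "Qwen 2.5 14B": 20,
    "Phi-3 Mini": 60,
}


def estimated_seconds(model: str, output_tokens: int) -> float:
    """Rough generation time: output tokens divided by tokens per second."""
    return output_tokens / SPEED_TOK_PER_S[model]


for name in SPEED_TOK_PER_S:
    print(f"{name:14s} ~{estimated_seconds(name, 500):.1f} s for a 500-token reply")
```

By this estimate a 500-token answer takes about 12.5 seconds on Llama 3.1 8B and about 25 seconds on Qwen 2.5 14B.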

Practical Use Cases

Local RAG Development

Test your RAG pipeline without API costs:

ollama pull nomic-embed-text  # Local embeddings
ollama pull llama3.1:8b       # Local generation
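
With both models pulled, the retrieval half of a pipeline can be sketched in a few functions. This assumes Ollama's native POST /api/embed endpoint on the default port and a brute-force cosine search over in-memory vectors. A real pipeline would likely use a vector store, and the function names here are illustrative.

```python
import json
import math
import urllib.request
from typing import List, Sequence

OLLAMA_URL = "http://localhost:11434"  # assumption: default Ollama address


def embed(texts: List[str], model: str = "nomic-embed-text") -> List[List[float]]:
    """Embed a batch of texts with a local Ollama server (POST /api/embed)."""
    body = json.dumps({"model": model, "input": texts}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embeddings"]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], k: int = 3) -> List[int]:
    """Return indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

A typical flow is to embed the document chunks once, embed each query, pick chunks with top_k, and paste them into the prompt sent to llama3.1:8b through the chat API shown earlier.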

Commit Message Generation

git diff --cached | ollama run llama3.1:8b "Write a concise commit message for this diff:"
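
The same idea as a script, for cases where the one-liner needs guarding against very large diffs. This is a sketch under assumptions: it calls Ollama's native POST /api/generate endpoint with streaming disabled, and the 12,000-character cap is an arbitrary illustrative limit, not a documented one.

```python
import json
import subprocess
import urllib.request

MAX_DIFF_CHARS = 12_000  # hypothetical cap to keep the prompt within the context window


def build_prompt(diff: str, max_chars: int = MAX_DIFF_CHARS) -> str:
    """Prefix the instruction and truncate very large diffs."""
    if len(diff) > max_chars:
        diff = diff[:max_chars] + "\n[diff truncated]"
    return "Write a concise commit message for this diff:\n\n" + diff


def generate(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send one prompt to a local Ollama server and return the full reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["response"].strip()


def staged_diff() -> str:
    """Return the staged changes, equivalent to `git diff --cached`."""
    result = subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Running print(generate(build_prompt(staged_diff()))) inside a repository with staged changes prints a suggested message. Review it before committing.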

Code Review in CI

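Run a local model in CI to catch obvious issues without API keys. The step below assumes the runner already has Ollama installed and serving, the model pulled, and origin/main fetched:

- name: AI Code Review
  run: |
    git diff origin/main | ollama run codellama:13b \
      "Review this diff for bugs, security issues, and style problems. Be concise."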

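A whole-branch diff can be too large for a local model to review in one pass. One possible workaround, sketched here with a hypothetical helper, is to split the `git diff` output into one chunk per file and review each chunk separately. The parsing relies on the standard "diff --git a/<path> b/<path>" header and does not handle unusual paths.

```python
from typing import Dict, List, Optional


def split_diff_by_file(diff: str) -> Dict[str, str]:
    """Split `git diff` output into one chunk per file, keyed by the new path."""
    chunks: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in diff.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # Header looks like: diff --git a/<path> b/<path>
            current = line.rstrip("\n").split(" b/", 1)[-1]
            chunks[current] = []
        if current is not None:
            chunks[current].append(line)
    return {path: "".join(lines) for path, lines in chunks.items()}
```

Each value can then be sent to codellama:13b with the same review prompt, which keeps every request small and ties findings to a single file.
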
Limitations

  • Quality gap with cloud models (Claude, GPT-4) is real for complex reasoning
  • 8B models struggle with multi-step logic
  • Tool use (function calling) works only with models trained for it and is less reliable than with cloud models
  • GPU recommended for anything above 7B parameters

Use local models for development iteration and privacy-sensitive tasks. Use cloud models for production quality.