Running Local LLMs with Ollama
Not everything needs to hit a cloud API. For development, testing, and privacy-sensitive tasks, running models locally makes sense.
Install Ollama
# Linux (official install script)
curl -fsSL https://ollama.com/install.sh | sh
# macOS (Homebrew)
brew install ollama
Pull and Run a Model
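# Start the service (runs in the foreground, so use a separate terminal;
# skip this if Ollama is already running as a background service)
ollama serve
# Pull models (one-time download each)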
ollama pull llama3.1:8b # 4.7 GB
ollama pull codellama:13b # 7.4 GB
ollama pull mistral:7b # 4.1 GB
ollama pull qwen2.5:14b # 9.0 GB
# Chat
ollama run llama3.1:8b
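Before scripting against the service, it can help to confirm that the models you expect are actually downloaded. The following is a minimal sketch, assuming the service listens on the default port 11434 and using Ollama's `/api/tags` endpoint, which lists local models. The helper names `installed_models` and `missing_models` are illustrative and not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default port; adjust if you changed it


def installed_models(base_url: str = OLLAMA_URL) -> set[str]:
    """Return the names of locally available models via GET /api/tags."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        payload = json.load(resp)
    return {m["name"] for m in payload.get("models", [])}


def missing_models(wanted: list[str], installed: set[str]) -> list[str]:
    """Return the wanted models that are not installed, preserving order."""
    return [name for name in wanted if name not in installed]


# Example (requires a running Ollama service):
#   for name in missing_models(["llama3.1:8b", "codellama:13b"], installed_models()):
#       print(f"Missing: run `ollama pull {name}`")
```

Keeping the comparison separate from the HTTP call makes the check easy to test without a running server.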
Use as an API
Ollama exposes an OpenAI-compatible API:
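from openai import OpenAI
# Point the SDK at the local Ollama server instead of the OpenAI cloud endpoint
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed",  # the SDK requires a value; Ollama ignores it
)
response = client.chat.completions.create(
    model="llama3.1:8b",  # must match a model you have already pulled
    messages=[
        {"role": "user", "content": "Explain CORS in 3 sentences"}
    ],
)
print(response.choices[0].message.content)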
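This means most code written against the OpenAI SDK's chat completions interface works with local models after changing two lines, base_url and api_key, and setting the model name to one you have pulled.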
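One way to keep that switch out of application code is to choose the client arguments from the environment. This is a sketch under stated assumptions: `LLM_BACKEND` is a hypothetical variable name invented for this example, and `client_settings` is an illustrative helper, not part of the OpenAI SDK or Ollama.

```python
import os


def client_settings(env=None) -> dict:
    """Return keyword arguments for OpenAI(...) for a local or cloud backend.

    LLM_BACKEND is a hypothetical variable name used only in this sketch;
    any value other than "local" is treated as the cloud backend.
    """
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "local") == "local":
        return {
            "base_url": "http://localhost:11434/v1",
            "api_key": "not-needed",  # required by the SDK, ignored by Ollama
        }
    # Cloud: keep the SDK's default base URL and pass a real key
    return {"api_key": env["OPENAI_API_KEY"]}


# Usage (with the openai package installed):
#   from openai import OpenAI
#   client = OpenAI(**client_settings())
```

The model name still has to match the backend, so it is worth reading it from configuration in the same way.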
Model Selection Guide
| Model | Download Size | Best For | Approx. Speed (M2 Pro) |
|---|---|---|---|
| Llama 3.1 8B | 4.7 GB | General chat, summarization | ~40 tok/s |
| CodeLlama 13B | 7.4 GB | Code generation, review | ~25 tok/s |
| Mistral 7B | 4.1 GB | Fast general tasks | ~45 tok/s |
| Qwen 2.5 14B | 9.0 GB | Chinese + English bilingual | ~20 tok/s |
| Phi-3 Mini | 2.2 GB | Quick tasks on limited RAM | ~60 tok/s |
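The table can also be turned into a rough selection helper. The sketch below picks the largest listed model that fits in a given amount of free memory. The sizes are copied from the table; the 1.5x headroom factor is an assumed rule of thumb for context and runtime overhead, not a figure from Ollama, and the function name is illustrative.

```python
# Download sizes in GB, copied from the table above.
MODEL_SIZES_GB = {
    "phi3:mini": 2.2,
    "mistral:7b": 4.1,
    "llama3.1:8b": 4.7,
    "codellama:13b": 7.4,
    "qwen2.5:14b": 9.0,
}


def largest_model_that_fits(free_gb: float, headroom: float = 1.5):
    """Return the largest model whose size * headroom fits in free_gb, or None.

    headroom is an assumed safety factor, not a measured requirement.
    """
    candidates = [
        (size, name)
        for name, size in MODEL_SIZES_GB.items()
        if size * headroom <= free_gb
    ]
    return max(candidates)[1] if candidates else None


print(largest_model_that_fits(8.0))
```

This only ranks by size. Task fit, as described in the "Best For" column, should still drive the final choice.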
Practical Use Cases
Local RAG Development
Test your RAG pipeline without API costs:
ollama pull nomic-embed-text # Local embeddings
ollama pull llama3.1:8b # Local generation
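With both models pulled, the retrieval step of a RAG pipeline can be prototyped in a few lines. This is a minimal sketch, not a full pipeline: it assumes `client` is an OpenAI SDK client pointed at `http://localhost:11434/v1` and relies on Ollama serving embeddings through its OpenAI-compatible endpoint. The function names are illustrative, and the in-memory ranking stands in for a real vector store.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k: int = 2) -> list[int]:
    """Return indices of the k documents most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def embed(texts, client, model: str = "nomic-embed-text"):
    """Embed a list of strings using the locally served embedding model."""
    result = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in result.data]


# Usage (requires a running Ollama service and an OpenAI SDK client):
#   doc_vecs = embed(documents, client)
#   query_vec = embed([question], client)[0]
#   context = [documents[i] for i in top_k(query_vec, doc_vecs)]
```

The retrieved `context` can then be placed in the prompt sent to `llama3.1:8b` for generation.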
Commit Message Generation
git diff --cached | ollama run llama3.1:8b "Write a concise commit message for this diff:"
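The same idea can be scripted, which makes it easier to cap the diff size before it reaches a small model. The sketch below uses Ollama's native `/api/generate` endpoint with streaming disabled. The 8000-character limit is an assumed budget rather than a documented threshold, and the function names are illustrative.

```python
import json
import subprocess
import urllib.request

MAX_DIFF_CHARS = 8000  # assumed budget; tune for your model's context window


def build_prompt(diff: str, limit: int = MAX_DIFF_CHARS) -> str:
    """Prefix the instruction and truncate very large diffs."""
    if len(diff) > limit:
        diff = diff[:limit] + "\n[diff truncated]"
    return "Write a concise commit message for this diff:\n\n" + diff


def staged_diff() -> str:
    """Return the staged changes, as `git diff --cached` prints them."""
    result = subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True, check=True
    )
    return result.stdout


def generate(prompt: str, model: str = "llama3.1:8b",
             base_url: str = "http://localhost:11434") -> str:
    """Send one non-streaming request to POST /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)["response"].strip()


# Usage (inside a git repository, with Ollama running):
#   print(generate(build_prompt(staged_diff())))
```

Treat the output as a draft: review it before committing.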
Code Review in CI
Run a local model in CI to catch obvious issues without API keys. The runner needs Ollama installed, the service running, and the model pulled before this step:
- name: AI Code Review
  run: |
    git diff origin/main | ollama run codellama:13b \
      "Review this diff for bugs, security issues, and style problems. Be concise."
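Large diffs can overwhelm a 13B model, so one option is to review each file separately. The following sketch splits `git diff` output into per-file chunks that can each be sent as their own prompt. `split_diff_by_file` is an illustrative helper, and its header parsing is a simplification that assumes paths do not themselves contain " b/".

```python
def split_diff_by_file(diff: str) -> dict[str, str]:
    """Split `git diff` output into per-file chunks keyed by the new path."""
    chunks: dict[str, str] = {}
    current = None
    for line in diff.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # Header looks like: diff --git a/path b/path
            current = line.rstrip("\n").split(" b/", 1)[-1]
            chunks[current] = ""
        if current is not None:
            chunks[current] += line
    return chunks


# Usage: pipe each chunk to the model separately, for example
#   for path, chunk in split_diff_by_file(diff_text).items():
#       subprocess.run(["ollama", "run", "codellama:13b", "Review this diff:"],
#                      input=chunk, text=True)
```

Per-file review loses cross-file context, which is an acceptable trade-off when the goal is only to catch obvious issues.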
Limitations
- Quality gap with cloud models (Claude, GPT-4) is real for complex reasoning
- 8B models struggle with multi-step logic
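- Tool use is limited: only some local models support it (Llama 3.1 and Qwen 2.5 do, CodeLlama does not), and it is less reliable than with cloud models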
- GPU recommended for anything above 7B parameters
Use local models for development iteration and privacy-sensitive tasks. Use cloud models for production quality.