LLM Evaluation Beyond Vibes
We have all been there: you tweak a prompt, run it three times, and declare it “better” based on gut feeling. Here is how to do better.
The Problem with Vibes
Manual spot-checking fails because:
- LLM outputs are non-deterministic
- You remember the good examples, forget the bad ones
- Prompts that work for your test case break on edge cases
- You cannot track improvements over time
Build an Eval Dataset
Start with 20-50 examples. Each one pairs an input with the expected answer and a few tags:
{
  "input": "What is the capital of France?",
  "expected": "Paris",
  "tags": ["geography", "factual"]
}
For open-ended tasks, use rubrics instead of exact matches:
{
  "input": "Explain TCP vs UDP",
  "rubric": [
    "Mentions TCP is connection-oriented",
    "Mentions UDP is connectionless",
    "Gives a practical use case for each",
    "Does not contain factual errors"
  ]
}
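Before scoring anything, it helps to check that every example has the shape shown above. Here is a minimal sketch, assuming the examples live in a single JSON file as a top-level list; the `load_dataset` name and the "exactly one of expected or rubric" rule are illustrative choices, not something the formats above require.

```python
import json
from pathlib import Path


def load_dataset(path: str) -> list[dict]:
    """Load eval examples from a JSON file and check their shape."""
    examples = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(examples, list):
        raise ValueError("expected a top-level JSON list of examples")
    for i, example in enumerate(examples):
        if not isinstance(example, dict):
            raise ValueError(f"example {i}: expected a JSON object")
        if not isinstance(example.get("input"), str):
            raise ValueError(f"example {i}: missing string 'input'")
        # Assumption: an example is either exact-match or rubric-based, never both.
        has_expected = isinstance(example.get("expected"), str)
        has_rubric = isinstance(example.get("rubric"), list) and len(example["rubric"]) > 0
        if has_expected == has_rubric:
            raise ValueError(f"example {i}: need exactly one of 'expected' or 'rubric'")
    return examples
```

Failing loudly at load time is cheaper than discovering a malformed example halfway through a paid eval run.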
Automated Scoring
Exact Match (for factual questions)
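def score_exact(response: str, expected: str) -> float:
    # Case-insensitive substring check: passes if the expected answer appears anywhere in the response.
    return 1.0 if expected.lower() in response.lower() else 0.0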
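Note that the check above is a substring test, so an expected answer of "no" would also match "I do not know". If that is too loose for your data, two stricter variants are sketched below. The names `normalize`, `score_exact_strict`, and `score_contains_phrase` are hypothetical, and the normalization (lowercase, strip ASCII punctuation, collapse whitespace) is one reasonable choice among many.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def score_exact_strict(response: str, expected: str) -> float:
    """1.0 only if the normalized response equals the normalized expected answer."""
    return 1.0 if normalize(response) == normalize(expected) else 0.0


def score_contains_phrase(response: str, expected: str) -> float:
    """1.0 if the expected answer appears as whole words in the response."""
    pattern = r"\b" + re.escape(normalize(expected)) + r"\b"
    return 1.0 if re.search(pattern, normalize(response)) else 0.0
```

The strict version suits prompts that ask for a bare answer; the whole-word version tolerates a full-sentence reply while still rejecting accidental substring hits.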
LLM-as-Judge (for open-ended)
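import json

def score_with_llm(response: str, rubric: list[str]) -> float:
    # Build the criteria list outside the f-string to keep the prompt readable.
    criteria = "\n".join(f"- {r}" for r in rubric)
    judge_prompt = f"""Score this response against each criterion.
Response: {response}
Criteria:
{criteria}
Return a JSON object with "scores" (list of 0 or 1) and "reasoning"."""
    # call_llm is assumed to return the judge's reply as text, so parse it as JSON.
    result = json.loads(call_llm(judge_prompt))
    return sum(result["scores"]) / len(result["scores"])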
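Judge models do not always reply with bare JSON; the object may arrive wrapped in prose or a code fence, or with the wrong number of scores. The helper below is a rough sketch of defensive parsing, assuming the judge reply is a plain string. `parse_judge_scores` is a hypothetical name, and taking the text between the first `{` and the last `}` is a heuristic rather than a real parser.

```python
import json


def parse_judge_scores(raw: str, n_criteria: int) -> list[int]:
    """Extract the 'scores' list from a judge reply, or raise ValueError."""
    # Heuristic: treat everything from the first "{" to the last "}" as the JSON object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge reply")
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    scores = data.get("scores")
    if not isinstance(scores, list) or len(scores) != n_criteria:
        raise ValueError("expected one score per rubric criterion")
    if any(s not in (0, 1) for s in scores):
        raise ValueError("each score must be 0 or 1")
    return [int(s) for s in scores]
```

Raising on a malformed reply, rather than silently scoring it as zero, keeps judge failures from being mistaken for bad responses.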
Semantic Similarity
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def score_similarity(response: str, reference: str) -> float:
    # Cosine similarity between the two sentence embeddings.
    emb = model.encode([response, reference])
    return float(np.dot(emb[0], emb[1]) /
                 (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1])))
Run Evals Systematically
def run_eval(dataset, prompt_template, model="claude-sonnet-4-5-20250929"):
    results = []
    for example in dataset:
        # call_llm is assumed to accept a model argument.
        response = call_llm(prompt_template.format(**example), model=model)
        # score() is whichever scoring function fits the example; the result
        # gets its own name so it does not shadow that function.
        example_score = score(response, example)
        results.append({
            "input": example["input"],
            "response": response,
            "score": example_score
        })
    avg_score = sum(r["score"] for r in results) / len(results)
    failures = sum(1 for r in results if r["score"] < 0.5)
    print(f"Average score: {avg_score:.2%}")
    print(f"Failures: {failures}/{len(results)}")
    return results
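The loop calls a `score(response, example)` function that is not defined above. One way to fill that gap is a small dispatcher that picks a scorer based on which fields the example carries. This is only a sketch: `make_scorer` is a hypothetical helper, and it assumes rubric examples go to an LLM judge and everything else to an exact-match scorer.

```python
from typing import Callable


def make_scorer(
    exact: Callable[[str, str], float],
    judge: Callable[[str, list[str]], float],
) -> Callable[[str, dict], float]:
    """Build a score(response, example) function that picks a scorer per example."""

    def score(response: str, example: dict) -> float:
        # Rubric examples go to the judge; examples with an expected answer use exact match.
        if "rubric" in example:
            return judge(response, example["rubric"])
        if "expected" in example:
            return exact(response, example["expected"])
        raise KeyError("example needs an 'expected' or 'rubric' field")

    return score
```

With the scorers from earlier in this post, `score = make_scorer(score_exact, score_with_llm)` would give the loop what it needs. Passing the scorers in as arguments also lets you swap in stubs when testing the harness itself.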
Track Over Time
Save results to a file or database:
Date       | Prompt Version | Model      | Score | N
2026-01-10 | v1             | sonnet-4.5 | 72%   | 50
2026-01-12 | v2 (few-shot)  | sonnet-4.5 | 85%   | 50
2026-01-14 | v2             | opus-4.6   | 91%   | 50
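A plain CSV file is enough to keep a log like the one above. The sketch below appends one summary row per run, assuming `results` is a list of dicts with a numeric `"score"` key, as returned by the eval loop. The function name `log_run` and the column names are illustrative.

```python
import csv
from datetime import date
from pathlib import Path


def log_run(path: str, prompt_version: str, model: str, results: list[dict]) -> dict:
    """Append one summary row for an eval run to a CSV file."""
    if not results:
        raise ValueError("cannot log an empty eval run")
    row = {
        "date": date.today().isoformat(),
        "prompt_version": prompt_version,
        "model": model,
        "score": round(sum(r["score"] for r in results) / len(results), 4),
        "n": len(results),
    }
    log_path = Path(path)
    # Write the header only when the file is new or empty.
    write_header = not log_path.exists() or log_path.stat().st_size == 0
    with log_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Because the file is append-only, it stays diffable in version control and opens directly in a spreadsheet when you want a chart.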
Now you can answer “is this prompt better?” with data, not vibes.
Minimum Viable Eval
You do not need a framework. A script with:
- A JSON file of test cases
- A scoring function
- A loop that runs them all and prints aggregate scores
is enough to start. Add complexity when you need it.