Dev Notes

LLM Evaluation Beyond Vibes

We have all been there: you tweak a prompt, run it three times, and declare it “better” based on gut feeling. Here is how to do better.

The Problem with Vibes

Manual spot-checking fails because:

  • LLM outputs are non-deterministic
  • You remember the good examples, forget the bad ones
  • Prompts that work for your test case break on edge cases
  • You cannot track improvements over time

Build an Eval Dataset

Start with 20-50 examples. Each needs an input, an expected output, and optional tags:

{
  "input": "What is the capital of France?",
  "expected": "Paris",
  "tags": ["geography", "factual"]
}

For open-ended tasks, use rubrics instead of exact matches:

{
  "input": "Explain TCP vs UDP",
  "rubric": [
    "Mentions TCP is connection-oriented",
    "Mentions UDP is connectionless",
    "Gives a practical use case for each",
    "Does not contain factual errors"
  ]
}
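
As a minimal sketch, a dataset in this shape could be loaded and checked before any model calls are made. The file layout (a JSON array of examples) and the helper names `validate_example` and `load_dataset` are assumptions for illustration, not something these notes specify.

```python
import json

def validate_example(example: dict) -> list[str]:
    """Return a list of problems found in one example (empty if it is valid)."""
    problems = []
    if not isinstance(example.get("input"), str) or not example["input"].strip():
        problems.append("missing or empty 'input'")
    has_expected = isinstance(example.get("expected"), str)
    rubric = example.get("rubric")
    has_rubric = isinstance(rubric, list) and len(rubric) > 0
    if not (has_expected or has_rubric):
        problems.append("needs either 'expected' or a non-empty 'rubric'")
    return problems

def load_dataset(path: str) -> list[dict]:
    """Load a JSON array of examples and fail fast on malformed entries."""
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)
    for i, example in enumerate(dataset):
        problems = validate_example(example)
        if problems:
            raise ValueError(f"example {i}: {'; '.join(problems)}")
    return dataset
```

Failing at load time is cheaper than discovering a malformed example halfway through a paid eval run.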

Automated Scoring

Exact Match (for factual questions)

def score_exact(response: str, expected: str) -> float:
    # Case-insensitive substring check, so "The capital is Paris." passes.
    return 1.0 if expected.lower() in response.lower() else 0.0
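
The check above is a substring match, which is lenient: it would also accept "not Paris". Where a stricter comparison is wanted, one option is to normalize both strings and compare them for equality. This is a sketch; the `normalize` helper and the name `score_exact_strict` are hypothetical.

```python
import string

def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def score_exact_strict(response: str, expected: str) -> float:
    """Return 1.0 only if the whole response equals the expected answer after normalization."""
    return 1.0 if normalize(response) == normalize(expected) else 0.0
```

The strict variant suits prompts that instruct the model to reply with the answer only; for free-form replies the lenient substring check is usually the more practical choice.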

LLM-as-Judge (for open-ended)

import json

def score_with_llm(response: str, rubric: list[str]) -> float:
    judge_prompt = f"""Score this response against each criterion.
Response: {response}

Criteria:
{chr(10).join(f"- {r}" for r in rubric)}

Return a JSON object with "scores" (list of 0 or 1) and "reasoning"."""

    # call_llm is your own wrapper; it returns the reply as text, so parse it.
    result = json.loads(call_llm(judge_prompt))
    # Fraction of rubric criteria the judge marked as satisfied.
    return sum(result["scores"]) / len(result["scores"])
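
Judge models do not always return clean JSON: the reply may be wrapped in prose or contain the wrong number of scores. A defensive parser, sketched below, can catch these cases before they skew the average. The function name `parse_judge_scores` and the choice to raise `ValueError` are assumptions for illustration.

```python
import json

def parse_judge_scores(raw: str, n_criteria: int) -> list[int]:
    """Extract and validate the "scores" list from a judge reply."""
    # Take the outermost {...} span so surrounding prose is ignored.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(raw[start:end + 1])
    scores = data.get("scores")
    if not isinstance(scores, list) or len(scores) != n_criteria:
        raise ValueError(f"expected {n_criteria} scores, got {scores!r}")
    if any(s not in (0, 1) for s in scores):
        raise ValueError(f"scores must be 0 or 1, got {scores!r}")
    return scores
```

A rejected reply can then be retried or logged as a judge failure rather than silently counted as a score.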

Semantic Similarity

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def score_similarity(response: str, reference: str) -> float:
    # Embed both texts, then return their cosine similarity (range -1 to 1).
    emb = model.encode([response, reference])
    return float(np.dot(emb[0], emb[1]) /
                 (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1])))

Run Evals Systematically

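def run_eval(dataset, prompt_template, model="claude-sonnet-4-5-20250929"):
    results = []
    for example in dataset:
        # Assumes call_llm accepts a model keyword; adapt to your own wrapper.
        response = call_llm(prompt_template.format(**example), model=model)
        # score() is assumed to dispatch to one of the scorers above.
        # A distinct variable name avoids shadowing the score() function.
        example_score = score(response, example)
        results.append({
            "input": example["input"],
            "response": response,
            "score": example_score
        })

    avg_score = sum(r["score"] for r in results) / len(results)
    failures = sum(1 for r in results if r["score"] < 0.5)
    print(f"Average score: {avg_score:.2%}")
    print(f"Failures: {failures}/{len(results)}")
    return results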
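
`run_eval` calls a `score` function that these notes do not define. One plausible shape, sketched here as an assumption, is a small dispatcher that chooses a scorer from the fields present on each example. To keep the sketch self-contained, the scorers are passed in as arguments; `make_scorer`, `contains`, and `fake_judge` are hypothetical names.

```python
from typing import Callable

def make_scorer(
    exact: Callable[[str, str], float],
    judge: Callable[[str, list[str]], float],
) -> Callable[[str, dict], float]:
    """Build a score(response, example) function that dispatches on example fields."""
    def score(response: str, example: dict) -> float:
        if "rubric" in example:
            return judge(response, example["rubric"])
        if "expected" in example:
            return exact(response, example["expected"])
        raise KeyError("example has neither 'rubric' nor 'expected'")
    return score

# Stand-in scorers so the sketch runs without a model call.
def contains(response: str, expected: str) -> float:
    return 1.0 if expected.lower() in response.lower() else 0.0

def fake_judge(response: str, rubric: list[str]) -> float:
    # Hypothetical placeholder: pretends every criterion passed.
    return 1.0

score = make_scorer(contains, fake_judge)
print(score("It is Paris.", {"input": "q", "expected": "Paris"}))  # prints 1.0
```

In a real run, the exact-match and LLM-as-judge scorers would be passed in place of the stand-ins.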

Track Over Time

Save a summary of each run to a file or database:

Date       | Prompt Version | Model      | Score | N
2026-01-10 | v1             | sonnet-4.5 | 72%   | 50
2026-01-12 | v2 (few-shot)  | sonnet-4.5 | 85%   | 50
2026-01-14 | v2             | opus-4.6   | 91%   | 50
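
One low-effort way to keep such a log is to append a row per run to a CSV file. The sketch below uses only the standard library; the column names and the function `log_run` are assumptions chosen to mirror the table above.

```python
import csv
import os
from datetime import date

FIELDS = ["date", "prompt_version", "model", "score", "n"]

def log_run(path: str, prompt_version: str, model: str, scores: list[float]) -> dict:
    """Append one summary row for a run and return the row that was written."""
    row = {
        "date": date.today().isoformat(),
        "prompt_version": prompt_version,
        "model": model,
        "score": round(sum(scores) / len(scores), 4),
        "n": len(scores),
    }
    # Write the header only when the file is new or empty.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Appending rather than overwriting keeps the history intact, so a later regression can be traced back to the prompt version that introduced it.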

Now you can answer “is this prompt better?” with data, not vibes.
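
With around 50 examples, a gap of a few points between two runs can be noise. One way to check, offered here as an optional extra rather than something these notes prescribe, is a paired bootstrap over per-example scores from two prompt versions run on the same dataset. The function name, the 2,000 resamples, and the 95% interval are illustrative assumptions.

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_b) - mean(scores_a), paired by example."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs must cover the same examples")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample examples with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[round((alpha / 2) * n_resamples)]
    hi = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the returned interval excludes zero, the difference between the two versions is less likely to be a sampling artifact.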

Minimum Viable Eval

You do not need a framework. A script with:

  1. A JSON file of test cases
  2. A scoring function
  3. A loop that runs them all and prints aggregate scores

is enough to start. Add complexity when you need it.
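
Put together, those three pieces fit in one short script. The sketch below is one possible arrangement, not a reference implementation: `fake_model` is a hypothetical stand-in for a real model call, and the test cases are inlined so the script runs as-is (in practice they would be read from the JSON file).

```python
import json

# In practice this would be the contents of a JSON file of test cases.
CASES_JSON = """[
  {"input": "What is the capital of France?", "expected": "Paris"},
  {"input": "What is the capital of Japan?", "expected": "Tokyo"}
]"""

def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; answers only one question."""
    return "The capital is Paris." if "France" in prompt else "I am not sure."

def score_contains(response: str, expected: str) -> float:
    """Case-insensitive substring match."""
    return 1.0 if expected.lower() in response.lower() else 0.0

def main() -> list[float]:
    cases = json.loads(CASES_JSON)
    scores = [score_contains(fake_model(c["input"]), c["expected"]) for c in cases]
    print(f"Average score: {sum(scores) / len(scores):.2%}")
    print(f"Failures: {sum(1 for s in scores if s < 0.5)}/{len(scores)}")
    return scores

if __name__ == "__main__":
    main()
```

With the stand-in model this prints an average of 50.00% and 1/2 failures; swapping `fake_model` for a real call is the only change needed to start measuring an actual prompt.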