LLM Evaluation Beyond Vibes

Jan 15, 2026

We have all been there: you tweak a prompt, run it three times, and declare it “better” based on gut feeling. Here is how to do better.

The Problem with Vibes

Manual spot-checking fails because:

LLM outputs are non-deterministic
You remember the good examples, forget the bad ones
Prompts that work for your test case break on edge cases
You cannot track improvements over time

Build an Eval Dataset

Start with 20 to 50 examples. Each needs an input, an expected answer, and optional tags for slicing results later:

{
  "input": "What is the capital of France?",
  "expected": "Paris",
  "tags": ["geography", "factual"]
}

For open-ended tasks, use rubrics instead of exact matches:

{
  "input": "Explain TCP vs UDP",
  "rubric": [
    "Mentions TCP is connection-oriented",
    "Mentions UDP is connectionless",
    "Gives a practical use case for each",
    "Does not contain factual errors"
  ]
}
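A dataset in either shape is easy to get subtly wrong by hand, so it can help to validate it at load time. The sketch below assumes the examples live in a single JSON array on disk; `validate_example` and `load_dataset` are hypothetical helper names, not part of any library.

```python
import json
from pathlib import Path


def validate_example(example: dict) -> list[str]:
    """Return a list of problems found in one example (empty if it is valid)."""
    problems = []
    if not isinstance(example.get("input"), str) or not example["input"].strip():
        problems.append("missing or empty 'input'")
    has_expected = isinstance(example.get("expected"), str)
    rubric = example.get("rubric")
    has_rubric = isinstance(rubric, list) and len(rubric) > 0
    if not (has_expected or has_rubric):
        problems.append("needs either 'expected' or a non-empty 'rubric'")
    return problems


def load_dataset(path: str) -> list[dict]:
    """Load a JSON array of examples and fail loudly on malformed entries."""
    examples = json.loads(Path(path).read_text(encoding="utf-8"))
    for i, example in enumerate(examples):
        problems = validate_example(example)
        if problems:
            raise ValueError(f"example {i}: {'; '.join(problems)}")
    return examples
```

Failing at load time keeps a typo in one test case from silently skewing the aggregate score.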

Automated Scoring

Exact Match (for factual questions)

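def score_exact(response: str, expected: str) -> float:
    # Case-insensitive containment check: passes when the expected answer
    # appears anywhere in the response, not only on a strict equality match.
    return 1.0 if expected.lower() in response.lower() else 0.0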
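The check above is lenient: a response like "It is not Paris" still passes. If that matters for your task, a stricter variant might compare normalized strings for equality. This is only a sketch; `normalize` and `score_exact_strict` are hypothetical names, and the normalization rules (lowercase, strip ASCII punctuation, collapse whitespace) are an assumption you would tune.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def score_exact_strict(response: str, expected: str) -> float:
    """1.0 only when the normalized response equals the normalized answer."""
    return 1.0 if normalize(response) == normalize(expected) else 0.0
```

The strict version only works if the prompt asks the model to answer with the bare value, so pick whichever scorer matches how your prompt is written.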

LLM-as-Judge (for open-ended)

import json

def score_with_llm(response: str, rubric: list[str]) -> float:
    judge_prompt = f"""Score this response against each criterion.
Response: {response}

Criteria:
{chr(10).join(f"- {r}" for r in rubric)}

Return a JSON object with "scores" (list of 0 or 1) and "reasoning"."""

    # call_llm returns text, so parse the judge's JSON before reading scores.
    result = json.loads(call_llm(judge_prompt))
    return sum(result["scores"]) / len(result["scores"])
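Judge models do not always return clean JSON; they may wrap it in a sentence or return the wrong number of scores. A defensive parser is one way to handle that. The sketch below assumes the judge reply is plain text containing one JSON object; `parse_judge_output` is a hypothetical helper.

```python
import json


def parse_judge_output(raw: str, n_criteria: int) -> float:
    """Extract the JSON object from a judge reply and return the mean score."""
    # Take everything from the first "{" to the last "}" to skip any preamble.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge output")
    result = json.loads(raw[start:end + 1])
    scores = result["scores"]
    # One 0/1 score per rubric criterion, otherwise the average is meaningless.
    if len(scores) != n_criteria or any(s not in (0, 1) for s in scores):
        raise ValueError(f"unexpected scores: {scores!r}")
    return sum(scores) / len(scores)
```

Raising on malformed output, instead of scoring it as 0, keeps judge failures from being mistaken for bad responses.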

Semantic Similarity

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def score_similarity(response: str, reference: str) -> float:
    emb = model.encode([response, reference])
    return float(np.dot(emb[0], emb[1]) /
                 (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1])))
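The expression above is cosine similarity: the dot product of the two embeddings divided by the product of their lengths. To see what the number means without loading a model, here is the same formula on toy vectors in plain Python; `cosine_similarity` is a hypothetical helper for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Same direction gives 1, orthogonal vectors give 0, and opposite directions give -1, regardless of vector length. Because the result is not a pass rate, you would still need to pick a threshold (or compare scores between prompt versions) to turn it into a pass or fail.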

Run Evals Systematically

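def run_eval(dataset, prompt_template, model="claude-sonnet-4-5-20250929"):
    results = []
    for example in dataset:
        # Assumes call_llm accepts a model keyword argument.
        response = call_llm(prompt_template.format(**example), model=model)
        # Rubric examples go to the LLM judge; the rest use the match scorer.
        if "rubric" in example:
            score = score_with_llm(response, example["rubric"])
        else:
            score = score_exact(response, example["expected"])
        results.append({
            "input": example["input"],
            "response": response,
            "score": score
        })

    avg_score = sum(r["score"] for r in results) / len(results)
    failures = sum(1 for r in results if r["score"] < 0.5)
    print(f"Average score: {avg_score:.2%}")
    print(f"Failures: {failures}/{len(results)}")
    return results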
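A single average can hide a category that regressed. Since each example carries tags, one option is to break the scores down per tag. This sketch assumes the results list is in the same order as the dataset, as it is in the loop above; `scores_by_tag` is a hypothetical helper.

```python
from collections import defaultdict


def scores_by_tag(dataset: list[dict], results: list[dict]) -> dict[str, float]:
    """Average score per tag; dataset and results must be in the same order."""
    buckets = defaultdict(list)
    for example, result in zip(dataset, results):
        # Examples without tags are grouped under "untagged".
        for tag in example.get("tags", ["untagged"]):
            buckets[tag].append(result["score"])
    return {tag: sum(s) / len(s) for tag, s in buckets.items()}
```

An example with several tags counts toward each of them, so the per-tag averages will not sum back to the overall average.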

Track Over Time

Save results to a file or database:

Date       | Prompt Version | Model      | Score | N
2026-01-10 | v1             | sonnet-4.5 | 72%   | 50
2026-01-12 | v2 (few-shot)  | sonnet-4.5 | 85%   | 50
2026-01-14 | v2             | opus-4.6   | 91%   | 50

Now you can answer “is this prompt better?” with data, not vibes.
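If a file is enough, appending one summary row per run to a CSV gets you a log like the table above. This is a minimal sketch; `log_run`, the column names, and the file path are all hypothetical choices.

```python
import csv
from datetime import date
from pathlib import Path

FIELDS = ["date", "prompt_version", "model", "score", "n"]


def log_run(path: str, prompt_version: str, model: str, scores: list[float]) -> dict:
    """Append one summary row to a CSV log, writing the header on first use."""
    row = {
        "date": date.today().isoformat(),
        "prompt_version": prompt_version,
        "model": model,
        "score": round(sum(scores) / len(scores), 4),
        "n": len(scores),
    }
    log_file = Path(path)
    write_header = not log_file.exists() or log_file.stat().st_size == 0
    with log_file.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Calling `log_run("runs.csv", "v2 (few-shot)", "sonnet-4.5", [r["score"] for r in results])` after each eval builds the history one row at a time.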

Minimum Viable Eval

You do not need a framework. A script with:

A JSON file of test cases
A scoring function
A loop that runs them all and prints aggregate scores

is enough to start. Add complexity when you need it.
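Put together, those three pieces might look like the sketch below. To keep it runnable offline, the test cases are inlined instead of read from a file, and `fake_model` is a hypothetical stand-in for a real model call; swap in your own client.

```python
import json
from typing import Callable


def score_contains(response: str, expected: str) -> float:
    """Containment check, same idea as the exact-match scorer earlier."""
    return 1.0 if expected.lower() in response.lower() else 0.0


def run_minimal_eval(cases: list[dict], generate: Callable[[str], str]) -> float:
    """Run every case through `generate` and print the aggregate score."""
    scores = [score_contains(generate(c["input"]), c["expected"]) for c in cases]
    average = sum(scores) / len(scores)
    print(f"Average score: {average:.2%} over {len(scores)} cases")
    return average


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call so the script runs offline."""
    return "The capital of France is Paris." if "France" in prompt else "I am not sure."


if __name__ == "__main__":
    # In practice these would be loaded from a JSON file of test cases.
    cases = json.loads("""[
        {"input": "What is the capital of France?", "expected": "Paris"},
        {"input": "What is the capital of Japan?", "expected": "Tokyo"}
    ]""")
    run_minimal_eval(cases, fake_model)
```

Passing the model in as a function keeps the loop testable and makes it trivial to compare two prompts or two models with the same cases.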