I Built an LLM Eval Harness From Scratch. Here Is What That Cost Me and What I'd Do Differently.

The honest version of "why I built my own eval harness instead of using an off-the-shelf tool" is not ideological. I did not build it because I think NIH is virtuous or because I distrust existing tools. I built it because the abstraction the existing tools offered was wrong for the specific thing I needed to measure.

That distinction matters. A lot of "I built my own X" posts are secretly "I did not research what existed." This is not that. I evaluated Promptfoo, DeepEval, and an early version of Ragas before writing a line of code. The problem I hit: all three organized evals around the model call as the unit of measurement. Score a prompt/response pair. Run that across a test set. Report accuracy, faithfulness, relevance.

My system was not a model. It was a pipeline -- ingest, preprocess, retrieve, rerank, generate, post-process -- and the failure modes I was seeing were not in the generation step. They were in retrieval. A changed chunk boundary silently degraded answer quality three steps later. A new embedding model improved cosine similarity scores while producing worse end-to-end answers. None of the off-the-shelf tools had a first-class concept of "evaluate the system at the output boundary while tracing degradation to the step that caused it."

So I wrote the harness. It is about 400 lines of Python. This is what it does, where it still has rough edges, and what I would do differently if I were starting today.

What the harness actually measures

The harness evaluates pipeline outputs against a versioned golden dataset. A golden example is a tuple of (input query, expected output, optional metadata about the expected retrieval context). The key design decision: I store expected retrieval context alongside expected answers. This lets the harness check not just "was the answer correct" but "was the answer correct for the right reasons."

That second check catches a class of failure that output-only eval misses entirely: the pipeline generates a correct answer via hallucination or a lucky retrieval hit. If your golden dataset only has expected outputs, a correct answer is a correct answer. If it also has expected context, you can detect when the answer is right but the retrieved context is wrong -- which matters because the same retrieval miss will turn into a wrong answer on next week's slightly different query.

Golden examples are YAML files versioned in Git alongside the code. Each has a schema version field:

schema_version: "1"
query: "What is the data retention policy for OCI Object Storage audit logs?"
expected_answer_contains:
  - "90 days"
  - "audit log"
expected_context_source_ids:
  - "oci-security-guide::section-14"
  - "oci-logging-docs::section-3"
metadata:
  domain: "oci-security"
  difficulty: "medium"
  added: "2026-01-15"
  added_by: "mohak"

The expected_answer_contains check is a soft match, not exact string equality. I use it to specify necessary conditions rather than exact outputs -- the answer must contain "90 days" and must contain "audit log" without specifying the exact phrasing. Exact string matching on LLM outputs is a losing game. Necessary-condition matching is stricter than embedding similarity alone and flexible enough to survive prompt wording changes.
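A minimal sketch of what such a necessary-condition check could look like. It assumes "soft match" means a case-insensitive substring test after collapsing whitespace; the function name and the normalization are illustrative, not the harness's actual code.

```python
def missing_required_phrases(answer: str, required: list[str]) -> list[str]:
    """Return the required phrases that do not appear in the answer.

    Assumption: matching is a case-insensitive substring test on
    whitespace-normalized text. A real harness may normalize differently.
    """
    normalized = " ".join(answer.lower().split())
    return [
        phrase
        for phrase in required
        if " ".join(phrase.lower().split()) not in normalized
    ]


answer = "Audit logs for Object Storage are retained for 90 days by default."
print(missing_required_phrases(answer, ["90 days", "audit log"]))   # []
print(missing_required_phrases(answer, ["180 days", "audit log"]))  # ['180 days']
```

An empty list means every necessary condition held. Plain substring matching has an obvious limit: "90 days" also matches inside "190 days", so numeric conditions may need word-boundary handling.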

The expected_context_source_ids check runs against the retrieval layer's output, before generation. A test can pass this check and fail the answer check (good retrieval, bad generation), or fail this check and pass the answer check (bad retrieval, good output -- the suspicious case).
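The four combinations of the two checks can be made explicit. The sketch below assumes retrieval "passes" only when every expected source id appears in the retrieved set; the enum names and the extra source id in the usage example are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "good retrieval, good answer"
    GENERATION_FAILURE = "good retrieval, bad answer"
    SUSPICIOUS_PASS = "bad retrieval, good answer"
    RETRIEVAL_FAILURE = "bad retrieval, bad answer"


def classify(expected_ids: set[str], retrieved_ids: set[str], answer_ok: bool) -> Outcome:
    # Assumption: every expected source must be retrieved; extra sources are allowed.
    retrieval_ok = expected_ids <= retrieved_ids
    if retrieval_ok:
        return Outcome.PASS if answer_ok else Outcome.GENERATION_FAILURE
    return Outcome.SUSPICIOUS_PASS if answer_ok else Outcome.RETRIEVAL_FAILURE


expected = {"oci-security-guide::section-14", "oci-logging-docs::section-3"}
retrieved = {"oci-logging-docs::section-3", "oci-iam-docs::section-2"}  # hypothetical ids
print(classify(expected, retrieved, answer_ok=True).name)  # SUSPICIOUS_PASS
```

The SUSPICIOUS_PASS bucket is the one an output-only eval would report as a plain pass.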

LLM-as-judge with a structured rubric

For the dimensions that necessary-condition matching cannot cover -- factual accuracy against the source document, completeness, absence of hallucinated details -- I use an LLM-as-judge pass with a structured scoring rubric.

The rubric has four dimensions: factual correctness, answer completeness, groundedness in retrieved context, and absence of unsupported claims. Each dimension scores 0-2. The judge model outputs a JSON object with a score per dimension and a one-sentence rationale per dimension. Using structured output forces the rationale to be specific -- "unsupported claim about 180-day retention period not found in source documents" is useful feedback; a freeform paragraph that averages out to a score is not.

from pydantic import BaseModel, Field

class DimensionScore(BaseModel):
    # ge/le reject out-of-range scores when the judge's JSON is parsed.
    score: int = Field(ge=0, le=2)  # 0, 1, or 2
    rationale: str

class JudgeOutput(BaseModel):
    factual_correctness: DimensionScore
    completeness: DimensionScore
    groundedness: DimensionScore
    no_unsupported_claims: DimensionScore

    @property
    def total(self) -> int:
        return (
            self.factual_correctness.score
            + self.completeness.score
            + self.groundedness.score
            + self.no_unsupported_claims.score
        )

The judge model is gpt-4o-mini at the time I'm writing this. Not gpt-4o, not a frontier model. The scoring rubric is doing most of the work; the model just needs to follow structured instructions and read the source document. Using gpt-4o-mini keeps the per-eval cost low enough that I run the full suite on every pull request without it being a line item.

At roughly 1,500 input tokens per eval (query + retrieved context + generated answer + rubric) and about 200 output tokens, a run of 80 golden examples comes to about 120,000 input and 16,000 output tokens, or roughly $0.03 at gpt-4o-mini list pricing of $0.15 per million input tokens and $0.60 per million output tokens. The suite runs in about 90 seconds with basic parallelism. Both numbers are acceptable for a CI gate.
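The per-run cost is simple to recompute when the dataset grows or prices change. The prices below are assumed list prices in USD per million tokens and should be checked against the current price sheet; cached-input or batch discounts would lower the result.

```python
# Assumed list prices (USD per million tokens); verify before relying on them.
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.60


def suite_cost(examples: int, input_tokens: int, output_tokens: int) -> float:
    """Estimated judge cost in USD for one full run of the suite."""
    input_cost = examples * input_tokens * INPUT_PRICE_PER_M / 1_000_000
    output_cost = examples * output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


cost = suite_cost(examples=80, input_tokens=1_500, output_tokens=200)
print(f"${cost:.4f} per run")  # $0.0276 per run
```

At these assumed prices the suite stays in the range of a few cents per run, and it remains cheap even if the dataset grows several times over.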

One thing to be honest about: LLM-as-judge has known consistency issues. The same inputs can produce slightly different scores across runs, and the rubric calibration is my judgment call rather than ground truth. I treat judge scores as signals, not verdicts. A test "fails" only when the total score drops below a threshold AND the factual_correctness or no_unsupported_claims dimension specifically scores zero -- the dimensions I care most about. A completeness score of 1 instead of 2 is a warning, not a gate.
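The gate rule above can be written as a small function. This is a sketch under stated assumptions: the `fail_below` threshold of 5 is a hypothetical value (the actual threshold is not given here), and the plain dataclass stands in for the judge output model.

```python
from dataclasses import dataclass

MAX_TOTAL = 8  # four dimensions, each scored 0-2


@dataclass
class Scores:
    factual_correctness: int
    completeness: int
    groundedness: int
    no_unsupported_claims: int

    @property
    def total(self) -> int:
        return (self.factual_correctness + self.completeness
                + self.groundedness + self.no_unsupported_claims)


def verdict(scores: Scores, fail_below: int = 5) -> str:
    # A hard failure needs BOTH a low total and a zero on a critical dimension.
    critical_zero = (scores.factual_correctness == 0
                     or scores.no_unsupported_claims == 0)
    if scores.total < fail_below and critical_zero:
        return "fail"
    # Anything short of a perfect score is surfaced as a warning, not a gate.
    if scores.total < MAX_TOTAL:
        return "warn"
    return "pass"


print(verdict(Scores(0, 1, 2, 1)))  # fail
print(verdict(Scores(2, 1, 2, 2)))  # warn
print(verdict(Scores(2, 2, 2, 2)))  # pass
```

Because the two conditions are joined with AND, a zero on a critical dimension with an otherwise high total only warns, which is worth keeping in mind when choosing the threshold.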

Regression detection

The regression detection is the part that makes the harness worth using versus running eval once. Every run writes its scores to a SQLite database alongside a commit SHA, timestamp, and a content hash of the pipeline configuration. The content hash covers the embedding model name, chunk strategy parameters, retriever top-K, reranker model, and generation prompt template. Any change to those parameters produces a different hash.

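import hashlib
import json

def compute_pipeline_fingerprint(config: PipelineConfig) -> str:
    # Only parameters that change pipeline behavior go into the fingerprint.
    payload = {
        "embedding_model": config.embedding_model,
        "chunk_strategy": config.chunk_strategy,
        "chunk_size": config.chunk_size,
        "chunk_overlap": config.chunk_overlap,
        "top_k": config.retriever_top_k,
        "reranker": config.reranker_model,
        # Hash the template so the payload stays small but any edit changes it.
        "prompt_template_hash": hashlib.sha256(
            config.prompt_template.encode()
        ).hexdigest()[:16],
    }
    # sort_keys makes the JSON, and therefore the hash, independent of dict order.
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]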
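For the storage side, a sketch of how per-example scores might be written to and read back from SQLite. The table name, column names, and helper functions are assumptions for illustration; the harness's real schema is not shown in this post.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per (commit, fingerprint, example).
SCHEMA = """
CREATE TABLE IF NOT EXISTS eval_scores (
    commit_sha   TEXT NOT NULL,
    fingerprint  TEXT NOT NULL,
    example_id   TEXT NOT NULL,
    total_score  INTEGER NOT NULL,
    recorded_at  TEXT NOT NULL,
    PRIMARY KEY (commit_sha, fingerprint, example_id)
)
"""


def record_run(conn: sqlite3.Connection, commit_sha: str,
               fingerprint: str, scores: dict[str, int]) -> None:
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO eval_scores VALUES (?, ?, ?, ?, ?)",
            [(commit_sha, fingerprint, ex, score, now)
             for ex, score in scores.items()],
        )


def latest_scores(conn: sqlite3.Connection, fingerprint: str) -> dict[str, int]:
    rows = conn.execute(
        "SELECT example_id, total_score FROM eval_scores "
        "WHERE fingerprint = ? ORDER BY recorded_at, rowid",
        (fingerprint,),
    ).fetchall()
    return dict(rows)  # later rows overwrite earlier ones per example


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
record_run(conn, "abc1234", "f00dfeed1234", {"retention-policy": 7, "kms-rotation": 8})
record_run(conn, "def5678", "f00dfeed1234", {"retention-policy": 6})
print(latest_scores(conn, "f00dfeed1234"))
```

Keying rows on the fingerprint is what lets a later comparison pull only the scores produced by the same pipeline configuration.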

The CI check compares the current run's per-example scores against the baseline for the same pipeline fingerprint. If any golden example drops more than one point on the total score compared to baseline, the check flags it for review. If three or more examples regress, the check fails the PR.
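The comparison rule reduces to a few lines once baseline and current scores are available as dictionaries. The function and parameter names are illustrative; the defaults mirror the thresholds described above (a drop of more than one point flags an example, three or more flagged examples fail the check).

```python
def check_regressions(
    baseline: dict[str, int],
    current: dict[str, int],
    max_drop: int = 1,
    fail_count: int = 3,
) -> tuple[list[str], bool]:
    """Return (flagged example ids, whether the check should fail the PR)."""
    # Examples missing from either side are skipped here; a real harness
    # may want to treat a missing example as an error instead.
    flagged = sorted(
        example_id
        for example_id, base_score in baseline.items()
        if example_id in current and base_score - current[example_id] > max_drop
    )
    return flagged, len(flagged) >= fail_count


baseline = {"q1": 8, "q2": 7, "q3": 6, "q4": 8}
current = {"q1": 6, "q2": 6, "q3": 4, "q4": 5}
flagged, should_fail = check_regressions(baseline, current)
print(flagged, should_fail)  # ['q1', 'q3', 'q4'] True
```

In this example q2 drops by exactly one point, so it is not flagged.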

The thresholds are arbitrary and have been tuned once. They are too loose in some cases -- a one-point drop on a high-stakes example is worth catching even in isolation -- but the current setup has produced zero false-positive CI failures across about four months of use, which I value more than theoretical precision.

The CI integration

The harness runs as a GitHub Actions step triggered on pull requests that touch the pipeline code or the golden dataset. It does not run on documentation or frontend changes. The step is gated behind a secret for the judge model API key, which means it does not run on PRs from external forks -- acceptable given this is internal infrastructure.

The GitHub Actions step output includes a per-example score table and a regression summary. Failed examples get their judge rationale printed directly to the step output so the reviewer does not have to dig into artifacts to understand what broke.
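A sketch of how such a report could be rendered as plain text for the step log. The result dictionary keys (`id`, `baseline`, `current`, `failed`, `rationale`) are assumptions for illustration, as is the example rationale.

```python
def format_report(results: list[dict]) -> str:
    """Render a per-example score table plus rationales for failed examples."""
    lines = [f"{'example':<24}{'base':>6}{'now':>6}{'delta':>7}"]
    failed = []
    for r in results:
        delta = r["current"] - r["baseline"]
        lines.append(f"{r['id']:<24}{r['baseline']:>6}{r['current']:>6}{delta:>+7}")
        if r.get("failed"):
            failed.append(r)
    lines.append("")
    lines.append(f"{len(failed)} failed example(s)")
    # Print the judge rationale inline so reviewers need not open artifacts.
    for r in failed:
        lines.append(f"- {r['id']}: {r['rationale']}")
    return "\n".join(lines)


report = format_report([
    {"id": "retention-policy", "baseline": 8, "current": 4, "failed": True,
     "rationale": "unsupported claim about 180-day retention period"},
    {"id": "kms-rotation", "baseline": 7, "current": 7},
])
print(report)
```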

This took about a day to wire up, most of which was getting the SQLite state to persist across runs using GitHub Actions cache. The naive approach -- writing results to an artifact and downloading it at the start of the next run -- was too slow and too fragile. The cache approach works but has a race condition if two PRs run the eval suite simultaneously and both try to write a new baseline. I have not fixed this because it has not bitten me yet. It is the messiest part of the implementation.

The 20% that is still messy

The core eval loop, roughly 80% of the harness, is clean. The remaining 20% that handles everything else is not.

Golden dataset maintenance. Adding new examples is easy. Retiring examples when the system's scope changes is not. There is no principled way to decide when an old golden example is testing something the system no longer needs to do versus when it is testing a regression that has not been caught yet. I currently rely on human judgment (mine) to prune the dataset, so it has grown to 80 examples, some of which test behavior I have already deprecated.

The judge model's blind spots. The LLM-as-judge setup scores the answer only against the retrieved context, so it cannot detect errors introduced because the retrieval step missed a relevant document entirely. A complete retrieval miss looks like "high groundedness, low completeness" -- the answer is well-grounded in what was retrieved; it is just incomplete because retrieval was wrong. I know this is a gap. The workaround is the expected_context_source_ids check described earlier, which is not a complete fix.

No distributed tracing. The harness knows the pipeline's output and the retrieval layer's outputs. It does not have visibility into latency per stage, token counts per stage, or why the reranker scored a particular chunk highly. I log these to stdout and parse them in a post-hoc notebook when I need to debug a regression. This works until it does not. The right solution is proper tracing instrumentation.
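Short of adopting a tracing tool, stdout logging can at least be made structured so the post-hoc notebook parses JSON lines instead of free text. The context manager below is a hypothetical stopgap, not a substitute for real tracing; its name and record fields are assumptions.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def traced_stage(name: str, records: list[dict]):
    """Time one pipeline stage and emit a JSON line describing it."""
    record = {"stage": name}
    start = time.perf_counter()
    try:
        # The caller can attach extra fields (token counts, chunk ids) to the record.
        yield record
    finally:
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
        records.append(record)
        print(json.dumps(record))


records: list[dict] = []
with traced_stage("retrieve", records) as rec:
    rec["chunks_returned"] = 5  # placeholder for the real retrieval call
with traced_stage("generate", records) as rec:
    rec["output_tokens"] = 200  # placeholder for the real generation call
```

Each stage produces one machine-readable line, which is enough to answer "which stage got slower" without grepping free-form logs.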

What I would do differently starting today

I would use Langfuse for tracing from day one.

When I started building the harness, Langfuse's self-hosted offering was less mature and the Python SDK had more rough edges. That is no longer true. Langfuse handles the tracing, span-level latency, token counting, and cost attribution that I am currently doing badly with print statements. It also has a dataset management UI that would solve the golden example pruning problem I described above.

What I would keep custom: the eval logic itself. The LLM-as-judge rubric, the regression comparison against baseline, the CI gate thresholds -- these encode opinions about what matters for my specific system. Off-the-shelf eval frameworks want to provide the rubric. That is the wrong layer of abstraction to outsource, because what "correctness" means for a security architecture document retrieval system is not the same as what it means for a customer support chatbot.

So the architecture I would use today: Langfuse for tracing and dataset storage, custom Python (~300 lines, I could trim the current harness) for the eval scoring logic and CI integration. The Langfuse SDK calls replace my bespoke logging. The eval logic stays mine.

I might be wrong about keeping the eval logic custom. Promptfoo has gotten meaningfully more flexible since I last evaluated it, and Hamel Husain's writing on LLM evals (his evals posts on hamel.dev in particular) has influenced how I think about when custom code adds value versus when it only adds cost. Worth re-evaluating before assuming the custom approach is always the right call.

The mistake I see most often

Engineers evaluating LLM systems spend their time evaluating the model. They swap GPT-4o for Claude 3.5 Sonnet, run a benchmark, pick the better one, ship. The model choice matters, but it is rarely the dominant variable. The dominant variables are usually the retrieval step, the chunking strategy, and the prompt structure -- none of which a model benchmark tests.

The tell is when someone shows me eval results that are response-level scores with no information about what was retrieved. Those scores tell you how well the model performed given the retrieved context. They tell you nothing about whether the retrieved context was any good. The system I have described here measures both -- that is not an accident of design, it is the design.

If you run a model swap, hold everything else constant, and your system quality improves by 15%, the model probably mattered. If the system quality is flat or inconsistent, check retrieval before blaming the model. In my experience, that is the right prior more than half the time.

The eval harness described here runs against the pipeline components I use in consulting engagements and in the infrastructure I build for Pipeshift -- I am the founder, so take that mention with appropriate context. The golden dataset design is specific to RAG systems; the regression detection and CI integration pattern applies broadly. If you are building something similar and want to compare notes, the contact page has my email.