The Three-Layer Metric Pyramid for RAG Retrieval Evaluation

Most RAG evaluation frameworks I encounter measure the wrong thing with high precision. Teams run recall at k on a golden set, declare the retriever good enough, ship it, and then wonder six months later why users do not trust the system. The offline metric was clean. Production tells a different story.

I have built retrieval pipelines for a healthcare data platform, an architecture library generator, and the pattern-matching layer inside Pipeshift (I am the founder, so take that reference with the appropriate grain of salt). Across those systems, the pattern I kept hitting was the same: a single metric layer is always insufficient.

Layer 1: Offline recall on a golden set

The first layer is reproducible and cheap. You build a golden set of query-document pairs, run your retriever against it, and measure recall at k and MRR. This is the layer teams build first and often stop at.
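As a minimal sketch of what this layer computes, the snippet below scores a toy golden set with recall at k and MRR. The data layout (a dict of query to relevant document ids, and a dict of query to ranked retrieved ids) and the function names are assumptions made for illustration, not a prescribed format.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    golden: Dict[str, Set[str]],
    runs: Dict[str, List[str]],
    k: int = 5,
) -> Dict[str, float]:
    """Average recall at k and MRR over every query in the golden set."""
    if not golden:
        raise ValueError("golden set is empty")
    n = len(golden)
    # A query missing from `runs` counts as an empty result list.
    recall = sum(recall_at_k(runs.get(q, []), rel, k) for q, rel in golden.items()) / n
    mrr = sum(reciprocal_rank(runs.get(q, []), rel) for q, rel in golden.items()) / n
    return {"recall_at_k": recall, "mrr": mrr}


# Toy data, made up purely to exercise the functions.
golden_set = {"q1": {"d1", "d2"}, "q2": {"d7"}}
retriever_output = {"q1": ["d3", "d1", "d9"], "q2": ["d7", "d4", "d5"]}

print(evaluate(golden_set, retriever_output, k=3))
# {'recall_at_k': 0.75, 'mrr': 0.75}
```

Running the same function before and after a chunking or embedding change gives the regression signal this layer is for.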

What it catches: regressions in base retrieval quality. When you change your chunking strategy, embedding model, or index config, this is what tells you whether things got better or worse.

What it misses: whether the retrieved context is actually useful to the generation model. A document can be relevant to the query topic but not contain the specific passage needed to answer it. Layer 1 does not distinguish between these cases.

Layer 2: LLM judge on retrieved context quality

The second layer takes your query and the retrieved context and asks an LLM judge whether the retrieved context is sufficient to answer the query. I score on two dimensions: context relevance (does the retrieved set contain information pertinent to the query?) and context sufficiency (is there enough in the retrieved set to construct a complete answer?).
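One way to wire this up is sketched below. Everything specific here is an assumption for illustration: the prompt wording, the 1 to 5 scale, the JSON reply format, and `call_llm`, which stands in for whatever function sends a prompt to your model and returns its text. The stub at the bottom only exists so the example runs without a model.

```python
import json
from typing import Callable, Dict, List

# Hypothetical prompt; the wording and the 1-5 scale are illustrative choices.
JUDGE_PROMPT = """You are grading retrieved context for a question-answering system.

Question:
{query}

Retrieved context:
{context}

Score each dimension from 1 (poor) to 5 (excellent):
- relevance: does the context contain information pertinent to the question?
- sufficiency: is there enough in the context to construct a complete answer?

Reply with JSON only, for example: {{"relevance": 3, "sufficiency": 2}}"""


def build_prompt(query: str, chunks: List[str]) -> str:
    """Number the retrieved chunks and fill in the judge prompt."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return JUDGE_PROMPT.format(query=query, context=context)


def judge_context(
    query: str,
    chunks: List[str],
    call_llm: Callable[[str], str],  # hypothetical: prompt in, model text out
) -> Dict[str, int]:
    """Ask the judge for both scores and validate the reply."""
    raw = call_llm(build_prompt(query, chunks))
    scores = json.loads(raw)  # raises ValueError if the reply is not valid JSON
    result = {}
    for key in ("relevance", "sufficiency"):
        value = int(scores[key])
        if not 1 <= value <= 5:
            raise ValueError(f"{key} score out of range: {value}")
        result[key] = value
    return result


def mean_sufficiency(results: List[Dict[str, int]]) -> float:
    """Average sufficiency across judged queries."""
    return sum(r["sufficiency"] for r in results) / len(results)


def fake_llm(prompt: str) -> str:
    """Stub judge that returns a fixed reply, used only to run the example."""
    return '{"relevance": 4, "sufficiency": 2}'


scores = judge_context("What is the dosage limit?", ["chunk a", "chunk b"], fake_llm)
print(scores, mean_sufficiency([scores]))
```

In practice a real judge will sometimes return malformed JSON, so a retry or a logged skip around `judge_context` is worth adding.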

Sufficiency is the more useful signal. In every system I have deployed, mean sufficiency score has tracked user satisfaction more closely than any offline retrieval metric.

What it catches: cases where technically relevant documents are retrieved but the specific passage that answers the query is buried in noise, and cases where the retriever surfaces the right topic at the wrong granularity.

What it misses: what actually happens when a real person uses the system. The LLM judge is evaluating whether the context is good in principle. It cannot tell you whether your users accept the generated answer or route around the system.

Layer 3: User-facing proxies

The third layer is the one most teams skip. It is also the most correlated with user satisfaction.

The three proxies I track: acceptance rate (fraction of generated answers users act on without modification), follow-up rate (how often a session contains a follow-up query immediately after a response -- a signal that the first answer was incomplete), and refinement rate (fraction of answers users manually edit or regenerate).
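A rough sketch of how these three rates could be computed from logged answer events follows. The `AnswerEvent` record and its field names are hypothetical, and the sketch assumes every rate is taken per generated answer; counting follow-ups per session instead is an equally reasonable reading.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AnswerEvent:
    """One generated answer and what the user did next (hypothetical log schema)."""
    session_id: str
    accepted_unmodified: bool    # user acted on the answer as-is
    followed_up: bool            # next event in the session was another query
    edited_or_regenerated: bool  # user edited the answer or asked for a new one


def proxy_rates(events: List[AnswerEvent]) -> Dict[str, float]:
    """Compute the three user-facing proxies as fractions of all answers."""
    n = len(events)
    if n == 0:
        raise ValueError("no answer events to score")
    return {
        "acceptance_rate": sum(e.accepted_unmodified for e in events) / n,
        "follow_up_rate": sum(e.followed_up for e in events) / n,
        "refinement_rate": sum(e.edited_or_regenerated for e in events) / n,
    }


# Toy events, made up purely to exercise the function.
events = [
    AnswerEvent("s1", accepted_unmodified=True, followed_up=False, edited_or_regenerated=False),
    AnswerEvent("s1", accepted_unmodified=False, followed_up=False, edited_or_regenerated=True),
    AnswerEvent("s2", accepted_unmodified=False, followed_up=True, edited_or_regenerated=False),
    AnswerEvent("s3", accepted_unmodified=True, followed_up=False, edited_or_regenerated=False),
]

print(proxy_rates(events))
# {'acceptance_rate': 0.5, 'follow_up_rate': 0.25, 'refinement_rate': 0.25}
```

The hard part is not this arithmetic but deciding what counts as "acted on" in your UI, which is why the flags are taken as given here.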

These are noisy signals, hard to attribute specifically to retrieval quality, and subject to confounds from the generation model and UI. They are also the only signals that tell you whether real users in production are getting what they need.

Why you need all three

The layers catch different failure modes.

Offline recall catches model regressions but is blind to whether the retrieved context is useful for generation. The LLM judge catches context quality failures but is blind to whether users accept the output. User proxies catch UX-level failures but cannot tell you which part of the system to fix.

I have seen systems with strong offline recall metrics and terrible user acceptance rates. I have seen systems with mediocre LLM judge scores that users loved because the generation layer compensated for retrieval gaps. The only way to have confidence across all three failure modes is to instrument all three.

The practical sequencing: build layer 1 first even if your golden set is imperfect. Before the first production deployment, run a layer 2 judge evaluation on 100 representative queries. Once you have production traffic, start tracking at least one layer 3 proxy -- acceptance rate is the easiest to instrument. Revisit all three at your quarterly eval review.

None of this is complicated. The reason most teams stop at layer 1 is not technical capacity -- it is that layers 2 and 3 require more investment to set up, and the payoff is not visible until you have shipped something that is quietly failing in production.