The standard advice on RAG evaluation is to "use RAGAS and check your metrics." That advice is not wrong. It is also not sufficient, because RAGAS ships with ten-plus metrics and nobody tells you which three matter and which four are noise for your specific use case. I spent most of 2024 running every metric and getting charts I couldn't act on.
This is what my evaluation pipeline looks like now, two years in, after dropping the metrics that didn't predict anything useful and building the tooling that actually caught regressions before they hit users.
What changed from 2024
In 2024 my evaluation setup was: retrieve some chunks, run them through the RAGAS suite, look at a dashboard, feel vaguely okay about the numbers. The problems with that approach surfaced slowly.
The first problem: I had no golden dataset. I was evaluating against synthetic queries generated by the same LLM I was using for generation. This is circular. The model that tends to hallucinate also tends to rate its own hallucinations as grounded. My faithfulness scores looked good because my judge and my generator shared the same failure modes.
The second problem: I was running evaluation as a one-off audit rather than a regression gate. I would evaluate a new retrieval configuration, see better numbers, deploy, and not know whether the improvement held two weeks later when the corpus changed.
The third problem: I had no connection between my evaluation metrics and the signal that actually mattered -- whether users found answers useful. I was optimizing metrics in a vacuum.
In 2026 the pipeline has three components that address all of this: a golden dataset I trust, a trimmed metric set with a clear hierarchy, and a Langfuse-based trace layer that connects evaluations to real user sessions. I will walk through each.
Building a golden dataset I actually trust
The temptation with golden datasets is to generate everything with an LLM and ship it. It is fast and produces a large dataset. It is also the reason many evaluation setups are measuring model agreement rather than actual system quality.
My current approach is LLM-assisted with mandatory human spot-check at a fixed sampling rate.
The generation phase uses a seeded prompt that instructs the LLM to produce queries of specific types: factual lookups, multi-hop reasoning questions, negation questions ("what does the system NOT support"), and adversarial questions designed to tempt retrieval toward plausible but wrong chunks. For a typical corpus of a few hundred source documents this produces 200-400 raw question-answer pairs.
import json
from dataclasses import dataclass, asdict
from typing import Literal

import openai


@dataclass
class GoldenSample:
    question: str
    ground_truth: str
    source_doc_ids: list[str]
    question_type: Literal["factual", "multi_hop", "negation", "adversarial"]
    generated_by: str  # model version -- important for auditing
    human_verified: bool = False


SEED_PROMPT = """
You are constructing evaluation data for a RAG system.
Given the document excerpt below, generate ONE question and a ground-truth answer.
Question type: {question_type}
Document excerpt:
{excerpt}
Return JSON with keys: question, ground_truth, source_doc_id
The ground truth must be answerable solely from the excerpt.
For adversarial questions, phrase the question so a naive retriever
would return a plausible-but-wrong document.
"""


def generate_sample(
    client: openai.OpenAI,
    excerpt: str,
    doc_id: str,
    question_type: str,
    model: str = "gpt-4o-mini",
) -> GoldenSample | None:
    """Generate one candidate pair; return None if the response is unusable."""
    try:
        resp = client.chat.completions.create(
            model=model,
            response_format={"type": "json_object"},
            messages=[
                {
                    "role": "user",
                    "content": SEED_PROMPT.format(
                        question_type=question_type,
                        excerpt=excerpt,
                    ),
                }
            ],
            temperature=0.3,
        )
        data = json.loads(resp.choices[0].message.content)
        return GoldenSample(
            question=data["question"],
            ground_truth=data["ground_truth"],
            # Use the caller's doc_id rather than the id echoed by the model.
            source_doc_ids=[doc_id],
            question_type=question_type,
            generated_by=model,
        )
    except (KeyError, TypeError, json.JSONDecodeError):
        # TypeError covers an empty (None) message content or a non-object payload.
        return None
The spot-check protocol: I sample 15% of generated pairs stratified by question type. A human reviewer (me, or a domain expert on client engagements) reads each sampled pair and marks it as valid, invalid, or needs-edit. Invalid pairs are discarded. Needs-edit pairs are corrected and marked human_verified=True. The rest stay in the dataset marked unverified.
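The sampling step of that protocol can be sketched as follows. This is a minimal illustration, not the exact script I run: it assumes each sample is a dict with a question_type key (for example the output of asdict on a GoldenSample), and the function name, the fixed seed, and the round-up rule for small strata are my own choices for the example.

```python
import math
import random
from collections import defaultdict


def stratified_spot_check(samples, rate=0.15, seed=0):
    """Pick roughly `rate` of the samples from each question type for review."""
    rng = random.Random(seed)  # fixed seed so the review batch is reproducible
    by_type = defaultdict(list)
    for sample in samples:
        by_type[sample["question_type"]].append(sample)

    selected = []
    for question_type in sorted(by_type):
        group = by_type[question_type]
        # Round up so a small stratum still gets at least one reviewed pair.
        k = min(len(group), math.ceil(rate * len(group)))
        selected.extend(rng.sample(group, k))
    return selected
```

Rounding up slightly oversamples rare question types, which I consider the right direction to err in: adversarial and negation questions are the ones most likely to be malformed.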
When computing aggregate scores, I weight metrics from human-verified samples 3x relative to unverified samples. This is a rough heuristic, but it means my headline numbers are anchored to the samples I actually trust.
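The 3x weighting amounts to a weighted mean over per-sample scores. A minimal sketch, assuming you already have one score per sample plus its human_verified flag; the function name and argument layout are illustrative, not part of RAGAS.

```python
def weighted_aggregate(scores, verified_flags, verified_weight=3.0):
    """Weighted mean where human-verified samples count `verified_weight` times."""
    if len(scores) != len(verified_flags):
        raise ValueError("scores and verified_flags must have the same length")
    if not scores:
        raise ValueError("cannot aggregate an empty sample set")
    weights = [verified_weight if verified else 1.0 for verified in verified_flags]
    weighted_sum = sum(w * s for w, s in zip(weights, scores))
    return weighted_sum / sum(weights)
```

For example, one verified sample scoring 1.0 and one unverified sample scoring 0.0 aggregate to 0.75 rather than the unweighted 0.5.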
The dataset lives in a versioned JSON file checked into the evaluation repo. Every retrieval configuration change is evaluated against the same frozen dataset version. When the corpus changes significantly I generate a new dataset version and run both in parallel for a period so I can see whether the corpus change itself altered the baseline.
The metrics I use and the ones I dropped
RAGAS 0.1.x ships with: faithfulness, answer relevancy, context recall, context precision, context entity recall, noise sensitivity, answer correctness, answer similarity, context relevancy. That is nine metrics. I use three. Here is why.
Context precision: Are the retrieved chunks actually relevant to the question? This is my primary retrieval health metric. A drop in context precision is the first signal that my retriever is degrading -- either because the query routing changed, the embedding model drifted, or the corpus added noisy documents that are confusing similarity search. I watch this metric per question type, not just in aggregate, because adversarial questions behave differently from factual ones.
Faithfulness: Does the generated answer contain only claims that are grounded in the retrieved context? This is my primary generation quality metric. It catches the most dangerous failure mode: the system producing a fluent, confident answer that contradicts or extends beyond its retrieved context. Faithfulness correlates most closely with user satisfaction in my experience -- I will explain why below.
Context recall: Is the relevant information actually present somewhere in what we retrieved? This catches retrieval gaps rather than retrieval noise. Low context recall means we are missing the answer even when we are asking the right question. Precision and recall are diagnostically distinct: precision problems mean we retrieved junk, recall problems mean we missed the signal.
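Watching a metric per question type rather than only in aggregate is a simple group-by over per-sample scores. A sketch under the assumption that each row is a dict carrying the sample's question_type alongside its metric values; the row layout is hypothetical and depends on how you export per-sample results.

```python
from collections import defaultdict
from statistics import mean


def metric_by_question_type(rows, metric="context_precision"):
    """Average one metric separately for each question type."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["question_type"]].append(row[metric])
    return {question_type: mean(values) for question_type, values in grouped.items()}
```

A healthy aggregate can hide a collapsed stratum: if factual questions sit at 0.9 and adversarial ones at 0.4, the blended number looks acceptable while the retriever is failing exactly where it is being tested hardest.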
The metrics I dropped and why:
Answer correctness: Compares the generated answer against the ground truth by combining an LLM-judged factual overlap score with embedding-based semantic similarity. Sounds important. In practice, in my runs it tracked surface-form similarity and flagged failures on questions where the answer was substantively right but phrased differently. It also works best when the ground truth is in the same format as the generated answer, which breaks down for long-form generation. I replaced it with faithfulness plus a human review sample.
Answer relevancy: Measures whether the answer is on-topic relative to the query. It almost never failed in my pipelines except when the retriever completely whiffed -- a failure mode context precision already catches. Dropped.
Context entity recall: Useful if your use case is entity-heavy (financial reports, medical records). Not useful for my current engagements. Dropped for those systems.
BLEU and ROUGE: I tried these early, partly because they are familiar from NLP evaluation. They are wrong for RAG. BLEU and ROUGE measure n-gram overlap against a reference string. A RAG system that faithfully synthesizes information from three retrieved chunks into a clear answer will score poorly if the answer is not a near-copy of the ground truth. Conversely, a system that copies a retrieved chunk verbatim will score well even if the retrieved chunk was wrong. These metrics measure surface resemblance, not correctness or groundedness. They are fine for machine translation. They should not appear in a RAG evaluation.
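A toy illustration of that failure. The function below is a simplified unigram-overlap F1 in the spirit of ROUGE-1, not the real ROUGE or BLEU implementation, and the three sentences are made up for the example.

```python
import re
from collections import Counter


def unigram_f1(candidate, reference):
    """Simplified unigram-overlap F1 (ROUGE-1-like); for illustration only."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "The API rate limit is 100 requests per minute."
correct_paraphrase = "You can send at most 100 calls every 60 seconds."
wrong_near_copy = "The API rate limit is 500 requests per minute."

paraphrase_score = unigram_f1(correct_paraphrase, reference)
near_copy_score = unigram_f1(wrong_near_copy, reference)
```

The factually wrong near-copy shares eight of nine tokens with the reference and scores far higher than the correct paraphrase, which shares only the number. The metric rewards resemblance to the reference string and has no notion of whether the one token that differs is the one that matters.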
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    context_precision,
    context_recall,
)
from datasets import Dataset


def run_evaluation(samples: list[dict]) -> dict:
    """
    samples: list of dicts with keys:
    question, answer, contexts (list[str]), ground_truth
    """
    dataset = Dataset.from_list(samples)
    result = evaluate(
        dataset,
        metrics=[faithfulness, context_precision, context_recall],
    )
    return {
        "faithfulness": result["faithfulness"],
        "context_precision": result["context_precision"],
        "context_recall": result["context_recall"],
        "n_samples": len(samples),
    }
Why faithfulness correlates with user satisfaction
I have tried to be careful about this claim because "metric X predicts user satisfaction" is the kind of thing people assert without the data to back it up.
Here is what I actually observed, not a controlled study: on a client RAG deployment for internal technical documentation (anonymized, financial services sector), I ran a six-week period where I collected explicit user ratings -- thumbs up/down on individual answers -- alongside evaluation metric scores computed on the same queries. I correlated each metric against the rating signal.
Faithfulness had the strongest correlation. Context precision was second. Context recall was a weak third. Answer relevancy was noise.
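For readers who want to repeat this on their own data, the comparison can be sketched as a Pearson correlation between each metric and the binary rating (with a 0/1 variable this is the point-biserial correlation). This is an illustration of the idea rather than the analysis script from the engagement; the record layout and the thumbs_up field name are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is (near) constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x < 1e-12 or var_y < 1e-12:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_metrics_by_correlation(records, metrics):
    """Sort metrics by absolute correlation with the thumbs up/down rating."""
    ratings = [1.0 if record["thumbs_up"] else 0.0 for record in records]
    scored = {m: pearson([record[m] for record in records], ratings) for m in metrics}
    return sorted(scored.items(), key=lambda item: abs(item[1]), reverse=True)
```

With a few weeks of ratings this gives a ranking, not a proof. Rated answers are a self-selected subset of all answers, so treat the output as a prioritization signal.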
My interpretation: users are doing something like faithfulness checking implicitly. When a system's answer contradicts something they know, or makes a claim that seems unsupported, they rate it down. When the answer is grounded and cites context they can verify, they rate it up. The phrasing, the format, the length -- those matter less than whether the answer is defensible.
The failure mode that generates the most negative ratings is not "wrong answer to a hard question" -- users forgive the system for missing hard questions. The failure mode that destroys trust is "confident wrong answer to an easy question." That is precisely what a faithfulness failure looks like: the system generated a claim that its own retrieved context does not support.
I might be wrong about why faithfulness correlates -- the causal mechanism is my interpretation. But the correlation held across the six weeks and I have acted on it. Systems with faithfulness below 0.82 on my golden set I treat as not ready for production regardless of the other metrics.
Langfuse tracing setup
I switched to Langfuse from manual logging in late 2024. The main reasons: trace-level visibility into individual retrieval+generation pairs without writing custom logging infrastructure, and the ability to tie production traces to evaluation scores after the fact.
The setup for a LangChain-based RAG chain is minimal:
from langfuse import Langfuse
from langfuse.callback import CallbackHandler

langfuse = Langfuse(
    public_key="...",
    secret_key="...",
    host="https://cloud.langfuse.com",  # or self-hosted
)


def get_langfuse_handler(session_id: str, user_id: str) -> CallbackHandler:
    return CallbackHandler(
        session_id=session_id,
        user_id=user_id,
        tags=["production", "v2"],
        metadata={"corpus_version": CORPUS_VERSION},
    )


# In the request handler:
handler = get_langfuse_handler(
    session_id=request.session_id,
    user_id=request.user_id,
)
result = rag_chain.invoke(
    {"question": request.question},
    config={"callbacks": [handler]},
)
CORPUS_VERSION is a string I set at deploy time -- the git ref of the corpus snapshot that was indexed. This becomes load-bearing when a quality regression appears and I need to answer "did this start when the corpus changed or when the code changed?"
For systems not using LangChain, I use the Langfuse Python SDK directly with manual span creation:
from langfuse import Langfuse

langfuse = Langfuse()


def traced_retrieve_and_generate(question: str, session_id: str) -> dict:
    trace = langfuse.trace(
        name="rag-query",
        session_id=session_id,
        input={"question": question},
        metadata={"corpus_version": CORPUS_VERSION},
    )

    retrieval_span = trace.span(name="retrieval")
    chunks = retrieve(question)
    retrieval_span.end(
        output={"chunks": chunks, "count": len(chunks)},
        metadata={"top_k": TOP_K},
    )

    generation_span = trace.span(name="generation")
    answer = generate(question, chunks)
    generation_span.end(output={"answer": answer})

    trace.update(output={"answer": answer})
    return {"answer": answer, "trace_id": trace.id}
The trace.id goes back to the caller and gets stored alongside the response. When a user rates an answer, I log the trace ID with the rating. This means I can pull any low-rated production response into Langfuse and see exactly which chunks were retrieved, in what order, and what the generation call looked like -- without reconstructing it from logs.
The feature in Langfuse I use most heavily is online evaluation scoring. I push RAGAS scores back to Langfuse traces asynchronously:
def score_trace(trace_id: str, scores: dict):
    for metric_name, value in scores.items():
        langfuse.score(
            trace_id=trace_id,
            name=metric_name,
            value=value,
            data_type="NUMERIC",
        )
This lets me filter production traces by faithfulness score in the Langfuse UI, which is how I triage quality issues: show me all traces from the last 48 hours where faithfulness < 0.75 and user rating was negative. That intersection is where the actual problems are.
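I do this filtering in the Langfuse UI, but the same intersection is easy to express over exported trace records. A sketch, assuming each record is a dict with a timezone-aware timestamp, a faithfulness score, and a user_rating of -1, +1, or None when unrated; those field names are hypothetical and not the Langfuse export schema.

```python
from datetime import datetime, timedelta, timezone


def triage(traces, now, window_hours=48, faithfulness_cutoff=0.75):
    """Return recent traces that are both low-faithfulness and negatively rated."""
    window_start = now - timedelta(hours=window_hours)
    flagged = []
    for trace in traces:
        rating = trace.get("user_rating")
        if rating is None:
            continue  # unrated traces are out of scope for this view
        if (
            trace["timestamp"] >= window_start
            and trace["faithfulness"] < faithfulness_cutoff
            and rating < 0
        ):
            flagged.append(trace)
    return flagged
```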
The automated regression gate
The evaluation runs in CI on every pull request that touches retrieval configuration, embedding model, chunking logic, or prompt templates. It does not run on frontend or infra-only changes -- the gate takes 8-12 minutes and I don't want it blocking unrelated work.
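Most CI systems can express this kind of path filter natively. If yours cannot, the decision is small enough to make in a script. The sketch below assumes a hypothetical repo layout; the directory patterns are placeholders, not the paths in my repos.

```python
from fnmatch import fnmatch

# Hypothetical layout: adjust these patterns to wherever retrieval config,
# embedding setup, chunking logic, and prompt templates live in your repo.
GATED_PATTERNS = ["retrieval/*", "embeddings/*", "chunking/*", "prompts/*"]


def should_run_gate(changed_files, patterns=GATED_PATTERNS):
    """True if any changed path falls under a directory that affects RAG quality."""
    # fnmatch's "*" also matches "/" so nested files under a directory are covered.
    return any(
        fnmatch(path, pattern) for path in changed_files for pattern in patterns
    )
```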
The gate script is straightforward:
import sys
import json
from pathlib import Path

THRESHOLDS = {
    "faithfulness": 0.82,
    "context_precision": 0.78,
    "context_recall": 0.72,
}

# Allow metric to drop by at most this fraction relative to baseline
REGRESSION_TOLERANCE = 0.03


def check_regression(current: dict, baseline: dict) -> list[str]:
    failures = []
    for metric, threshold in THRESHOLDS.items():
        current_val = current.get(metric, 0.0)
        baseline_val = baseline.get(metric, 0.0)
        if current_val < threshold:
            failures.append(
                f"{metric}: {current_val:.3f} below absolute threshold {threshold}"
            )
        elif baseline_val > 0 and (baseline_val - current_val) / baseline_val > REGRESSION_TOLERANCE:
            failures.append(
                f"{metric}: {current_val:.3f} is {((baseline_val - current_val)/baseline_val)*100:.1f}% "
                f"below baseline {baseline_val:.3f} (tolerance {REGRESSION_TOLERANCE*100:.0f}%)"
            )
    return failures


if __name__ == "__main__":
    current_path = Path(sys.argv[1])
    baseline_path = Path(sys.argv[2])
    current = json.loads(current_path.read_text())
    baseline = json.loads(baseline_path.read_text())
    failures = check_regression(current, baseline)
    if failures:
        print("REGRESSION GATE FAILED:")
        for f in failures:
            print(f"  - {f}")
        sys.exit(1)
    else:
        print("Regression gate passed.")
        print(f"  faithfulness: {current['faithfulness']:.3f}")
        print(f"  context_precision: {current['context_precision']:.3f}")
        print(f"  context_recall: {current['context_recall']:.3f}")
        sys.exit(0)
The baseline file is the last passing evaluation stored as a CI artifact. When a change intentionally improves metrics, the developer updates the baseline as part of the PR. This makes metric improvements explicit and auditable rather than silently shifting what "good" means.
The 3% regression tolerance is intentional slack. Evaluation has noise -- small differences in the judge LLM calls produce variation of 1-2% on the same dataset run twice. Without tolerance, I would be blocking PRs on evaluation noise rather than real regressions. I landed on 3% after watching the distribution of run-to-run variance across about 40 evaluation runs.
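To pick a tolerance for your own setup, run the same configuration against the same frozen dataset several times and look at the spread. A minimal sketch of that measurement; the function name and the choice of summary statistics are mine, and the numbers in the usage note are made up.

```python
from statistics import mean, pstdev


def run_to_run_noise(values):
    """Summarize the spread of one metric across repeated identical runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate noise")
    center = mean(values)
    if center == 0:
        raise ValueError("mean is zero; relative noise is undefined")
    return {
        "mean": center,
        "rel_std": pstdev(values) / center,  # relative standard deviation
        "rel_range": (max(values) - min(values)) / center,  # worst-case swing
    }
```

If three identical runs give faithfulness of 0.80, 0.82, and 0.81, the relative range is about 2.5%, so a tolerance much tighter than that would fail PRs on judge variance alone.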
The absolute thresholds (0.82 faithfulness, 0.78 context precision, 0.72 context recall) are where I have set them for my current systems. These are not universal values. A support chatbot handling low-stakes queries might tolerate lower faithfulness than a system generating regulatory documentation. The thresholds should come from your user satisfaction data and your tolerance for the specific failure mode each metric catches.
The one thing I would add first if rebuilding
The feedback loop from production to the golden dataset is the piece I still do not have fully automated. When a low-rated production response appears in Langfuse with low faithfulness, that query-answer pair is a strong candidate for the golden dataset -- it is a real query from a real user that the system handled badly. Currently I review these manually and add them to the dataset by hand. I lose some of them because the review cadence is weekly, not continuous.
The right design is: a production trace with a negative (thumbs-down) rating and faithfulness below threshold gets queued automatically for human review. If the reviewer marks it valid, it gets added to the golden dataset with human_verified=True. This closes the loop between what the evaluation measures and the failure modes the system actually encounters.
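Since I have not built this, the following is only a sketch of the shape it could take, working on plain dicts. The field names, the in-memory queue, the deduplication on question text, and the reuse of 0.82 as the faithfulness threshold are all assumptions for illustration.

```python
def enqueue_for_review(traces, queue, faithfulness_threshold=0.82):
    """Append badly handled, negatively rated production queries to a review queue."""
    seen = {item["question"] for item in queue}
    for trace in traces:
        rating = trace.get("user_rating")
        is_candidate = (
            rating is not None
            and rating < 0
            and trace["faithfulness"] < faithfulness_threshold
        )
        if is_candidate and trace["question"] not in seen:
            queue.append(
                {
                    "question": trace["question"],
                    "answer": trace["answer"],
                    "trace_id": trace["trace_id"],
                    "human_verified": False,  # stays False until a reviewer approves
                }
            )
            seen.add(trace["question"])
    return queue


def approve(item, ground_truth):
    """Reviewer accepted the query: attach a ground truth and mark it verified."""
    return {**item, "ground_truth": ground_truth, "human_verified": True}
```

The reviewer still has to write the ground truth, because the production answer is by definition the one that failed. Automation only removes the step where candidates get lost between weekly reviews.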
I have not built this yet because the manual process has been tractable at current query volume. As the systems I maintain grow, it will become the constraint.
Some of this evaluation infrastructure has fed into the observability patterns I have been thinking through for Pipeshift -- specifically how to surface evaluation signal without adding a separate ops burden. Disclosing the founder relationship: I am building Pipeshift.