How a RAG System I Built Was Hallucinating on Every Third Query -- and What I Missed

I got a Slack message at 3:14am on a Tuesday. The client -- a B2B SaaS company, anonymized here -- had deployed a RAG-powered internal knowledge assistant I had built for them about six weeks earlier. The message was from their CTO: "Mohak something is very wrong. The system is citing sources that don't exist for this question. We caught it because an engineer knew the answer was wrong. How many others did we not catch."

That message is the kind that wakes you up regardless of what time it arrives.

This is the postmortem of what went wrong, how I debugged it, and what I should have caught before the system ever went to production.

What the system was supposed to do

The knowledge assistant indexed roughly 2,800 internal documents: product specs, support playbooks, engineering runbooks, past deal notes. The intended user was the pre-sales and support team. The query volume once deployed was around 200-400 queries per day. The LLM was GPT-4o via Azure OpenAI, the vector store was Qdrant 1.9, the embedding model was text-embedding-3-large. The retrieval was a standard top-5 cosine similarity search with no reranker and no score filtering. The generator was a straightforward RAG prompt: "Answer the question using only the documents below."

The offline evaluation I ran before deployment used 47 manually curated question-answer pairs drawn from support tickets. Retrieval precision@5 was 0.81. The eval set looked solid. I moved to production feeling reasonably confident.

I was wrong to feel confident, and the reason is something I should have known already.

What was actually happening

The CTO's message pointed at one failure. By morning I had identified the pattern: roughly one in three queries was returning plausible-sounding answers with citations that were either wrong (the document existed but did not say what the answer claimed) or invented (the document ID was hallucinated entirely). The affected queries shared a characteristic that I did not spot at first: they were phrased in the language of the user's customers, not in the language of the internal documents.

Support reps and pre-sales engineers had started using the system, and their queries sounded like: "what should I tell a customer who says they can't get SSO to work with Okta" or "customer is asking about data residency for EU contracts, what do I say." The internal documents were written in product-manager-and-engineer language: "SAML 2.0 integration configuration," "data sovereignty compliance posture," "regional data processing addendum."

The query-document vocabulary gap was significant. A cosine similarity search on text-embedding-3-large embeddings does handle semantic similarity well enough to find the right document conceptually. But when the top-5 retrieved chunks had cosine similarity scores of 0.61, 0.59, 0.57, 0.55, 0.52 -- all above the implicit "good enough" threshold of "returned anything" -- nothing in the pipeline was asking whether any of those were actually relevant.

GPT-4o was receiving five marginally relevant chunks and a "use only the documents below" instruction. Rather than saying "I cannot answer this from the provided documents," it was synthesizing an answer from partial context and hallucinating the specific details needed to make the answer sound complete. The citations attached to those answers pointed at whichever retrieved chunks had been used as the generation scaffold, not at documents that actually contained the answer.

What my offline eval missed and why

My 47-question eval set was drawn from resolved support tickets. Those tickets were written by support engineers describing problems in internal language, then resolved by the same engineers who wrote the documents. As a result, the eval queries matched the documents' vocabulary far more closely than the queries that appeared in real use.

This is a known problem. Hamel Husain has written about it clearly -- eval sets that are constructed from the same distribution as the corpus will overestimate real-world retrieval quality. The fix is to include adversarial or out-of-distribution queries in the eval set before deployment. I knew this in principle and did not do it in practice.

The second thing the offline eval missed: I was measuring retrieval precision@5 but not setting a minimum relevance threshold. If the top-5 retrieved chunks had scores of 0.61-0.52, that would count as a successful retrieval in my eval as long as the right document appeared somewhere in the five. That is a flawed metric for a system where the generator will produce an answer from whatever it receives. A retrieval that returns five marginally relevant documents is not equivalent to a retrieval that returns two highly relevant documents and refuses to speculate from the rest.
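To make that gap concrete, here is a minimal Python sketch of three checks side by side. It is an illustration rather than the harness used in the incident: the function names and document IDs are made up, and only the five scores come from the failing retrievals described earlier. A "right document somewhere in the five" check behaves like a hit-rate test, precision@5 counts how many of the five are relevant, and neither of them looks at the scores at all.

```python
from typing import Sequence, Set


def hit_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Return 1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in retrieved_ids[:k]) else 0.0


def precision_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Return the fraction of the top-k slots filled by relevant documents."""
    return sum(d in relevant_ids for d in retrieved_ids[:k]) / k


def clears_score_gate(scores: Sequence[float], min_score: float) -> bool:
    """Return True if at least one retrieved chunk meets the minimum score."""
    return any(s >= min_score for s in scores)


# Hypothetical retrieval: five chunks, one of which is the labeled answer.
retrieved = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
relevant = {"doc_c"}
scores = [0.61, 0.59, 0.57, 0.55, 0.52]  # the marginal scores quoted above

print(hit_at_k(retrieved, relevant))        # the lenient check passes
print(precision_at_k(retrieved, relevant))  # only one of the five is relevant
print(clears_score_gate(scores, 0.70))      # nothing clears a 0.70 gate
```

Reporting the three numbers together makes it harder for a retrieval that technically contains the right document to hide how weak the rest of the context is.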

I should have measured answer faithfulness against retrieved context as a separate metric -- not just whether retrieval found the right documents, but whether the generator's output was actually supported by what it received. The RAGAS framework covers this with its faithfulness metric. I was aware of RAGAS and used it partially. I did not apply the faithfulness evaluation to my eval set before deployment. That is the clearest thing I got wrong.
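The shape of that metric can be sketched in a few lines. This is not the RAGAS API: RAGAS uses an LLM to break the answer into claims and to judge each claim against the context, while the sketch below takes the claims as given and substitutes a naive substring check (naive_is_supported, a hypothetical stand-in) so that it runs without any model. The example context and claims are invented for illustration.

```python
from typing import Callable, List


def faithfulness(
    claims: List[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Return the share of answer claims that the context supports."""
    if not claims:
        # Design choice for this sketch: an answer with no claims scores 0.0.
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_is_supported(claim: str, context: str) -> bool:
    """Toy judge: a claim counts as supported only if it appears verbatim."""
    return claim.lower() in context.lower()


# Invented example: one claim is in the context, one is filled in by the model.
context = "SSO uses SAML 2.0. Metadata is exchanged by URL."
claims = ["SSO uses SAML 2.0", "Setup takes five minutes"]
print(faithfulness(claims, context, naive_is_supported))
```

The point of the separate metric is visible even in the toy: retrieval can be judged correct for this query while half of the answer is unsupported.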

The 3am debugging timeline

After the Slack message I did not go back to sleep. Here is the sequence:

3:20am. Pulled the last 48 hours of query logs from the Azure OpenAI instance. The logs store the full prompt including the retrieved context. Identified 12 queries with clearly wrong answers in the logs. Checked the retrieval scores on each: all 12 had max cosine similarity below 0.65.

4:10am. Ran the 12 failing queries manually against Qdrant with with_vectors=False, with_payload=True. Confirmed the retrieval was returning plausible-but-wrong documents in all 12 cases. The documents existed; they were just not the right documents for the question.

4:45am. Ran the same 12 queries through a cross-encoder reranker (cross-encoder/ms-marco-MiniLM-L-6-v2, running locally) on the top-10 retrieved chunks. In 9 of 12 cases, the reranker's top-1 document was more relevant than the retriever's top-1. In 6 of those 9, the reranker's top-1 document actually contained the correct answer.

5:30am. Drafted a message to the CTO: here is what is happening, here is what I am going to fix, here is the expected timeline. Sent it at 6am rather than 5:30 because I wanted to have a patch plan ready before opening the conversation.

Day 1. Deployed a temporary fix: added a retrieval score threshold of 0.70 (Qdrant cosine similarity, so higher is more relevant; this threshold rejects the bottom tier of retrieval results). For queries where no retrieved chunk met the 0.70 threshold, the system now returns "I could not find relevant information in the knowledge base for this question. Please escalate to the product team." This was a deliberately fast fix: no reranker yet, just a gate on the retriever output.

Day 3. Deployed the reranker. Integrated cross-encoder/ms-marco-MiniLM-L-6-v2 as a second-stage retrieval step: retrieve top-15 by cosine similarity, rerank with the cross-encoder, pass top-4 to the generator. Cross-encoders are compute-expensive relative to bi-encoder retrieval; running them on 15 candidates per query added roughly 180ms latency on the query path, which was acceptable for this use case.

Day 5. Rebuilt the eval set from scratch. Added 60 queries written in customer-facing language by asking three support reps to write questions the way their customers write them. Added 30 deliberately ambiguous queries where the correct answer would be "I don't have enough information." New eval set: 137 questions total. Precision@5 on the old eval set had been 0.81; on the new eval set, the pre-fix system scored 0.53. That number should have been the pre-deployment number. It was not, because I never built an eval set that reflected actual user language.

Day 7. Ran the full new eval set against the patched system (score threshold + reranker). Precision@5 was 0.88 on the combined set. More importantly, RAGAS faithfulness on the 137-question set was 0.91. RAGAS scores each answer by the share of its claims that the retrieved context supports, so 0.91 is an average across answers rather than a count of fully supported ones. The answers that scored below 1.0 were all cases where the generator used partial context to answer a compound question; that is a generation problem, not a retrieval problem, and a separate fix.
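The Day 1 gate and the Day 3 reranker fit together roughly as in the sketch below. It is a simplified illustration under stated assumptions, not the deployed code: the Candidate type and function names are hypothetical, the gate is applied to first-stage scores before reranking (the order is my assumption), and a toy word-overlap scorer stands in for the cross-encoder so the example runs without a model. In a real pipeline rerank_score would wrap the cross-encoder and the candidates would be the top-15 hits from the vector store.

```python
from typing import Callable, List, NamedTuple, Optional


class Candidate(NamedTuple):
    doc_id: str
    text: str
    score: float  # first-stage cosine similarity


def select_context(
    query: str,
    candidates: List[Candidate],
    rerank_score: Callable[[str, str], float],
    min_score: float = 0.70,
    top_n: int = 4,
) -> Optional[List[Candidate]]:
    """Return the chunks to hand to the generator, or None to abstain."""
    # Gate: abstain when no first-stage candidate clears the threshold.
    if not any(c.score >= min_score for c in candidates):
        return None
    # Second stage: reorder candidates by the more expensive relevance score.
    reranked = sorted(candidates, key=lambda c: rerank_score(query, c.text), reverse=True)
    return reranked[:top_n]


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: share of query words found in the text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


# Invented candidates: the best first-stage hit is not the most relevant one.
candidates = [
    Candidate("a", "data residency overview", 0.72),
    Candidate("b", "okta sso saml setup", 0.66),
]
selected = select_context("okta sso setup", candidates, overlap_score)
print([c.doc_id for c in selected] if selected else "abstain")
```

Returning None instead of an empty list keeps the abstain path explicit, so the caller has to decide what message to show rather than sending an empty context to the generator.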

What the fix actually is

The root cause was retrieval returning low-relevance documents with no gate on minimum relevance, combined with an offline eval set that was too close to the corpus vocabulary to catch it.

Three things changed the system's behavior:

A retrieval score threshold. Cosine similarity of 0.70 as a minimum. Below this, the system abstains rather than generating from insufficient context. I set 0.70 as the Day 1 stopgap and then checked it empirically once the 137-question eval set existed: I swept from 0.60 to 0.80 in 0.05 increments and found 0.70 balanced recall (not cutting too many genuinely answerable questions) against precision (not letting marginal retrievals through). The threshold is document-corpus-specific; it will be different for a different deployment.

A cross-encoder reranker as the second retrieval stage. The bi-encoder retrieval (cosine similarity on text-embedding-3-large) finds candidates efficiently. The cross-encoder reranker sorts those candidates by a more expensive but more accurate relevance judgment. This is the standard two-stage retrieval pattern described in the BEIR paper and in Jerry Liu's LlamaIndex documentation. I knew this pattern and did not implement it initially because I thought the first-stage retrieval was good enough. It was not.

An eval set built from actual user language. The 137-question set now includes adversarial queries, out-of-vocabulary queries, and ambiguous queries. It gets extended every two weeks by adding 10 questions sampled from the query log. The eval is not static. Eval sets that are built once and never updated become stale as user behavior evolves.
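A minimal sketch of that refresh step, assuming the logged queries are already loaded as plain strings. The function name, the in-memory lists, and the case-insensitive deduplication are illustrative assumptions, and the sampled questions still need a person to write the expected answer before they join the eval set.

```python
import random
from typing import Iterable, List, Optional


def sample_new_eval_questions(
    logged_queries: Iterable[str],
    existing_questions: Iterable[str],
    n: int = 10,
    seed: Optional[int] = None,
) -> List[str]:
    """Pick up to n logged queries that are not already in the eval set."""
    seen = {q.strip().lower() for q in existing_questions}
    fresh: List[str] = []
    for query in logged_queries:
        key = query.strip().lower()
        # Skip blanks, questions already in the eval set, and repeats in the log.
        if key and key not in seen:
            seen.add(key)
            fresh.append(query.strip())
    rng = random.Random(seed)
    return rng.sample(fresh, min(n, len(fresh)))


# Invented log lines for illustration.
log = ["How do I reset SSO?", "how do i reset sso?", "EU data residency?", ""]
print(sample_new_eval_questions(log, existing_questions=["EU data residency?"], n=10, seed=0))
```

Random sampling keeps the new questions representative of what users actually ask; picking only the queries that failed would bias the set toward known problems.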

What I would do differently from the start

Never implement retrieval without a score threshold. The "return top-K regardless of score" pattern is a tutorial default that makes sense when you are prototyping and makes no sense in a production system that will return answers to real users. A generator receiving five low-relevance chunks will hallucinate because hallucination is the path of least resistance when context is insufficient. In Qdrant the threshold is a single parameter on the search request (score_threshold), so it is a one-line change; there is no reason not to set it.

Build the eval set in two parts: one drawn from the corpus to validate that retrieval finds documents when the user already knows the right vocabulary, and one drawn from actual or simulated user queries to validate that it handles real language. The ratio I use now is roughly 40/60 corpus-drawn versus user-drawn. The user-drawn questions are harder to source at the start of a project but essential: ask the people who will use the system to write 30 questions as if they were asking a colleague, before they know how the system works.

Measure answer faithfulness, not just retrieval precision. Retrieval precision tells you whether the right documents were found. Faithfulness tells you whether the generator stayed within those documents. A system can have high retrieval precision and still hallucinate if the retrieved context is sparse and the generator fills the gaps. These are separate failure modes that require separate metrics.

Implement the two-stage retrieval pattern before production, not as a post-incident fix. The compute cost of a cross-encoder over 15 candidates is real but usually not the binding constraint for a knowledge assistant with low-to-medium query volume. The latency cost at 180ms is acceptable for most non-streaming use cases. The benefit -- meaningfully better relevance ranking than cosine similarity alone -- is documented in the retrieval literature and my own data confirmed it. I did not implement it initially because the offline eval made the first-stage retrieval look good enough. The lesson is that "offline eval passes" is not the same as "system is ready for production user distribution."

I might be wrong about the threshold value I landed on. 0.70 cosine similarity on text-embedding-3-large for this corpus is calibrated to this specific document collection and this query distribution. A corpus with longer, denser documents would likely need a different threshold. A corpus where documents are highly homogeneous would likely need a higher threshold because more documents will pass a low bar. The threshold needs to be calibrated per deployment, not copied from a postmortem.
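A calibration sweep of the kind described earlier (0.60 to 0.80 in 0.05 steps) can be sketched like this. The record layout and the toy data are hypothetical: each record holds the top retrieval score for one eval query and a label for whether the retrieved context actually contained the answer. The numbers are invented to show the mechanics and are not results from this deployment.

```python
from typing import List, NamedTuple, Sequence, Tuple


class EvalRecord(NamedTuple):
    max_score: float  # highest first-stage similarity for the query
    answerable: bool  # True if the retrieved context contains the answer


def sweep_thresholds(
    records: Sequence[EvalRecord],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) of the gate at each threshold."""
    total_answerable = sum(r.answerable for r in records)
    rows = []
    for t in thresholds:
        passed = [r for r in records if r.max_score >= t]
        true_pos = sum(r.answerable for r in passed)
        # Precision is reported as 0.0 when the gate lets nothing through.
        precision = true_pos / len(passed) if passed else 0.0
        recall = true_pos / total_answerable if total_answerable else 0.0
        rows.append((t, precision, recall))
    return rows


# 0.60, 0.65, 0.70, 0.75, 0.80; rounding avoids float drift in the steps.
thresholds = [round(0.60 + 0.05 * i, 2) for i in range(5)]

# Invented eval records for illustration only.
toy_records = [
    EvalRecord(0.82, True),
    EvalRecord(0.74, True),
    EvalRecord(0.71, True),
    EvalRecord(0.66, False),
    EvalRecord(0.63, True),
    EvalRecord(0.58, False),
]

for t, p, r in sweep_thresholds(toy_records, thresholds):
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Rerunning the sweep whenever the corpus or the eval set changes is what turns the threshold from a copied constant into a calibrated one.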

The structural problem with how RAG systems go to production

This incident is not unusual. I have seen the same failure mode -- good offline eval, bad production behavior because the eval distribution did not match the user distribution -- described by enough practitioners that it is clearly a common pattern rather than a one-time mistake. Jerry Liu has talked about it. Hamel Husain's eval writing covers it. The retrieval literature covers query-document distribution shift explicitly.

The reason it keeps happening is that offline eval sets are almost always built by the people who also built the corpus. Those people know the vocabulary of the documents. Their test queries are written in document vocabulary. The eval looks good. The system ships. Real users, who do not know the internal vocabulary, get back hallucinated answers.

The fix is structural: treat the eval set as a first-class artifact with the same rigor as the index. Get user queries before deployment, not after. Make query sampling from production logs an ongoing part of the system's maintenance, not a post-incident remediation.

Six weeks from deployment to incident is six weeks of a production system operating outside its validated operating envelope. That is the cost of the gap in my initial eval methodology.

This incident informed the retrieval evaluation approach I now use as a default in RAG system design. If you are building or auditing a production RAG system and want a second opinion on the eval methodology before going live, my calendar is open.