Every engagement starts the same way. The client has a use case -- a domain-specific assistant, a document QA system, a support bot that needs to sound like the company -- and someone on their team has already formed an opinion. Usually it's one of two positions: "we have proprietary data so we need to fine-tune" or "RAG is the modern thing, just do RAG." Neither framing is a decision. They're vibes dressed up as technical positions.
My default answer is RAG first, with a good reranker, and fine-tune only if that fails on specific measurable criteria. I hold that position firmly enough that I'll push back on a client who has already budgeted for fine-tuning before I've asked a single diagnostic question. But the position has nuance, and the nuance is what this post is about.
I've fine-tuned when I shouldn't have. I've also seen teams spend six months building a RAG system for a use case where a two-day fine-tune would have been the better answer from day one. The wrong choice either direction is expensive. Here is the decision tree that keeps me out of the expensive wrong answers most of the time.
The first question: what problem are you actually solving?
Before anything else, I need to know what the model is bad at. This sounds obvious but most clients skip it. They describe what they want the model to do, not what the baseline model fails at when it tries.
I ask this explicitly: take your production query set -- or if you don't have one, a representative sample of 50 real queries -- run them against the base model with no modifications, and tell me the failure mode. The failure mode almost entirely determines the right approach.
There are three categories:
The model doesn't know the information. Internal docs, recent events after the training cutoff, proprietary knowledge, company-specific terminology. The model hallucinates or says it doesn't know. This is a knowledge gap.
The model knows the information but reasons about it incorrectly for your domain. It produces the right facts but wrong conclusions, or applies general-purpose reasoning patterns that don't match your domain conventions. This is a reasoning gap.
The model knows the information but produces the wrong format, tone, or structure. It answers correctly but in the wrong style, with the wrong level of formality, in the wrong output schema. This is a behavior gap.
RAG solves knowledge gaps cleanly. Fine-tuning is relevant for reasoning gaps and behavior gaps. That's the first branch of the tree.
If your use case is "the model doesn't know our internal documentation," the answer is RAG, and fine-tuning the model on your documentation is almost always a waste of money. The model doesn't memorize fine-tuning data the way a database stores records -- it updates weights in ways that improve average behavior on the training distribution but that do not reliably encode specific facts you can query back out. Chip Huyen makes this point clearly in the FLAN paper discussion in "AI Engineering," and I've validated it empirically: you can fine-tune a model on a corpus of internal documents and it will still hallucinate specifics from those documents under distribution shift in the queries. RAG doesn't have this problem because the information is in the context, not in the weights.
The second question: can you retrieve the right information at query time?
Assuming the problem is a knowledge gap and RAG is the candidate: RAG is only as good as retrieval. Before you commit to a RAG architecture, you need to answer whether the information the model needs can actually be retrieved given realistic queries.
The failure modes here are:
Information not in the corpus. Obvious but common. Clients often think they've indexed everything and haven't. Check coverage before you build retrieval.
Retrieval precision is too low. The right document exists but the top-K retrieved results don't include it consistently. This is usually a chunking problem or an embedding model mismatch. I wrote about section-level chunking for structured documents elsewhere -- the short version is that naive 512-token fixed-size chunking fails for hierarchically structured documents and the fix is structural chunking, not a different retrieval algorithm.
Retrieved context is too diffuse. You retrieve 10 chunks and the relevant information is split across four of them with irrelevant material in between. The LLM's attention dilutes. This is where a reranker pays for itself.
I use a cross-encoder reranker by default for any RAG system in production -- currently Cohere's reranker API or a self-hosted cross-encoder/ms-marco-MiniLM-L-6-v2 depending on the client's data residency requirements. The reranker takes the top-20 retrieved chunks and reorders them by relevance to the actual query, then the LLM sees only the top-5. The improvement in answer quality from adding a reranker is almost always larger than the improvement from switching embedding models or tuning chunk size, and it's the step most tutorial RAG stacks omit.
If RAG with a good reranker still fails on your test queries, now I ask: where specifically is it failing? That diagnostic points toward whether fine-tuning is the right next step or whether the problem is something else entirely (usually corpus quality or query ambiguity).
The third question: is this a reasoning gap or a behavior gap?
If RAG is not the solution -- or if it's part of the solution but not sufficient -- I need to know whether the gap is reasoning or behavior.
Reasoning gaps are legitimate fine-tuning candidates. A medical coding assistant that needs to apply ICD-10 coding logic the base model gets wrong. A legal assistant that needs to reason about jurisdiction-specific contract clauses following conventions the model doesn't have. A financial model that needs to apply a proprietary analytical framework to earnings transcripts. These cases have common structure: the base model is applying general-purpose reasoning when your domain has specific reasoning conventions, and those conventions are stable enough to encode in training data.
I fine-tuned a code review model for a client whose codebase had a set of company-specific conventions -- naming patterns, module boundaries, dependency rules -- that the base model consistently missed. The conventions were documented but not in a form that RAG could reliably retrieve at code review time (hundreds of micro-conventions spread across a style guide, not structured as Q&A). Fine-tuning on ~2,000 annotated review examples produced a model that caught convention violations correctly. That was the right call.
Behavior gaps are often easier to solve without fine-tuning. If the problem is tone, format, or style, strong system prompting and few-shot examples in the prompt handle the majority of cases. I'm consistently surprised by how much behavior you can shape with a well-written system prompt and four to six examples before you need fine-tuning. Fine-tuning for behavior changes is legitimate -- it reduces token costs at inference if you're doing a high-volume system -- but it should not be the first attempt.
The cases where I've fine-tuned for behavior and regretted it: a client wanted a customer support bot to match their brand voice. We fine-tuned on 800 examples of their human support agents' responses. The resulting model was better on brand voice but noticeably worse at complex troubleshooting -- the fine-tuning had bought style at the cost of reasoning capability, which is a known failure mode called "alignment tax" or capability regression. We ended up running the fine-tuned model for simple queries and routing complex queries to the base model with a system prompt. That architecture is more complex than either pure approach and we created it by fine-tuning when we shouldn't have.
The hidden cost most teams miss: eval set maintenance
Here is the conversation I have with almost every client who wants to fine-tune, and it's the one where I see the most resistance:
Fine-tuning is not a one-time cost. It is an ongoing cost whose largest component is not compute -- it's eval set maintenance.
When you fine-tune, you need an eval set to know whether the fine-tune worked and whether it regressed on capabilities you care about. That eval set needs to:
- Cover your production query distribution, which drifts over time
- Include adversarial cases and edge cases
- Have human-labeled ground truth for the dimensions you care about
Building the initial eval set for a fine-tune typically takes 1 to 3 weeks of engineering time and requires domain experts to label examples. For the code review model I mentioned: 40 hours of engineer time building the eval, involving two senior engineers who understood the conventions well enough to create reliable labels.
But the eval set is perishable. When the base model updates -- which happens on roughly 6-month cycles for major providers -- you need to re-run your evals on the new base before deciding whether to re-fine-tune. When your codebase conventions change, your eval set goes stale. When your query distribution shifts because the product changes, you need new eval coverage.
The total 12-month cost of a fine-tuned model is roughly: initial fine-tune compute + eval set build + two re-fine-tune cycles when the base model updates + one eval set refresh for distribution drift + engineering time to manage all of this. For a GPT-4o fine-tune with a training set of 2,000 examples, the compute is cheap -- maybe $300-600 for the initial run. The engineering time is where the real cost is, and it compounds.
RAG eval maintenance is also real but cheaper. Your retrieval evals (does the right document come back in the top-5?) are largely stable unless your corpus changes significantly. The eval work is not zero, but it doesn't require re-running training every time the base model updates.
I am not saying fine-tuning is always the wrong choice. I'm saying teams systematically underestimate the ongoing cost relative to the upfront training cost, and that asymmetry drives a lot of bad decisions.
The actual decision tree
Here is the condensed version:
Step 1. Identify the failure mode on your actual query distribution -- knowledge gap, reasoning gap, or behavior gap.
Step 2. If knowledge gap: build RAG with a reranker. Measure retrieval precision and recall at top-5. If it's above 0.7 precision on your test set, ship it. If not, debug retrieval before anything else.
Step 3. If RAG with a reranker passes retrieval quality but still fails on answer quality: audit the failures. Are they knowledge failures (the retrieved content was right but the LLM missed it) or reasoning failures (the retrieved content was insufficient for the reasoning required)?
Step 4. If reasoning gap: fine-tuning is a legitimate candidate. Before committing: Can you build and maintain an eval set with 150+ labeled examples? Do you have domain experts available to label? Is the reasoning pattern stable enough to encode in training data (i.e., will the conventions still be true in 12 months)? If yes to all three, fine-tune.
Step 5. If behavior gap: try system prompt + 4-6 few-shot examples first. Only fine-tune if you've validated the prompt approach fails and you have the inference volume to justify the compute savings.
Step 6. If you're fine-tuning: the eval set is not optional and must come before training, not after. No eval set means you cannot know if it worked. This is a hard requirement, not a nice-to-have.
The cases where fine-tuning was clearly right
The code review model above was one. Another: a client in financial services needed a model that applied their specific options pricing commentary format -- a highly structured, domain-specific output schema with regulatory-constrained language. Few-shot prompting with examples got to ~70% format compliance. Fine-tuning on 1,800 labeled examples got to ~96%. The volume was high enough (30,000 reports/month) that the inference cost savings from using a fine-tuned 70B model instead of GPT-4o paid back the fine-tuning investment in about 45 days. That math works.
A third case: a specialized domain where the base model had genuinely wrong priors. A client building tooling for a niche industrial inspection process found that GPT-4o would confidently apply OSHA standards when the relevant standards were actually a different body's (industry-specific). RAG on the correct standards helped but didn't fully solve it -- the model kept reverting to OSHA patterns under paraphrase attacks. Fine-tuning corrected the prior. I have not found a RAG-only solution for cases where the base model's prior is strongly wrong and the query distribution includes paraphrases that trigger the wrong prior.
The cases where fine-tuning was the wrong call
The customer support voice example above was one. Another: a client wanted to fine-tune because their product had an unusual API with conventions the base model didn't know. Fine-tuning was not the answer -- the documentation was structured enough that RAG on the API docs with section-level chunking worked correctly. The team had been thinking about it as "the model doesn't know our API" (knowledge gap) but was reaching for fine-tuning because of a vague intuition that "the model needs to learn it deeply." The result would have been a fine-tuned model that still hallucinated API behavior under edge cases because fine-tuning doesn't encode facts reliably.
One more case: a startup that wanted to fine-tune to "make the model faster." Fine-tuning a smaller model is a real technique for inference cost optimization -- but only after you've validated the quality bar is achievable with the smaller model. They wanted to start with fine-tuning rather than first benchmarking whether the smaller model even got close to the quality bar without fine-tuning. The order of operations matters: validate quality, then optimize cost. Not the other way around.
Where I might be wrong
My strong prior toward RAG-first may be too conservative for teams with high inference volume from day one, where the compute savings from a fine-tuned smaller model are real and immediate. Hamel Husain's work on LLM evals has moved me to take eval set quality more seriously than I did two years ago -- well-curated eval sets reduce the maintenance burden significantly, which shifts some of the fine-tuning cost calculus. I'm also watching developments in continuous fine-tuning workflows (LoRA adapters updated incrementally rather than full re-trains) that might change the eval maintenance burden over the next 12-18 months.
My current estimate is that for roughly 70% of the use cases I see in consulting, RAG with a good reranker is the better starting point. That number is not based on a formal survey -- it's my impression from the engagements I've run and the failures I've seen. Treat it with appropriate skepticism.
I run a small consulting practice (Optivulnix) where this decision framework comes up on almost every RAG and AI integration engagement. If you're at the RAG-vs-fine-tune decision point and want to compare notes, my calendar is linked from mohakdeepsingh.dev.