The demo looks like this: a user types a question in natural language, a friendly response appears citing specific records, and the founder calls it "AI-powered." Investors nod. Engineers ship it.
Then I get brought in to look at the infrastructure and the bill is $800/month on a product with 200 active users. The team is burning tokens on GPT-4o for every query. The underlying operation, stripped of its language interface, is: find records matching a filter, sort by recency, format a summary. That is not intelligence. That is retrieval with a chatbot skin.
I am not being dismissive of what they built. The UX is genuinely better than a bare search box and a results grid. But calling it "AI" when the hard problem is a BM25 index and a Postgres full-text query is not just imprecise -- it causes the team to make wrong architectural choices, pay 100x more per query than they need to, and feel confused when they finally hit a problem that requires actual language model reasoning.
Here is the distinction I keep having to draw on consulting calls, and why it matters more than taxonomy.
The actual spectrum
Most "AI features" in B2B SaaS I have audited fall into one of three categories:
Category 1: Retrieval with a language interface. User asks "show me invoices overdue by more than 30 days for customers in Germany." The correct implementation is a structured query generated from a constrained input, run against an indexed database, returned as a formatted list. The LLM's only legitimate role here is parsing the natural language input into structured parameters -- and even that can be replaced with a well-designed UI or a lightweight intent classifier. The generation step adds almost no value beyond formatting, and formatting is not worth $0.01/query at scale.
Category 2: Retrieval plus lightweight synthesis. User asks "summarize our Q2 performance vs plan across all product lines." The data lives in a database. Retrieval fetches the relevant rows. The synthesis step -- turning a table of numbers into coherent prose -- is where an LLM earns its place, but the cost driver is still the retrieval, and a small model (gemini-1.5-flash, gpt-4o-mini, claude-haiku) handles the synthesis competently at roughly one-tenth the cost of the frontier model the team reached for first.
Category 3: Generation over novel combinations. User asks "given our cost structure and these three market scenarios, draft the pricing memo our board should see." There is no row in a database that answers this. The model needs to reason across sources, hold multiple scenarios in tension, and generate coherent arguments that did not exist before the prompt ran. This is where GPT-4o or Claude Sonnet earns its rate.
Most features I see deployed as category 3 are actually category 1. The team built the Cadillac of category 1 and is paying category 3 prices for it.
What the cost difference actually looks like
One client I worked with earlier this year -- a B2B supply chain SaaS, anonymized -- had a "natural language querying" feature that let users ask questions about their shipment data. They were routing every query through claude-3-5-sonnet at approximately $0.015 per query (input plus output tokens, averaged across their queries). At 50,000 queries/month that is $750/month for that feature alone, growing linearly with usage.
I audited 500 production queries. 73% of them were filter-and-retrieve: "show shipments delayed more than 3 days," "list suppliers with open invoices," "which warehouses are below 60% capacity." These queries have no ambiguity that requires language model reasoning. The natural language just needs to be parsed into a WHERE clause.
The remaining 27% were genuine synthesis tasks: cross-referencing delay patterns against carrier performance, explaining anomalies, drafting exception reports. Those legitimately benefit from a capable model.
The fix: an intent classifier (a fine-tuned distilbert model, inference cost well under $0.0001/query) routes the 73% to a structured query layer. The 27% goes to the LLM. Frontier model query volume drops 73%. Monthly cost for that feature: approximately $200. Same user experience for the filter queries, which were the majority, because a database lookup is faster than an LLM call anyway.
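A minimal sketch of that routing layer, under stated assumptions. The keyword classifier below is only a stand-in for the fine-tuned model, and every name here (route_query, keyword_classifier, the cue lists, the 0.8 confidence threshold) is hypothetical. The point is the shape: classify, send confident filter queries to the structured layer, and fall back to the LLM on low confidence or on a structured-layer failure.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

STRUCTURED = "structured"
SYNTHESIS = "synthesis"

# Hypothetical cue words; a real deployment would use a trained classifier.
FILTER_CUES = {"show", "list", "which", "count"}
SYNTHESIS_CUES = {"why", "explain", "draft", "summarize", "compare"}


@dataclass
class Routed:
    route: str
    answer: str


def keyword_classifier(query: str) -> Tuple[str, float]:
    """Stand-in classifier: returns (label, confidence)."""
    words = set(query.lower().split())
    has_filter = bool(words & FILTER_CUES)
    has_synthesis = bool(words & SYNTHESIS_CUES)
    if has_filter and not has_synthesis:
        return STRUCTURED, 0.9
    if has_synthesis and not has_filter:
        return SYNTHESIS, 0.9
    # Mixed or missing cues: low confidence, so the router prefers the LLM.
    return SYNTHESIS, 0.5


def route_query(
    query: str,
    classify: Callable[[str], Tuple[str, float]],
    run_structured: Callable[[str], str],
    run_llm: Callable[[str], str],
    min_confidence: float = 0.8,
) -> Routed:
    label, confidence = classify(query)
    if label == STRUCTURED and confidence >= min_confidence:
        try:
            return Routed(STRUCTURED, run_structured(query))
        except ValueError:
            # The structured layer could not handle the query: fall back.
            pass
    return Routed(SYNTHESIS, run_llm(query))
```

The fallback is deliberately one-directional: anything uncertain goes to the LLM, so a classifier mistake costs money rather than producing a wrong or empty answer.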
I might be underestimating the classification error rate here -- a bad classification that sends a genuine reasoning query to the structured layer degrades the experience in a visible way. But the threshold question is: what error rate makes the cheaper architecture worth it? Even if 2% of queries are misclassified and end up on the LLM anyway, you are still saving roughly 70% of costs at the price of one fallback path.
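The arithmetic behind those numbers can be checked with a few lines. This is a rough model, not the client's billing: it assumes the classifier costs the full $0.0001 per query (the text only says "well under"), and it treats the misclassification rate as extra LLM calls on top of the 27%, which is the pessimistic case for cost. The function and parameter names are made up for illustration.

```python
def monthly_cost(
    queries: int,
    llm_cost_per_query: float,
    classifier_cost_per_query: float,
    llm_share: float,
    extra_llm_rate: float = 0.0,
) -> float:
    """Monthly cost when every query is classified and only a share hits the LLM.

    llm_share: fraction of queries that genuinely need the LLM.
    extra_llm_rate: fraction of all queries that reach the LLM only because
        of a misclassification or fallback (assumed, pessimistic).
    """
    classifier_total = queries * classifier_cost_per_query
    llm_total = queries * (llm_share + extra_llm_rate) * llm_cost_per_query
    return classifier_total + llm_total


QUERIES = 50_000
LLM_COST = 0.015

baseline = QUERIES * LLM_COST  # every query goes to the frontier model
routed = monthly_cost(QUERIES, LLM_COST, 0.0001, llm_share=0.27)
routed_with_errors = monthly_cost(
    QUERIES, LLM_COST, 0.0001, llm_share=0.27, extra_llm_rate=0.02
)

for label, cost in [
    ("baseline", baseline),
    ("routed", routed),
    ("routed, 2% extra LLM calls", routed_with_errors),
]:
    print(f"{label}: ${cost:,.2f}/month, saving {1 - cost / baseline:.1%}")
```

Under these assumptions the routed setup comes to a little over $200/month, and the 2% case to about $222/month, which is still a saving of about 70% against the $750 baseline.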
The architecture argument, not just the cost argument
Even setting aside cost, reaching for an LLM to answer structured questions is the wrong architecture because it introduces a failure mode you do not have with deterministic queries: the model's answer can be plausible but wrong.
I have seen this in production. A customer asks "which of our accounts is most overdue?" The LLM has retrieval context from a vector search over account records. The vector search returns the top-5 semantically similar records, but "most overdue" requires a MAX aggregation over a numeric field, not semantic similarity. The model answers confidently with the account that appeared most prominently in the retrieved chunks -- which might not be the one with the highest overdue balance, and might not be in the retrieved set at all.
A SQL query does not hallucinate the answer to MAX(balance). An LLM can. For questions with objectively correct answers that a database can compute deterministically, the LLM introduces a new class of error that the underlying data system does not have.
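A toy illustration of that failure mode, using an in-memory SQLite table. The account names, the days_overdue column, and the relevance scores are all invented; relevance stands in for whatever similarity score a vector search would assign. The sketch only shows that a top-k context window can exclude the row a MAX aggregation would pick.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT, days_overdue INTEGER, relevance REAL)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [
        ("Acme", 12, 0.91),
        ("Borealis", 45, 0.88),
        ("Cobalt", 30, 0.86),
        ("Dynamo", 7, 0.84),
        ("Everest", 60, 0.83),
        ("Fjord", 95, 0.41),  # most overdue, but scores low on similarity
    ],
)


def most_overdue(conn: sqlite3.Connection):
    """Deterministic answer: the database computes the maximum itself."""
    return conn.execute(
        "SELECT name, days_overdue FROM accounts "
        "ORDER BY days_overdue DESC LIMIT 1"
    ).fetchone()


def top_k_context(conn: sqlite3.Connection, k: int = 5):
    """Simulated retrieval: the k rows with the highest similarity score."""
    return conn.execute(
        "SELECT name, days_overdue FROM accounts "
        "ORDER BY relevance DESC LIMIT ?",
        (k,),
    ).fetchall()


context = top_k_context(conn)
best_in_context = max(context, key=lambda row: row[1])

print("true answer:       ", most_overdue(conn))
print("best the LLM sees: ", best_in_context)
```

Even a model that reasons perfectly over its context cannot name an account it was never shown, so the error here comes from the retrieval step and no amount of prompt tuning fixes it.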
This is the specific failure mode Hamel Husain has written about when he distinguishes "retrieval tasks" from "generation tasks" in RAG evaluation -- the quality bar you need to apply is completely different, and teams often do not realize they are evaluating two different problem types with the same metrics.
When a real LLM is actually necessary
I do not want this to read as LLM skepticism. There are specific capabilities that require a language model and cannot be replicated cheaply or at all with retrieval:
Generation of novel content. Drafting, rewriting, translating, summarizing across heterogeneous sources. The model is not retrieving an answer; it is constructing one that did not exist before.
Reasoning over novel combinations. "Our churn rate is up 12% in Q3 while our NPS is flat -- what hypotheses should we investigate?" There is no row in the database for this. The model needs to reason from structure, not retrieve from index.
Code generation and transformation. Even simple code completion is genuinely a generation task. The model is not retrieving a code block; it is synthesizing one that fits the local context.
Understanding ambiguous or underspecified intent. When a user's query is genuinely ambiguous and the right response is to ask a clarifying question or produce multiple interpretations, a language model is the right tool. A classifier will route it incorrectly; a structured query layer will fail.
Contextual explanation of complex outputs. "Why did this credit application get rejected?" with access to the decision factors is a synthesis task that benefits from a model that understands causal language.
The common thread: the LLM adds value when the task requires constructing something that is not directly retrievable. The moment the output is directly computable from stored data, the LLM is overhead.
Why teams reach for the wrong tool
Part of this is that the LLM path is faster to prototype. You stuff some records in a prompt, ask the model a question, get a plausible answer, and ship it. The structured query path requires thinking about schema, building a query layer, handling edge cases. The LLM papers over all of that -- until you are in production and the bill arrives.
Part of it is that "AI-powered" is a positioning choice, not a technical description. This is fine, but teams should be honest with themselves that they are using "AI" as a label for what is essentially a better search UI.
And part of it is that the people making the architectural choice are often not the people who will be debugging a hallucinated answer six months later or explaining to the CFO why the infrastructure bill scaled with query volume.
I have been building AI features for two years across clients in supply chain, fintech, and enterprise software. The pattern holds: the teams that are thoughtful about this distinction early end up with cheaper, more reliable, and more maintainable systems. The ones that default to LLM-for-everything spend the first six months happy with the product and the next six months unhappy with the costs and the edge case failures.
The practical test
When I evaluate a proposed AI feature now, I ask: if the user's question were a database query, what would that query be? If I can write it in 30 seconds, the feature is a retrieval problem. If I cannot -- if the answer genuinely requires synthesizing across inputs that have no fixed schema relationship, or generating content that does not exist in the data -- then an LLM is doing real work.
Most of the time I can write the query in 30 seconds.
The LLM interface is still worth having for a lot of those features. Natural language is a better UX than a form builder, even when the underlying operation is a filter. But you should use a lightweight model for the NL-to-query parsing step, run the deterministic query, and format the result. You do not need a frontier model to turn "show me overdue invoices in Germany" into WHERE country = 'DE' AND days_overdue > 30.
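As a sketch of the deterministic half of that pipeline: assume the lightweight model has already turned the sentence into a small dictionary of parameters, and the application turns that dictionary into a parameterized query. The invoices table, the parameter names, and the allowlist below are hypothetical; the model never writes SQL itself, it only fills in values for filters the application already knows about.

```python
import sqlite3
from typing import Any, Dict, List, Tuple

# Hypothetical allowlist: parameter name -> SQL fragment with a placeholder.
ALLOWED_FILTERS = {
    "country": "country = ?",
    "min_days_overdue": "days_overdue > ?",
}


def build_invoice_query(params: Dict[str, Any]) -> Tuple[str, List[Any]]:
    """Turn parsed parameters into a parameterized SQL query."""
    clauses, values = [], []
    for key, value in params.items():
        if key not in ALLOWED_FILTERS:
            # Unknown filter: refuse rather than guess.
            raise ValueError(f"unsupported filter: {key}")
        clauses.append(ALLOWED_FILTERS[key])
        values.append(value)
    sql = "SELECT id, country, days_overdue FROM invoices"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY days_overdue DESC"
    return sql, values


# Assumed parser output for "show me overdue invoices in Germany".
parsed = {"country": "DE", "min_days_overdue": 30}
sql, values = build_invoice_query(parsed)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (id INTEGER, country TEXT, days_overdue INTEGER)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [(1, "DE", 45), (2, "DE", 10), (3, "FR", 90), (4, "DE", 31)],
)
rows = conn.execute(sql, values).fetchall()
print(sql)
print(rows)
```

Keeping the model's output to values for known filters means a bad parse produces a rejected request, not an arbitrary query against the database.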
Get the architecture right and you can afford to use the frontier model where it actually matters.
I run into this distinction constantly in consulting work through Optivulnix. If you are designing an AI feature and want a second opinion on whether the architecture matches the actual problem, I am happy to look at it.