I Cut a Client's LLM Bill by $8k/Month With a Routing Classifier. Here Is What the Benchmark Actually Looked Like.

The pitch for model routing is simple: not every task needs GPT-4 or Claude Sonnet. A lot of what enterprise apps ask LLMs to do -- extract a field, classify a support ticket, summarize a paragraph -- can be handled by a model costing 10-20x less per token. Route the cheap work to cheap models, reserve the frontier models for the work that actually needs them. Spend drops.

That's the pitch. The reality is more nuanced, and I want to get into the specific numbers from one engagement where I implemented this -- because the parts that went wrong were more instructive than the parts that went right.

The client was a SaaS company running a document intelligence product. I'm not going to name them. Their LLM stack, pre-routing, was spending roughly $11,400/month across two API providers. By the time I was done, that was down to $3,200/month. The $8,200 delta is real, but the path to it was not the clean A/B story most benchmark write-ups would have you believe.

The baseline, before anything changed

Their production workload broke down roughly like this:

Document field extraction: ~62% of total token volume -- pulling structured fields (dates, entity names, amounts) from uploaded PDFs and emails
Intent/category classification: ~18% -- labeling incoming documents with one of ~40 business categories for routing to downstream workflows
Passage summarization: ~11% -- generating short summaries of document sections for a sidebar preview feature
Structured report generation: ~6% -- assembling multi-page compliance reports from extracted data, with specific format requirements
Open-ended reasoning and analysis: ~3% -- edge-case document handling, unusual schemas, flagged exception workflows

Everything was going to Claude 3.5 Sonnet. The extraction work, the classification work, the summarization -- all of it hitting a frontier model because that was what the team had shipped with initially, and nobody had revisited it. The cost was predictable but nobody had instrumented it at the task level to see the breakdown.

The average cost per API call was $0.019 (blended across input and output tokens at their call volumes and prompt structures). At their request rate of ~600k calls/month, that's where the $11,400 number comes from.

The routing architecture

The approach I implemented: a lightweight classifier that runs before the main LLM call and determines which model tier to use for the actual task.

The classifier itself was a fine-tuned version of Claude Haiku (claude-haiku-3-5 at the time -- I'll use Haiku as a shorthand). The routing decision is binary-ish: "simple" tasks (extraction, classification, summarization) go to Haiku for the main call, "complex" tasks (generation, reasoning, multi-step analysis) stay on Sonnet.

The architecture in production:

request
  |
routing_classifier (Haiku)
  |-- simple --> haiku_executor
  |-- complex --> sonnet_executor
  |
response

The routing classifier call costs ~$0.0002 per request at typical prompt lengths. That's the overhead you're paying on every call to potentially save money on most of them. The math only works if the classifier is accurate enough and if the cost differential between the two paths is large enough.

The cost differential in this case: Haiku calls for extraction tasks were averaging $0.0011 per call vs. Sonnet's $0.019 -- roughly a 17x difference. That's a large enough gap that even an imperfect classifier produces meaningful savings.

The classifier accuracy question

Before I ran this in production, I ran it as an offline evaluation against a labeled sample of 2,400 historical requests -- a mix from all five task categories, labeled by task type.

The classifier's accuracy on held-out data: 93.4% at the binary simple/complex split.

That sounds good. Let me tell you why it is and isn't.

The 6.6% error rate breaks into two failure modes with very different cost signatures:

False simple (routing a complex task to Haiku): Haiku attempts a task it shouldn't, either producing wrong output or requiring a fallback retry on Sonnet. Net cost: one Haiku call ($0.0011) + one Sonnet retry ($0.019) = ~$0.020. That's more expensive than just sending it to Sonnet directly. You've paid a tax.

False complex (routing a simple task to Sonnet): You missed a savings opportunity but didn't produce a wrong result. Cost is the baseline $0.019. No regression, just inefficiency.

The asymmetry matters for where you set your classifier threshold. If you tune the classifier conservatively -- biasing toward "complex" when uncertain -- you reduce false simples at the cost of more false complexes. You leave savings on the table but you don't degrade output quality or add retry costs.

I ran the classifier in production with a confidence threshold of 0.78. Below that confidence, requests default to Sonnet regardless of the predicted label. About 8% of requests fell below that threshold in the first two weeks, which meant 8% of requests were unconditionally on Sonnet. That's expected behavior, not a problem.

The savings-viable accuracy number, given this cost structure: approximately 89% on the simple/complex split, assuming a conservative threshold and no false-simple retries on the tail below threshold. Below 89%, the retry tax and overhead cost of the classifier start eroding the savings significantly. The 93.4% figure gave me enough headroom to be confident before going live.

What tasks were harder to route than expected

Three categories gave the classifier more trouble than I anticipated:

Short extraction requests on unusual document structures. The client's customers occasionally uploaded documents with non-standard layouts -- field labels in non-obvious positions, tables embedded as images (which triggered OCR), or multi-language documents. The classifier had been trained on standard structure and was returning low confidence on these edge cases, which meant they fell into the default-to-Sonnet bucket at a higher rate than I expected. About 40% of sub-threshold requests turned out to be structural edge cases rather than genuinely complex tasks. Fixing this meant adding a document-structure signal to the classifier input -- whether the upstream OCR/parse step had flagged any anomalies -- which bumped the below-threshold rate from 8% down to 5.1%.

Summarization tasks with length constraints. The routing classifier saw "summarize this section" and reliably predicted "simple." But some summaries had tight character-count or sentence-count constraints that were specified in the system prompt and not visible to the classifier as a complexity signal. A "summarize in 3 sentences matching legal register" is a different task from "give me the gist of this paragraph," but both looked like summarization to the classifier. Haiku could handle the former adequately maybe 70% of the time. The other 30% produced outputs that were technically within the word count but lost nuance that reviewers flagged. I added a regex check for length-constraint language in system prompts as a routing override -- if the summary has format constraints, it defaults to Sonnet. Crude, but it resolved the problem.

Single-sentence classification requests with domain-specific terms. The 40-category taxonomy included several categories with overlapping business concepts -- "vendor invoice," "vendor credit memo," and "vendor payment confirmation" are distinct categories that share most of their vocabulary. Short documents landed in these categories at higher ambiguity. The classifier was routing a lot of these to "simple" and Haiku was getting the category wrong at a rate that was unacceptable for a downstream workflow router. These ended up as a named exception: anything predicted to be a vendor financial document below a higher confidence threshold (0.88, not 0.78) routes to Sonnet. The exception handling added about 3 hours of iteration time but prevented a regression that would have been obvious to the end-users.

The benchmark results

I ran the new architecture in shadow mode for 10 days before cutting over -- routing classifier running in parallel, main LLM calls still all going to Sonnet, recording what the routing decision would have been. Then I cut over.

Before (60-day average): $11,420/month

Post-routing (first 60 days of production): $3,185/month

Monthly delta: -$8,235

The breakdown of where savings came from:

| Task category | Pre-routing cost | Post-routing cost | Savings | |---|---|---|---| | Field extraction | ~$7,080/mo | ~$490/mo | $6,590 | | Intent classification | ~$2,055/mo | ~$410/mo | $1,645 | | Passage summarization | ~$1,255/mo | ~$620/mo | $635 | | Report generation | ~$685/mo | ~$670/mo | $15 | | Open-ended reasoning | ~$345/mo | ~$345/mo | $0 | | Routing classifier overhead | -- | ~$250/mo | -$250 | | Total | $11,420/mo | $2,585/mo + $600 retry/overhead | ~$8,235 |

Extraction drove the majority of savings because it was the largest volume category and Haiku handles structured extraction from well-formatted documents extremely well. The accuracy comparison between Haiku and Sonnet on clean extraction tasks -- where the document structure is standard and the fields are unambiguous -- was essentially indistinguishable in the eval I ran. I had expected to see some degradation. There wasn't measurable degradation on clean-structure extraction. Haiku is fine for that task.

Report generation barely moved because the vast majority of report generation calls were correctly labeled "complex" and stayed on Sonnet. That was expected. The classifier's job there was to not misroute them, and it didn't.

Where I'd be cautious about generalizing this

Two caveats worth stating explicitly:

First, the cost differential in this case was large -- 17x between Haiku and Sonnet at their request profile. If you're comparing two models that are 3-4x apart, the math tightens considerably. The classifier overhead and retry tax consume a larger fraction of the savings, and you need higher classifier accuracy to make the economics work.

Second, this was an extraction-heavy workload. Extraction and classification are tasks where smaller models have caught up significantly with frontier models on structured inputs. If your dominant task type is something where the quality gap between model tiers is meaningful -- long-form generation, multi-hop reasoning, tasks requiring precise instruction following -- the routing benefit shrinks because fewer tasks safely move to the cheaper tier.

I might be wrong about the generalizability. But the 90%+ extraction volume was doing a lot of heavy lifting here, and I'm skeptical of routing benchmarks that don't break down savings by task category.

The implementation details that actually mattered

One thing I'd do differently: I built the routing classifier as a separate API call at the start of the request pipeline. That was fine for this client's latency profile (async document processing, no user-facing real-time requirement). For synchronous use cases, a separate pre-call to even a Haiku-class model adds 200-600ms of latency that users will feel. The cleaner architecture for latency-sensitive cases is either a local classifier (a small fine-tuned model running in-process with sub-10ms inference) or rule-based routing augmented with the LLM classifier only for ambiguous cases.

The other thing that mattered: the routing decision needs to be logged with the outcome. Every request should record: what the classifier predicted, at what confidence, which model was used, and whether the output was flagged as incorrect by any downstream validation step. Without that logging, you can't close the feedback loop and improve the classifier over time. I set up the logging from day one and used it to identify the summarization length-constraint issue in week three.

The total implementation time was roughly 12 days across three phases: offline eval and classifier tuning (5 days), shadow mode (10 calendar days, about 2 engineer-days of active work), and production cutover plus iteration on the exception rules (3 days). That's against an $8,200/month savings. The ROI calculus is not complicated.

If you're looking at LLM infrastructure costs and want a task-level breakdown of where the spend is and whether routing makes sense for your workload -- that's the kind of audit I do. Get in touch if you want to talk through the specifics.