The Real Cost of Running Frontier Models at Production Scale

The API pricing page is not your cost. That number -- $15 per million output tokens, $3 per million input tokens, whatever it is this week -- is the floor. The ceiling is what you actually pay after all the overhead that nobody talks about in the getting-started guide.

I've run cost analyses on several LLM systems across client engagements. The pattern that keeps appearing: teams build an initial cost estimate based on the pricing page, then discover six months into production that the real bill is 2x to 3x that estimate. Not because of volume growth -- because of structural waste baked into the system from day one.

This post is the math I've actually run. Specific numbers where I have them. Specific sources of waste, and which optimizations actually move the number versus which ones feel good but don't.

What the pricing page doesn't include

Before getting into specific waste categories, it's worth naming what "API cost" means on the pricing page versus in production.

The pricing page charges you for tokens -- input tokens and output tokens per call. That's it. What it doesn't account for:

Retries (a failed call that still consumed tokens before failing)
Evaluation infrastructure (running evals costs money too -- they're just more API calls)
Prompt engineering overhead (longer, safer prompts mean more input tokens every single call)
Context window waste (retrieved content that isn't relevant still gets billed)
Tool call overhead in agentic systems (each tool result gets stuffed back into the context)

These aren't edge cases. Every production system I've looked at has all of them. The question is how much they cost relative to your "base" call cost.

The client case: 40% of stated cost was waste

A client I worked with had an internal document Q&A system -- standard RAG pipeline, GPT-4o, retrieving from a corpus of internal policies and procedures. They were estimating costs based on: average query length + average answer length + a fixed system prompt, times expected query volume. Clean spreadsheet math.

When I pulled actual token logs from their Azure OpenAI deployment, the real picture looked different.

Their stated cost model assumed approximately 1,400 input tokens per query -- 300 for the system prompt, 800 for retrieved context, 300 for the user question. Output averaged around 400 tokens. That math gave them roughly $0.0048 per query at GPT-4o pricing.

Actual average input tokens: 2,380. Actual average output tokens: 480.

The 1,000-token gap in inputs broke down roughly as follows. About 400 tokens were coming from RAG context that wasn't relevant -- top-k was set to 8 chunks, and typically 3 to 4 of them were semantic noise that scored above the threshold but contributed nothing to the answer. About 300 tokens were from prompt engineering accretion -- over six months of iteration, someone had added clarifications, constraints, output formatting instructions, and a few examples to the system prompt to fix edge cases, and nobody had audited the total token impact. The remaining 300 tokens were tool call returns in a thin agentic layer that had been added later, mostly returning metadata the model rarely used.

At scale, that 70% input token overcount translated directly to a 40% cost overrun versus the original model. The output tokens were less of a factor -- models are generally better about output length than inputs are about precision.

That number -- 40% of their stated cost was structural waste -- is not unusual. I'd say it's roughly typical for systems that have been running and iterating for six months without a cost audit.

Waste category 1: the prompt engineering tax

Every time you add a sentence to your system prompt to handle an edge case, you pay for that sentence on every single call for the rest of the system's life.

This compounds in ways that aren't obvious during development. When you're testing, you're running maybe a few hundred queries. A 200-token addition to the system prompt costs you around $0.0006 per call on GPT-4o input pricing -- basically nothing. At 100,000 queries per month, that same 200-token addition costs $60 per month, every month, forever.

The prompt I looked at most recently -- for a customer support classification system -- had grown to 1,847 input tokens for the system prompt alone. The original version from three months prior was 412 tokens. The additions were all defensible: edge cases caught in production, output format constraints added after a parsing bug, a few-shot example added to fix a classification error. Each individual decision was reasonable. The cumulative cost was not.

The audit question to run: for every sentence in your system prompt, what query does it handle, and what's the frequency of that query type? If a 200-token block handles a query type that represents 2% of volume, you're paying those tokens on the other 98% of calls for no benefit on those calls. Restructure the prompt so the rare-case handling is in a conditional routing path, not in the base system prompt.

I'm not saying this is always worth doing -- sometimes the code complexity of prompt routing is worse than the token cost. But teams rarely even run the calculation.

Waste category 2: context window waste from retrieval

This one is the most reliably expensive waste source I've found in RAG systems.

The standard setup: embed the query, retrieve top-k chunks by cosine similarity, stuff them all into the context window, generate. Top-k is usually set at 5, 8, or 10. The assumption is that more context is better -- if the answer is in there somewhere, the model will find it.

The problem is that cosine similarity is imprecise. At top-k=8, for a typical enterprise knowledge base with a reasonable query distribution, I consistently see 3 to 5 of the retrieved chunks being irrelevant or redundant to the specific query. Those chunks still get billed as input tokens.

Across the systems I've profiled, retrieved context accounts for 35% to 60% of total input token spend. If half of those tokens are waste, you're looking at 17% to 30% of your total input token cost being irrelevant retrieved content.

The fix is not to reduce top-k blindly -- that regresses retrieval quality. The fix is a reranking step between retrieval and generation. A cross-encoder reranker (bge-reranker-large or Cohere Rerank) scores each retrieved chunk against the query directly rather than relying on approximate vector similarity. After reranking, you can cut from top-k=8 to top-3 or top-4 while keeping or improving answer quality. The reranker runs on CPU, costs almost nothing, and the token savings in the LLM call pay for the compute within the first few hours of traffic.

I added bge-reranker-large to one client's pipeline last year. We went from top-8 chunks averaging 1,900 context tokens to top-4 reranked chunks averaging 960 context tokens. Same answer quality on their eval set -- actually marginally better because the lower-quality retrievals weren't introducing noise. Input token cost for the retrieval context: cut roughly in half.

Waste category 3: retry overhead

Model APIs fail. Rate limits, transient errors, occasional timeout. The question is what your retry logic is doing when they fail.

The naive retry pattern: catch any error, sleep for a fixed interval, retry with the same request. This is fine for correctness. It's not fine for cost accounting.

The problem is that a request can fail after the model has started processing -- meaning you've been billed for input tokens even though you received an error response. Whether this happens depends on where in the processing the error occurs and how the provider implements billing. For streaming calls especially, you can receive partial output, hit a rate limit, and be billed for both the input tokens and the partial output tokens before the error.

I've seen retry overhead account for 3% to 8% of total API spend in high-volume systems with aggressive retry logic. That's not huge, but it's completely avoidable.

The things that actually reduce retry cost: exponential backoff with jitter (bunched retries on a fixed interval create thundering herd patterns that cause more failures), circuit breakers at the application level so a degraded provider stops receiving traffic rather than accruing failed-call charges, and tracking retry rate as a metric so you can see when something upstream is causing an unusual number of retries.

Waste category 4: eval costs don't disappear

This one gets ignored in cost models more than any other.

Every eval run is a batch of API calls. If you're running evals in CI on every pull request, and your eval suite has 500 examples at an average of 1,200 input tokens plus 300 output tokens, that's 600,000 input tokens and 150,000 output tokens per run. At GPT-4o pricing, roughly $9 per CI run. At 20 pull requests per month, that's $180 per month just in eval infrastructure.

That's before you account for online evaluations -- using a separate LLM call to judge the quality of the primary model's output, which is a common pattern in agentic systems where ground truth isn't always available. Judge-model calls are typically cheaper (you route them to a smaller model), but they multiply with every production query that gets evaluated.

I haven't seen a team accurately budget eval costs before they build the eval infrastructure. It's always a discovery after the fact.

The levers here: use cheaper models for eval where the judge task doesn't require frontier-level reasoning (GPT-4o-mini or Haiku for binary pass/fail judgments, frontier models only for nuanced quality scoring), run full evals on merge to main rather than on every PR, and cache eval results for prompts that haven't changed.

What actually moves the needle

There are a lot of optimization tactics in this space. In my experience, three levers account for the majority of achievable cost reduction. The rest are real but marginal.

Model routing. The highest-leverage optimization by far. The idea is simple: not all queries need a frontier model. A query asking "what is our vacation policy?" does not need GPT-4o. A query asking the model to synthesize three conflicting policy documents and reason about edge cases might. Routing queries to the cheapest model that can handle them -- typically gpt-4o-mini, Claude Haiku, or a self-hosted Mistral-7B for simple tasks -- can cut your average cost per query by 60% to 80% without meaningfully affecting answer quality on the routed queries.

The implementation question is how you decide which model to route to. The approaches I've used: a classifier trained on your own query data to predict complexity, a rule-based router on query features (length, presence of comparison words, query type classification), or letting a cheap model attempt the query and escalating to the frontier model when the cheap model expresses low confidence. The last pattern has latency overhead; the first two don't.

I implemented a two-tier router for a client's support classification system -- Haiku for queries that matched known patterns, GPT-4o for novel or ambiguous queries. About 73% of queries routed to Haiku. End-to-end cost dropped by roughly 58% against a GPT-4o-only baseline. Answer quality on the eval set dropped by 2 percentage points, which the client accepted.

Prompt caching. Anthropic, OpenAI, and Google all support some form of prompt caching -- the model provider caches the KV state for a static prefix of the prompt so you don't re-process it on every call. The savings for cache hits are significant: Anthropic charges 10% of input token price for cache reads. OpenAI's prompt caching is automatic for prompts over 1,024 tokens and gives a 50% discount on cached tokens.

The catch: caching only helps if your prompts have a long, stable prefix. A 1,500-token system prompt that stays constant across calls is an excellent cache candidate. A system prompt that includes dynamic content at the top -- user name, current date, session context -- breaks caching because the stable portion starts further down.

Restructuring prompts so static content comes first and dynamic content comes last is often the highest-ROI prompt engineering change you can make, purely on cost grounds. I've seen this alone reduce effective input token costs by 35% to 45% for systems with large static system prompts and high repeat-call volume.

Context trimming. The reranking approach I described for RAG waste is a specific instance of a general principle: only put tokens in the context window that are actually doing work. For agentic systems, this means auditing tool call returns and truncating or summarizing tool output that isn't needed for the next reasoning step. For conversational systems, it means summarizing old conversation turns rather than passing the full history forward indefinitely. For RAG systems, it means reranking and pruning retrieved chunks before generation.

The compounding effect here matters. If you reduce input tokens per call by 30% through context trimming, and you run model routing that sends 70% of calls to a model that's 10x cheaper, your effective cost per query can drop by 90% relative to naive frontier-model-with-full-context. These levers stack multiplicatively.

The measurement problem

None of these optimizations are actionable without measurement at the call level.

The minimum you need: logging that captures input token count, output token count, model used, latency, and whether the call succeeded or was retried -- per call, not just aggregated. Aggregated metrics hide the distributions that matter. A system where 80% of calls cost $0.001 and 5% of calls cost $0.50 looks fine in the average but has a cost control problem in the tail.

If you're on Azure OpenAI, the built-in metrics are decent for totals but don't give you per-call token breakdowns easily -- you need to log from the application layer. If you're on the OpenAI API directly, the response object includes usage.prompt_tokens and usage.completion_tokens on every call, and there's no excuse for not logging them.

I build a cost-per-query metric as the primary cost KPI for any LLM system I set up. Not cost per month -- cost per successful query. That's the number that tells you whether your system is economically viable as volume scales, and it's the number that immediately surfaces when an optimization actually worked versus when it just moved cost around.

What I'm uncertain about

A few places where my experience is limited and I'd push back on my own conclusions:

The reranker approach (specifically the move from top-k=8 to top-4 after reranking) worked clearly in the systems I've tested. I'm less confident it generalizes to domains with highly ambiguous queries where redundancy in retrieved chunks is actually useful for coverage. I've seen counterexamples where aggressive top-k reduction hurt answer quality on adversarial or ambiguous queries even when average quality held up.

The 40% waste figure from the client case is a real number, but it's a single client. I've seen similar patterns in other systems, but I haven't done enough systematic measurement across a wide enough sample to claim that 40% is a representative baseline. It might be pessimistic for well-engineered systems. It might be optimistic for systems that have been iterating without cost discipline for longer.

Model routing effectiveness depends heavily on having a good quality eval for the routed tasks. If you route 70% of queries to a cheap model and your eval doesn't cover the long tail of those queries well, you can be shipping degraded quality without knowing it. The cost savings are real; the quality risk is also real and harder to measure.

The routing and context optimization patterns here are directly relevant to what I'm building at Pipeshift -- specifically the cost observability layer. I'm the founder, so take the product reference with the appropriate grain of salt. The cost math in this post is client work, not Pipeshift-specific.