The Calculation I Run Before Sending a Request to OpenAI

I did not set out to reduce my OpenAI usage. I set out to understand what I was actually paying for, and the answer surprised me enough that three workload categories are now running on something else.

This is not an anti-OpenAI post. GPT-4o and o3 are still the right answer for specific things and I use them. But the framing of "frontier API vs self-hosted" as a binary choice is wrong, and I see engineering teams default to frontier APIs for workloads where they are paying a 10-20x cost premium for capabilities they are not using.

Here is the calculation I actually run, and what I concluded.

The four-variable decision

Every time I evaluate a new workload, I run it through the same four questions:

1. What is the cost per query at my projected volume?
2. What is my latency requirement -- interactive or batch?
3. Does this data leave a trust boundary I care about?
4. Do I need to fine-tune on proprietary data?

If the answer to questions 3 or 4 is yes, the decision is mostly made before I look at cost. If the answer to both is no, cost and latency dominate, and that is where the trade-off is interesting.
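The ordering of those four questions can be written down as a small function. This is a minimal sketch: the `Workload` fields and the two return labels are hypothetical names chosen for illustration, and "mostly made" is simplified here into a hard gate.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    leaves_trust_boundary: bool  # question 3
    needs_fine_tuning: bool      # question 4
    interactive: bool            # question 2
    cost_per_query_usd: float    # question 1


def first_pass(w: Workload) -> str:
    """Apply the gate from questions 3 and 4 before looking at cost."""
    if w.leaves_trust_boundary or w.needs_fine_tuning:
        return "decided by privacy or fine-tuning"
    return "compare on cost and latency"


# Illustrative values only.
qa = Workload("internal docs Q&A", False, False, True, 0.03)
print(first_pass(qa))
```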

The workloads I moved off OpenAI

Document classification and extraction

I have a client (an enterprise logistics company -- anonymized) whose pipeline classifies incoming freight documents into about 80 categories: bill of lading, commercial invoice, customs declaration, dangerous goods declaration, and so on. Each document gets a classification label and a structured extraction of a fixed field set. The pipeline processes around 40,000 documents per month, which sounds large but is not enormous by LLM standards.

Running this on GPT-4o at $5/million input tokens, with an average document size of around 800 tokens plus prompt overhead of ~400 tokens, came out to roughly $480/month just on input tokens, plus output tokens for the extractions. Actual spend was landing around $650-700/month.
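The input-token arithmetic is worth writing out. The sketch below uses the figures above; `calls_per_doc` is a hypothetical parameter added for illustration, since the number of model calls per document is not stated. With one call per document the formula gives about $240/month, and the $480 figure matches two calls per document (for example, separate classification and extraction passes, which is an assumption).

```python
def monthly_input_cost(docs_per_month, tokens_per_doc, usd_per_million_tokens,
                       calls_per_doc=1):
    """Input-token cost only; output tokens are billed separately."""
    total_tokens = docs_per_month * tokens_per_doc * calls_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens


tokens_per_doc = 800 + 400  # document plus prompt overhead
one_pass = monthly_input_cost(40_000, tokens_per_doc, 5.00)
two_pass = monthly_input_cost(40_000, tokens_per_doc, 5.00, calls_per_doc=2)
print(f"one call per doc:  ${one_pass:.0f}/month")
print(f"two calls per doc: ${two_pass:.0f}/month")
```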

I switched this to Llama-3.1-70B-Instruct via Together AI at $0.88/million tokens (as of Q1 2026). Same pipeline, same prompt structure. Monthly spend dropped to around $120. Classification accuracy on the 80-label taxonomy dropped by about 2 percentage points on ambiguous edge cases -- documents that fall into two plausible categories. Accepting that 2-point gap saves roughly $530/month, so the workload stayed on Together AI, with a human review queue for the 3-4% of documents that score below a confidence threshold.

The point is not that Together AI is always better. The point is that I measured the accuracy gap and priced it. At 98% accuracy on unambiguous cases, the economics were clear.
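The confidence-threshold routing described above can be sketched in a few lines. The threshold value, the `Prediction` fields, and how the confidence score is produced are all assumptions here; the pipeline's actual scoring method is not described.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    doc_id: str
    label: str
    confidence: float  # assumed to be a score in [0, 1]


def route(predictions, threshold=0.80):
    """Split predictions into auto-accepted and human-review lists."""
    auto, review = [], []
    for p in predictions:
        (auto if p.confidence >= threshold else review).append(p)
    return auto, review


# Made-up examples for illustration.
batch = [
    Prediction("doc-1", "bill_of_lading", 0.97),
    Prediction("doc-2", "commercial_invoice", 0.62),
    Prediction("doc-3", "customs_declaration", 0.91),
]
auto, review = route(batch)
print(len(auto), "auto-accepted;", len(review), "sent to review")
```

In practice the threshold would be tuned on a labeled set so that the review queue stays near the share of documents the team can actually review.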

RAG-grounded Q&A over internal documentation

The second workload is a retrieval-augmented Q&A system over internal technical documentation. The corpus is fixed, the queries are structured, and the task is mostly: find the right retrieved context, synthesize a coherent answer, cite the source sections. It is not a reasoning task. It is a recall-and-synthesis task.

For this type of workload, the frontier model's reasoning depth is almost entirely wasted. Mistral-7B-Instruct v0.3, self-hosted on a single A10G instance on OCI, handles it well enough that the output is indistinguishable from GPT-3.5-turbo output on 90%+ of queries. The A10G instance costs around $1.20/hour on OCI. At the query volume this client runs (around 8,000-10,000 queries per month, with most traffic clustered in business hours), the inference cost is well under $100/month equivalent. OpenAI spend for the same volume was around $280/month.
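How the "$100/month equivalent" figure comes out depends on how many GPU-hours are attributed to this workload. A quick sketch, assuming plain hourly billing at the $1.20 rate above; the schedules are illustrative assumptions, not the client's actual uptime.

```python
HOURLY_RATE_USD = 1.20  # A10G rate quoted above


def monthly_gpu_cost(hours_per_day, days_per_month, hourly_rate=HOURLY_RATE_USD):
    """Cost of keeping one instance up on a fixed schedule."""
    return hours_per_day * days_per_month * hourly_rate


def hours_for_budget(budget_usd, hourly_rate=HOURLY_RATE_USD):
    """GPU-hours a monthly budget buys at the given rate."""
    return budget_usd / hourly_rate


print(f"always on (24h x 30d):     ${monthly_gpu_cost(24, 30):.0f}")
print(f"business hours (9h x 22d): ${monthly_gpu_cost(9, 22):.0f}")
print(f"hours covered by $100:     {hours_for_budget(100):.0f}")
```

At this rate $100 covers roughly 83 GPU-hours, so a figure under $100/month implies the instance is either stopped outside active use or shared with other workloads and only partly attributed to this one.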

The infrastructure overhead: one GPU instance, a quantized GGUF model loaded via llama.cpp, a thin FastAPI wrapper, and Nginx in front of it. Setup time was about a day. Ongoing maintenance is low because the corpus is static and the model does not update. I would not run this setup for a team that does not have someone comfortable managing GPU instances. For a team that does, it is straightforward.

Structured data generation for synthetic training sets

This is the most niche use case but worth naming. I generate synthetic training data for fine-tuning other models -- specifically, question-answer pairs derived from proprietary documentation. The data cannot leave the client's environment. That constraint alone removes frontier APIs from the option set regardless of cost.

For this workload, I run Mixtral-8x7B-Instruct (also via llama.cpp, same OCI instance) because its instruction-following quality on structured JSON output is better than Mistral-7B's at this task. The privacy constraint decided the model class; the specific model choice was based on output quality benchmarks I ran against the target format.

The workloads that stayed on frontier APIs

Anything requiring multi-step reasoning or planning. I build agentic pipelines through Pipeshift (I am the founder, relevant to disclose here) where the core loop involves a model deciding which tools to call, in what order, and how to handle partial results. Open-source 7B and 13B models fail on this. Llama-70B is getting better but still misses on complex multi-step chains. o3 and GPT-4o handle these reliably. The cost per successful agentic run is high but justified because the alternative is failed runs that require human intervention.

Code generation where correctness is load-bearing. When I use an LLM to generate infrastructure code -- Terraform, Kubernetes manifests, CI/CD pipeline definitions -- the output goes through review but not exhaustive testing before it influences production. I need high first-pass accuracy. Claude Sonnet/Opus and GPT-4o are meaningfully better here than any model I have tested at the 7B-70B range. The delta in correctness is large enough that the cost premium is worth it.

Tasks where the output quality determines whether a human acts on it at all. If a model produces a summary that a senior engineer reads and immediately distrusts, the time cost of that failure outweighs months of inference savings. Frontier models produce output that feels more reliable on tasks where "feels right to a senior engineer" is the acceptance criterion.

I might be wrong that the quality gap on reasoning tasks is as large as I perceive it. The open-source models are improving fast -- the gap between Llama-3.1-70B and GPT-4o is smaller than the gap between Llama-2-70B and GPT-4 was. I expect to revisit the agentic workloads in 12 months and find a different answer.

The infrastructure overhead I did not expect

The cost comparison above is incomplete because it excludes infrastructure overhead. This is the part that gets left out of most "self-hosted LLMs are cheaper" posts.

Running a single A10G instance for a low-to-medium query volume is manageable. Scaling that to handle traffic spikes is not trivial. The logistics classification pipeline hits peaks when carrier systems sync -- three or four times a day, it processes 500 documents in 20 minutes. On a single inference instance, queuing logic matters. I spent about two days building a simple job queue with Redis and a retry mechanism before the latency characteristics were acceptable.
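For scale, 500 documents in 20 minutes leaves a single worker about 2.4 seconds per document (20 × 60 / 500). The sketch below shows the retry-and-requeue idea only. It uses an in-memory `deque` as a stand-in for Redis so it runs without dependencies, and `drain`, `infer`, and the backoff values are hypothetical names and settings, not the ones from the actual pipeline.

```python
import time
from collections import deque


def drain(queue, infer, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Process jobs one at a time; requeue failures with exponential backoff.

    Each queue item is a (job_id, payload, attempt) tuple.
    Returns (results by job_id, list of job_ids that exhausted their retries).
    """
    done, dead = {}, []
    while queue:
        job_id, payload, attempt = queue.popleft()
        try:
            done[job_id] = infer(payload)
        except Exception:
            if attempt + 1 >= max_attempts:
                dead.append(job_id)  # give up; surface for manual handling
            else:
                sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
                queue.append((job_id, payload, attempt + 1))
    return done, dead


# Usage with a stand-in inference function that always succeeds.
jobs = deque([("doc-1", "text one", 0), ("doc-2", "text two", 0)])
results, failed = drain(jobs, infer=lambda text: text.upper())
print(results, failed)
```

With Redis, the `deque` operations would map onto list push and pop commands, and the backoff would more likely be stored as a not-before timestamp so that one failing job does not block the worker.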

Quantization also changes output quality in ways you do not fully know until you test on your specific data. I use Q4_K_M quantization for the Mistral-7B deployment. That has a measurable accuracy impact on certain extraction tasks -- specifically, tables with multi-level headers. I found this by running an offline eval against a labeled test set of 500 documents. If you skip that eval, you ship a quality regression you cannot see.
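A regression like the multi-level-header one tends to hide inside an aggregate accuracy number, so it helps to break the eval down by document slice. A minimal sketch, assuming each labeled example carries a slice tag; the slice names and the toy data are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(examples):
    """examples: iterable of (slice_name, expected, predicted) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for slice_name, expected, predicted in examples:
        totals[slice_name] += 1
        hits[slice_name] += int(expected == predicted)
    return {name: hits[name] / totals[name] for name in totals}


# Toy results from a hypothetical quantized run.
quantized_run = [
    ("plain_fields", "INV-001", "INV-001"),
    ("plain_fields", "2024-03-01", "2024-03-01"),
    ("multi_level_table", "net_weight=120", "net_weight=120"),
    ("multi_level_table", "gross_weight=140", "net_weight=140"),
]
for name, acc in accuracy_by_slice(quantized_run).items():
    print(f"{name}: {acc:.0%}")
```

Running the same report for the unquantized model and for each quantization level makes the per-slice difference visible before anything ships.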

Model updates on frontier APIs are automatic and you often do not notice. Model updates on self-hosted deployments are a deployment event. The logistics client's pipeline was tuned against Llama-3.1-70B. When Llama 3.3 came out, I had to re-run the prompt optimization and re-validate classification accuracy before upgrading. That took about a day. It is not a huge burden but it is a real one.

What I would tell a team making this decision for the first time

Do the per-query math before you do anything else. It is surprising how often teams have not done it. Take your projected monthly volume, multiply by average token count per request, apply the frontier API price, and write down the number. Then do the same for Together AI's current pricing on Llama-70B. If the delta is less than $200/month, the self-hosted or alternative-hosted infrastructure overhead is not worth it -- just stay on the frontier API and move on.
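The rule in that paragraph fits in two small functions. This is a sketch under simplifying assumptions: a single blended price per million tokens (input and output are usually priced differently), and made-up volume figures. The $5.00 and $0.88 prices are the ones quoted earlier in this post.

```python
def monthly_cost(requests_per_month, tokens_per_request, usd_per_million_tokens):
    """Projected monthly spend at a flat per-token price."""
    return requests_per_month * tokens_per_request / 1_000_000 * usd_per_million_tokens


def worth_evaluating(frontier_usd, alternative_usd, min_delta_usd=200.0):
    """True when the monthly saving clears the overhead threshold."""
    return (frontier_usd - alternative_usd) >= min_delta_usd


# Hypothetical workload: 10,000 requests/month at 2,000 tokens each.
frontier = monthly_cost(10_000, 2_000, 5.00)
alternative = monthly_cost(10_000, 2_000, 0.88)
print(f"delta: ${frontier - alternative:.2f}/month")
print("worth evaluating:", worth_evaluating(frontier, alternative))
```

For this hypothetical volume the delta is under $200/month, so the rule says to stay on the frontier API.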

If the delta is material, the next question is accuracy parity. You need a labeled test set for your specific task. Do not use public benchmarks as a proxy for your workload. Public benchmarks measure general capability; your task has a specific distribution that may or may not overlap with benchmark tasks. Build a test set of 200-500 real examples, run both models against it, and measure the accuracy gap. Then decide whether the cost savings justify the gap.
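Measuring the gap and pricing it can be sketched as follows. The labels and predictions are toy data, `gap_report` is a hypothetical helper, and exact-match accuracy is an assumption that suits classification better than free-form extraction.

```python
def accuracy(expected, predicted):
    """Exact-match accuracy over two equal-length label lists."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must be the same length")
    return sum(e == p for e, p in zip(expected, predicted)) / len(expected)


def gap_report(expected, frontier_pred, alt_pred, monthly_savings_usd):
    """Accuracy gap in percentage points and the monthly savings per point given up."""
    gap_points = round((accuracy(expected, frontier_pred)
                        - accuracy(expected, alt_pred)) * 100, 2)
    per_point = monthly_savings_usd / gap_points if gap_points > 0 else None
    return {"gap_points": gap_points, "savings_per_point_usd": per_point}


# Toy test set of 10 labeled documents.
truth = ["invoice"] * 5 + ["bol"] * 5
frontier_out = list(truth)               # all correct
alt_out = ["invoice"] * 4 + ["bol"] * 6  # one invoice mislabeled
print(gap_report(truth, frontier_out, alt_out, monthly_savings_usd=530))
```

A real test set of 200-500 examples gives a far tighter estimate than this toy set; with only a few hundred examples, a gap of one or two points can still be within noise.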

Privacy constraints are a hard gate, not a factor to weigh. If the data is subject to contractual confidentiality, GDPR with a data residency requirement, or any obligation that prohibits third-party processing, frontier APIs are not an option regardless of the cost analysis. This seems obvious but I have seen teams try to get creative about what "processing" means. It does not end well.

The infrastructure competence question is real. Running llama.cpp on a GPU instance is not complicated. Running it reliably, with proper health checks, a job queue, graceful restarts, and an alert when the model process dies at 2am, requires someone who is comfortable doing that work. If your team does not have that person, Together AI or Anyscale Endpoints give you the open-weight models on managed infrastructure -- the cost is higher than self-hosting but lower than frontier APIs, and you carry almost none of the operational burden of serving the model.

The honest version of the conclusion

I did not move workloads off OpenAI because I was unhappy with OpenAI. I moved them because I did the math and found I was paying for capabilities I was not using. The freight document classifier does not need GPT-4o's reasoning depth. The RAG Q&A system does not need frontier-scale world knowledge. Those workloads ran well on smaller models at a fraction of the cost.

The workloads that stayed on frontier APIs stayed there because the quality gap was real and measurable, not because of brand loyalty or inertia. If Llama-4 or Mistral's next generation closes the gap on multi-step reasoning, I will run the same calculation and make the same call.

The framework is not "self-hosted is better." It is "measure the capabilities you actually need, price the options that meet those requirements, and make the choice with numbers in front of you rather than defaults."

The agentic pipeline work referenced here is built on Pipeshift -- I am the founder. The logistics client and internal documentation workloads were consulting engagements through Optivulnix, my DevOps and AI infrastructure consulting practice.