Deya-ai

Technical deep dive

How we think about LLM cost optimization

Cost blowups are not caused by one bad model choice. They come from the interaction between model pricing, prompt growth, retries, context inflation, and weak operational controls.

Most teams discover the cost problem the same way. A manageable prototype becomes an expensive production system, and it's not obvious why. The culprits are usually the same: prompts that grew too long, a powerful model doing simple work, queries answered twice when they didn't need to be. Fixes range from quick wins like compressing prompts or adding a cache to more involved work like fine-tuning smaller models or handing off tasks to purpose-built tools like embedders, rerankers, and classifiers. The latter take more effort but can unlock considerably larger savings.

Don't optimize blind

Cutting costs only counts if the system still works. Before making any changes, the most important investment is a good evaluation set, a test suite that tells you whether your optimizations are helping or quietly breaking things. The structure should follow testing principles most engineers already know: unit tests for individual prompt templates (the fixed instruction plus its variable inputs) to check that each one produces the right output across the range of requests it will see; component tests for each subagent or pipeline stage (does this part behave correctly in isolation?); andend-to-end tests that verify the system as a whole produces the right final result, whether that's a response to a user, a document summary, or a structured output. Many teams skip this early and end up having to undo changes they can't confidently trust. It'sworth doing first, and worth expanding over time as production surfaces new edge cases and bugs.

Model selection: the highest-leverage variable

The single biggest cost lever in most LLM systems is which model handles each request. Across the current generation of models, input prices range from $5.00 per million tokens at the frontier tier to $0.20 at the small tier, a 25x gap.

Current model pricing per 1M tokens (all providers)
Model	Provider	Tier	Input	Cached input	Output
GPT-5.5	OpenAI	Frontier	$5.00	$0.50	$30.00
GPT-5.4	OpenAI	Large	$2.50	$0.25	$15.00
GPT-5.4 mini	OpenAI	Mid	$0.75	$0.075	$4.50
GPT-5.4 nano	OpenAI	Small	$0.20	$0.02	$1.25
Claude Opus 4.7	Anthropic	Frontier	$5.00	$0.50	$25.00
Claude Sonnet 4.6	Anthropic	Large	$3.00	$0.30	$15.00
Claude Haiku 4.5	Anthropic	Small	$1.00	$0.10	$5.00
Gemini 3.1 Pro	Google	Frontier	$2.00	$0.20	$12.00
Gemini 3 Flash	Google	Mid	$0.50	$0.05	$3.00
Gemini 3.1 Flash-Lite†	Google	Small	$0.25	$0.025	$1.50

† Gemini 3.1 Flash-Lite is in preview; pricing may change. Cached input pricing applies to context cache hits. All three providers offer approximately a 90% discount on cached input tokens.

At high volume, routing intelligently across tiers is the primary cost driver.

The goal for each prompt template in your system is to find the smallest model that passes your evaluation set. General principles apply as a starting point. Small-tier models (GPT-5.4 nano, Haiku, Flash-Lite) tend to work well for narrow, well-defined tasks like extraction and classification. Mid-tier models handle requests with moderate complexity and some ambiguity. Large and frontier models are worth their cost for complex multi-step reasoning, long-horizon agentic work, and cases where quality degradation has high downstream cost. These are rough heuristics. The only reliable way to know whether a cheaper model clears your quality bar for a specific template is to run it against your evaluation set.

A practical implementation pattern is a cascade router. Start each request at the cheapest viable tier, validate the output against a quality gate, and escalate only on failure or low confidence. In many production systems, 70–80% of traffic is simple enough that a small-tier model handles it without escalation.

Customer support example

Routing 72% of traffic to small-tier models, 20% to mid-tier, and 8% to a large model reduced monthly spend from $38,000 to roughly $7,500, about 80% reduction, without measurable eval degradation.

KV caching

When you send a prompt to an LLM, the model processes each token by attending to all the tokens before it. That computation produces intermediate results (the key-value pairs that give KV caching its name) which can be stored and reused if the same prefix appears again. This means that if two requests share an identical opening, the provider only processes that shared part once.

Every prompt template has a constant part (system instructions, persona, few-shot examples) and a dynamic part that changes per request. If the constant part comes first, providers can cache it across requests. If the dynamic part comes first, or if the two are interleaved, the cache is unlikely to be used.

KV caching is largely automatic across OpenAI, Anthropic, and Google, though TTLs, minimum prefix lengths, and eligible positions vary by provider. On a well-structured prompt, a generous upper estimate is that around 80% of input tokens are served from cache, at roughly a 90% discount. On a 3,000-token prompt with a 100-token output at large-tier pricing, that drops the per-call cost from $0.009 to $0.0036, a 60% reduction. At 50,000 requests per day, that saves around $270 daily. In practice, cache rates vary and are worth verifying: most providers expose cache hit counts in the response metadata, and checking these is the only reliable way to confirm that your prompt structure is working as intended.

Semantic caching

Semantic caching intercepts a request before it reaches the model and returns a stored response if the input is similar enough to a previous one. If one user asks "What is your return policy?" and another asks "How do I return an item?", the system should not pay twice for the same answer. Unlike KV caching, this is an application-layer optimization you build yourself, and one of the few that can improve both latency and cost.

Two useful scopes: prompt-level caching operates per template, returning a cached response when a new request closely matches a previous one to the same template. Agent-level caching sits at the system boundary and returns a cached response regardless of which pipeline stage would handle the request.

A typical implementation uses Redis plus embeddings with a cosine similarity threshold, though the right threshold is more of a starting point than a fixed value, depending on choices like whether you embed the full prompt or only the dynamic part. Keep separate caches per prompt template to avoid contamination, and tune against real false-positive data.

Prompt optimization

Prompt changes should be driven by accuracy first, and token count second. A prompt that has grown too long, accumulated stale instructions, or relies on examples the model no longer needs is more likely to produce inconsistent outputs. Trimming it for quality tends to reduce tokens as a side effect.

The main compression levers are removing redundancy, consolidating rules that have been patched incrementally over time, and replacing few-shot examples with tighter instructions where the model handles the task well without them. A representative system prompt can shrink from 4,000 to 2,000 tokens with no quality loss. In conversational systems, context pruning (summarizing old turns and passing only relevant history) can cut input tokens by 20-50%.

A reliable evaluation set makes prompt work significantly easier. With one in place, you can use automatic prompt optimization frameworks rather than relying entirely on manual iteration. DSPy treats prompts as a program and uses your eval set to automatically search for better instructions, few-shot examples, and pipeline structure. Similar frameworks optimize prompt phrasing directly against labeled examples. Both approaches require a good eval set to work, which is another reason it is worth investing in early.

Structuring Outputs

Structuring the model's output affects both cost and reliability. Free-form text is harder to parse and more likely to trigger retries when the response does not match what the rest of the system expects. The format also determines how many output tokens are used.

Structured outputs are the baseline. Returning JSON makes responses easier to parse and eliminates a class of retry failures caused by format mismatches. Several providers (including OpenAI and Anthropic) support forced structured output, where the model is constrained to produce valid JSON matching a given schema, guaranteeing the response always matches the expected structure.

Reasoning budget is worth tuning on models that expose it. When a model reasons before answering, that reasoning produces tokens that are billed but not returned to the user. Most providers let you set the reasoning level or token budget, and for tasks that do not require deep reasoning, reducing it can cut output token costs meaningfully.

Beyond JSON: JSON is the standard default, but it is not the most token-efficient format for structured data. Formats like TOON (Token-Oriented Object Notation) are designed specifically to reduce LLM output tokens, cutting token count by roughly 40% compared to equivalent JSON.

Batch processing

Batch APIs are an immediate cost win for any workload where latency does not matter. Offline pipelines that summarize documents, classify records, or scan outputs for errors are natural candidates: the work needs to happen, but nothing is waiting on the result in real time. All major providers offer roughly 50% cost reduction for batch-submitted jobs.

The complication in agentic pipelines is dependency chaining. If each LLM call depends on the output of the previous one, the pipeline cannot be submitted as a single batch job. A smart router that identifies which steps are independent and submits those to batch endpoints, while keeping blocking steps synchronous, can still recover a meaningful share of the savings.

Advanced methods

These methods take more engineering work to research and implement. They pay off at high traffic volumes.

Fine-tuning

Fine-tuning pays off when you have a good dataset and enough traffic to justify the setup. In most cases, the appropriate route is fine-tune per prompt template, or across a small group of templates with closely related tasks. A single fine-tuned model rarely covers an entire pipeline well.

Getting there requires a good labeled dataset (typically 1,000+ samples), a clear choice of training method (SFT, RL, RLHF, and others), and a hosting plan (self-hosted, rented GPU, or through a provider). Two things worth knowing about the provider route: pricing for fine-tuned models may differ from the base model, so check before committing. Provider fine-tuning also typically limits which training methods you can use and how much visibility you have into the training process.

The most common use case is distillation: identify a prompt currently handled by a large expensive model that smaller models cannot reliably handle, then train a smaller model on outputs from the large one. When doing this in production, always verify alignment by sampling outputs from both models and checking for divergence.

Fine-tuning also has a second application: training a model to internalize constant task instructions, removing the need to include them in every call, or to perform a task directly without needing to reason through it, saving tokens.

One caveat: fine-tuned models are less flexible than base models, and significant changes to the task or output format often require retraining.

Moving to non-generative models

Generative models (those that produce free-form text) are versatile but expensive for tasks that do not actually require it. For several common task types, specialized alternatives are significantly cheaper and often more reliable. A 1,000-token classification call on Claude Sonnet with 50% cache hits costs around $0.0024. The same input embedded with Voyage-4-large costs $0.00012 (20x less), with no output tokens at all.

Classification is the most common case. If a generative model is deciding between a fixed set of categories, an embedding model plus a logistic regression or a fine-tuned task head can do the same job at a fraction of the cost. For binary yes/no decisions, a reranker model is often a good fit.

For matching and categorization tasks, where a generative model maps user requests to predefined intents or categories, embedding similarity is a cheaper alternative, though standard embedding models are not task-aware and can produce results that feel weakly related to your specific task. Instruct-tuned embedders and rerankers tend to yield meaningfully better results here.

retrieval augmented generation (RAG) is used less as context windows grow, but retrieving relevant content from a large corpus is still often cheaper than stuffing everything into a long context.

Self-hosting open models

At high enough traffic volume, renting dedicated GPU hardware and running a fine-tuned open model can beat per-token API pricing by a wide margin. Compliance requirements or GPU cloud credits can also make self-hosting the right call, though those situations are more specific.

Consider a prompt running well on Claude Sonnet 4.6, called 1,000 times per hour, with 4,000 input tokens and 100 output tokens per call and 50% of input tokens served from cache. At current Sonnet pricing ($3.00/1M uncached input, $0.30/1M cached, $15.00/1M output), each call costs $0.0081, and 1,000 calls per hour comes to $8.10/hour, roughly $5,800/month. Now assume Qwen3 32B, fine-tuned on this task and served with quantization on a rented H200 GPU. At batch size 10, a single H200 handles around 10,000 such requests per hour, well above the 1,000 needed. The GPU cost is fixed regardless of utilization:

Self-hosting vs. API at 1,000 requests/hour
Option	Cost/hour	Cost/month	vs. Sonnet 4.6
Claude Sonnet 4.6 (API)	$8.10	~$5,800	baseline
Qwen3 32B, Vast.ai H200	$2.32	~$1,670	-71%
Qwen3 32B, AWS H200*	~$5.00	~$3,600	-38%

* AWS minimum billing unit is the full (8× H200 at $39.80/hr). The per-GPU figure treats this as a single GPU's share.

Self-hosting also opens up optimization options that provider APIs do not offer, such as speculative decoding: a small draft model generates candidate tokens in parallel and a larger model validates them, increasing effective throughput by 2-3x without changing output quality.

A note on latency

In most cases, minimizing cost also minimizes latency. Both come down to reducing the number of tokens processed, and smaller models require less computation, so they improve both simultaneously. The main exceptions are batch processing (which trades latency for cost savings by design) and self-hosting, which tends to improve latency under light load but degrade under high concurrent traffic that a provider would handle without issue.

Make LLM systems cheaper without making them worse.

We reduce cost, improve reliability, and make AI systems easier to change.

Reduce waste in production

Match the model to the job

Add evaluation coverage

Improve agent reliability

Want a clearer view of where cost and reliability can improve?

How we think about LLM cost optimization

Don't optimize blind

Model selection: the highest-leverage variable

KV caching

Semantic caching

Prompt optimization

Structuring Outputs

Batch processing

Advanced methods

Fine-tuning

Moving to non-generative models

Self-hosting open models

Want a clearer view of where cost and reliability can improve?

Applied AI practitioners from academia and industry