Technical strategy for applied AI teams

Make LLM systems cheaper without making them worse.

Practical strategy for teams that want lower LLM spend, sharper system design, and fewer expensive mistakes in production.

Technical note

LLM cost optimization

Most cost blowups are not caused by one bad model choice. They come from the interaction between model pricing, prompt growth, retries, context inflation, and weak operational controls.

The cost problem

LLM costs have a nasty habit of growing much faster than usage. A prototype that looks harmless at $200 per day can become a $2,400 per day production problem once usage grows, chats lengthen, and prompts absorb every edge case the team has ever seen.

The mechanics are simple: per-token pricing multiplied by usage, context window inflation, and call amplification from retries. That combination is why teams routinely underestimate production spend by an order of magnitude.

  • Context window inflation means every follow-up turn is more expensive than the one before it.
  • Timeout retries, parsing retries, and validation retries can turn one logical request into 2-5 model calls.
  • Over-prompting is common: system prompts drift into 3,000+ tokens as teams patch behavior reactively.
  • Many systems still use GPT-4o for work that GPT-4o mini handles at a tiny fraction of the cost.
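The interaction of these drivers is easy to underestimate, so it helps to write the multiplication out. Below is a minimal back-of-the-envelope sketch; all the inputs (prompt sizes, retry rates, turn counts) are illustrative assumptions, not measurements from any real system.

```python
# Back-of-the-envelope model of how retries and context growth
# amplify per-conversation cost. All inputs are illustrative.

def cost_per_conversation(
    turns: int,
    prompt_tokens: int,       # system prompt + instructions
    tokens_per_turn: int,     # user-message tokens per exchange
    retries_per_call: float,  # average extra calls from retries
    input_price: float,       # $ per 1M input tokens
    output_price: float,      # $ per 1M output tokens
    output_tokens: int,       # assistant tokens per exchange
) -> float:
    total = 0.0
    history = 0
    for _ in range(turns):
        # Every turn resends the system prompt plus the full history.
        input_tokens = prompt_tokens + history + tokens_per_turn
        calls = 1 + retries_per_call
        total += calls * (
            input_tokens * input_price / 1e6
            + output_tokens * output_price / 1e6
        )
        history += tokens_per_turn + output_tokens
    return total

# A lean setup vs. a drifted production setup, at GPT-4o pricing.
lean = cost_per_conversation(5, 500, 200, 0.1, 2.50, 10.00, 300)
bloated = cost_per_conversation(12, 3000, 200, 0.8, 2.50, 10.00, 300)
print(f"lean: ${lean:.3f}  bloated: ${bloated:.3f}")
```

With these assumptions, the "bloated" conversation costs roughly ten times the lean one even though the model and per-token prices are identical, which is the compounding the bullets above describe.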

A concrete cost-drift example

Here is a representative support-assistant trajectory. In development, the team saw short conversations and low usage. Production introduced long threads, retries, and prompt bloat. By week 7, the economics were completely different.

Cost trajectory for a support assistant using GPT-4o for all queries
Week | Spend | Users | Primary driver
Week 1 | $200/day | 50 users | Short queries, short chats
Week 3 | $800/day | 200 users | Conversation history starts to dominate
Week 5 | $1,500/day | 400 users | Retry loops and validation failures multiply calls
Week 7 | $2,400/day | 500 users | Prompt bloat + premium-model overuse

After implementing routing, caching, and prompt compression, the same workload dropped from $2,400 per day to roughly $320 per day at 500 users. That is the right mental model for optimization: not one trick, but a stack of compounding improvements.

Model pricing is the first-order constraint

If you do not know your model price ratios, you cannot reason clearly about optimization. The most important number in the table below is not any single price; it is the spread between tiers.

Illustrative model pricing per 1M tokens
Model | Provider | Input | Output | Context | Notes
GPT-4o | OpenAI | $2.50 | $10.00 | 128K | Strong general-purpose default
GPT-4o mini | OpenAI | $0.15 | $0.60 | 128K | 17x cheaper input than 4o
Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 200K | Strong reasoning, large context
Claude Haiku 4.5 | Anthropic | $0.80 | $4.00 | 200K | Fast, lower-cost classification and extraction
Mistral Large 3 | Mistral | $2.00 | $6.00 | 128K | Reasonable API alternative
Llama 4 Maverick | Self-hosted | ~$0.30* | ~$0.30* | 1M | GPU cost only, utilization-sensitive

The key gap is GPT-4o at $2.50 per million input tokens versus GPT-4o mini at $0.15. That 17x delta is why routing matters. For classification, extraction, and straightforward Q&A, the quality difference is often small while the price difference is massive.

Model routing

Model routing is usually the highest-impact optimization. The core idea is simple: route easy work to cheap models, reserve expensive models for the hard tail, and escalate only on failure or low confidence.

Example routing tiers
Tier | Classifier score | Model | Input price | Typical work
Simple | score < 0.3 | GPT-4o mini | $0.15 / 1M input | FAQ, extraction, lightweight summarization
Medium | 0.3 - 0.7 | Claude Haiku 4.5 | $0.80 / 1M input | More nuanced summarization, moderate ambiguity
Complex | score > 0.7 | GPT-4o | $2.50 / 1M input | Hard reasoning, long-tail edge cases

A useful implementation pattern is a cascade router. Start with the cheapest viable model, validate the output, and escalate only if the answer fails a quality gate. In many production systems, 70-80% of traffic is simple enough that a cheap model handles it.

  • Customer support example: route 72% of traffic to GPT-4o mini, 20% to Claude Haiku 4.5, and 8% to GPT-4o.
  • Result: monthly spend drops from $38,000 to $6,200, an 84% reduction, without measurable eval degradation.
  • Common implementation choices: embedding-based classifier, keyword heuristic, or a small verifier model.
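One way to sketch the cascade pattern in code. The model names and thresholds mirror the example tiers above; `complexity_score`, `call_model`, and `passes_quality_gate` are placeholders you would replace with a real classifier, a provider API call, and your own validation logic.

```python
# Cascade router sketch: start at the cheapest viable tier,
# escalate only when the output fails a quality gate.

TIERS = [
    ("gpt-4o-mini", 0.3),       # simple: score < 0.3
    ("claude-haiku-4.5", 0.7),  # medium: 0.3 - 0.7
    ("gpt-4o", 1.01),           # complex: everything else
]

def complexity_score(query: str) -> float:
    """Placeholder: a real system would use an embedding-based
    classifier, keyword heuristics, or a small verifier model."""
    return min(len(query) / 500, 1.0)

def call_model(model: str, query: str) -> str:
    """Placeholder for the actual provider API call."""
    return f"[{model}] answer to: {query}"

def passes_quality_gate(answer: str) -> bool:
    """Placeholder: real gates check schema, groundedness, etc."""
    return len(answer) > 0

def route(query: str) -> tuple[str, str]:
    score = complexity_score(query)
    # Start at the tier the classifier suggests...
    start = next(i for i, (_, cap) in enumerate(TIERS) if score < cap)
    # ...then cascade upward on quality-gate failure.
    for model, _ in TIERS[start:]:
        answer = call_model(model, query)
        if passes_quality_gate(answer):
            return model, answer
    return TIERS[-1][0], answer  # fall back to the top tier's answer

model, answer = route("What is your return policy?")
```

The design choice worth noting: escalation on failure means the cheap tier's mistakes cost you one extra call, not a wrong answer, which is what makes routing safe to deploy incrementally.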

Semantic caching

If one user asks "What is your return policy?" and another asks "How do I return an item?", the system should not pay twice to discover the same answer. Semantic caching is one of the few optimizations that can improve both latency and cost.

Caching strategies
Approach | Hit rate | Effort | Savings | Best for
Exact match cache | 10-20% | Low | Low | Repeated identical inputs
Semantic cache | 30-50% | Medium | High | Support or FAQ flows
Prompt-aware cache | 40-60% | High | Very high | Stable system prompt plus repeated intents
Prefix caching | Automatic | None | Medium | Providers with built-in prompt prefix caching

A typical implementation is Redis plus embeddings. Embed the incoming query, run cosine similarity search, and return a cached response if the match is above a high threshold such as 0.95. Use separate caches per system prompt to avoid contamination.

  • Normalize text before embedding to improve hit rate.
  • Cache at the intent level, not raw string level.
  • Tune the similarity threshold with real false-positive data, not guesses.

Prompt optimization

Prompt optimization is the lowest-effort, highest-return place to start. Many production prompts carry 30-50% dead weight: verbose instructions, stale examples, repeated policies, and output requirements that the system could enforce structurally instead.

  • System prompt compression: 20-40% input token reduction by removing redundancy and consolidating rules.
  • Few-shot to zero-shot migration: 50-80% input token reduction when examples are replaced with tighter instructions or fine tuning.
  • Structured outputs: 30-50% output token reduction by using JSON or tool calls instead of verbose prose.
  • Context pruning: 40-70% input token reduction by summarizing old turns and only passing relevant history.
  • Response length control: 20-60% output token reduction with tighter max-token limits and explicit brevity constraints.

A representative system prompt can shrink from 1,847 tokens to 612 tokens with no quality loss. At 50,000 requests per day on GPT-4o, that alone saves roughly $154 per day, or about $4,600 per month, on system prompt tokens.
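The arithmetic is straightforward to check (a 30-day month is assumed):

```python
# Savings from compressing a system prompt, using the figures above.
OLD_TOKENS = 1847
NEW_TOKENS = 612
REQUESTS_PER_DAY = 50_000
INPUT_PRICE = 2.50  # $ per 1M input tokens (GPT-4o)

saved_tokens_per_day = (OLD_TOKENS - NEW_TOKENS) * REQUESTS_PER_DAY
daily_savings = saved_tokens_per_day * INPUT_PRICE / 1e6
monthly_savings = daily_savings * 30
print(f"${daily_savings:.0f}/day, ${monthly_savings:.0f}/month")
# → $154/day, $4631/month
```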

Batch processing

If the workload is not interactive, batch APIs are an immediate economic win. OpenAI, Anthropic, and other providers commonly offer around 50% savings for asynchronous processing.

  • Good batch candidates: content generation, backfills, summarization pipelines, evaluation suites, and embedding jobs.
  • Bad batch candidates: chat, moderation, streaming UX, and anything with a hard sub-second SLA.
  • Mixed workload pattern: use a queue to split real-time traffic from batch-eligible traffic.

The practical architecture is straightforward: front a queue such as Redis or SQS, mark latency-sensitive work as synchronous, and push everything else to batch endpoints.
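A minimal sketch of that split, with an in-process deque standing in for Redis or SQS. The `is_latency_sensitive` rule is an illustrative assumption; in practice the classification would come from the job's SLA metadata.

```python
from collections import deque

# Latency-sensitive jobs go to the synchronous endpoint immediately;
# everything else is queued and drained later via a batch API.

batch_queue: deque[dict] = deque()

def is_latency_sensitive(job: dict) -> bool:
    # Illustrative rule: chat, moderation, and streaming need
    # real-time responses; see the "bad batch candidates" list above.
    return job.get("kind") in {"chat", "moderation", "streaming"}

def submit(job: dict) -> str:
    if is_latency_sensitive(job):
        return "sync"        # call the real-time endpoint now
    batch_queue.append(job)  # eligible for the ~50% batch discount
    return "queued"

results = [
    submit({"kind": "chat", "text": "hi"}),
    submit({"kind": "summarization", "doc_id": 42}),
    submit({"kind": "embedding", "doc_id": 43}),
]
```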

Fine-tuning economics

Fine-tuning is attractive when you have a narrow task, enough examples, and real call volume. The economic reason to fine-tune is not novelty; it is replacing a large model plus a fat prompt with a smaller model whose behavior is already baked in.

Illustrative fine-tuning economics
Approach | Cost / 1K calls | Quality | Latency | Setup cost
GPT-4o + detailed prompt | $25.00 | 95% | High | $0
GPT-4o mini + few-shot | $1.50 | 88% | Low | $0
GPT-4o mini fine-tuned | $0.90 | 93% | Low | $50-200
Llama 4 Scout fine-tuned | $0.10 | 90% | Very low | $500-2000

  • Fine-tune when you have a narrow task, 500+ good examples, and enough traffic that inference savings matter.
  • Do not fine-tune when requirements shift weekly or when broad world knowledge is still the main bottleneck.
  • In the example above, a fine-tuned GPT-4o mini can approach GPT-4o quality on a narrow task at a fraction of the inference cost.
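Using the illustrative numbers above, the break-even volume is simply the setup cost divided by the per-call savings:

```python
# Break-even: fine-tuned GPT-4o mini vs GPT-4o + detailed prompt,
# using the illustrative per-1K-call costs from the table above.
BASELINE_PER_1K = 25.00  # GPT-4o + detailed prompt
TUNED_PER_1K = 0.90      # fine-tuned GPT-4o mini
SETUP_COST = 200.00      # upper end of the illustrative setup range

savings_per_1k = BASELINE_PER_1K - TUNED_PER_1K
break_even_calls = SETUP_COST / savings_per_1k * 1000
print(f"break-even after ~{break_even_calls:,.0f} calls")
# → break-even after ~8,299 calls
```

Under these assumptions the setup cost is recovered in under 10,000 calls, which is why the real gating questions are requirement stability and example quality, not the training bill.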

Self-hosting open models

Self-hosting can reduce unit economics dramatically, but only if the workload is large enough and the team is willing to own GPU infrastructure, serving, monitoring, and capacity planning.

Illustrative monthly cost of ownership
Option | 100K req/mo | 1M req/mo | 10M req/mo | Notes
OpenAI API (GPT-4o) | $2,500 | $25,000 | $250,000 | No ops, highest marginal cost
GPU rental (A100 80GB) | $2,000 | $2,000 | $6,000 | Fixed cost, real ops burden
Owned hardware (H100) | $4,500* | $4,500* | $4,500* | Lowest long-run cost, high capex

The break-even is rarely at low volume. Below roughly $5,000 per month in API spend, the operational burden usually dominates the savings. A more realistic midpoint is serverless inference on open models before committing to raw GPU operations.
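On raw unit costs alone, the crossover from the illustrative table is easy to estimate by comparing per-request API cost against a fixed monthly GPU cost:

```python
# Naive crossover between API pricing and fixed-cost GPU rental,
# using the illustrative numbers from the table above.
API_COST_PER_REQ = 2500 / 100_000  # $0.025/request (GPT-4o column)
GPU_MONTHLY_FIXED = 2000.0         # A100 80GB rental

crossover_requests = GPU_MONTHLY_FIXED / API_COST_PER_REQ
print(f"GPU rental wins above ~{crossover_requests:,.0f} requests/month")
# → GPU rental wins above ~80,000 requests/month
```

This deliberately ignores the engineering time the section warns about, which is exactly why the practical break-even sits well above the naive crossover.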

What to do first

The right sequence matters more than theoretical completeness. You do not start with self-hosting. You start by stopping the obvious leaks, instrumenting the system, and making routing decisions based on data instead of intuition.

Optimization priority matrix
Optimization | Effort | Impact | Savings | When to do it
Prompt compression | Low | Medium | 20-40% | Always do first
Model routing | Medium | Very high | 60-80% | As soon as spend is material
Semantic caching | Medium | High | 30-60% | When query patterns repeat
Batch processing | Low | Medium | 50% on eligible traffic | When latency is not critical
Fine-tuning | High | High | 70-90% | High-volume narrow tasks
Self-hosting | Very high | Very high | 80-95% | When spend or data constraints justify ops

Bottom line

Cost optimization compounds

Start with a baseline of $10,000 per month. Prompt cleanup can plausibly bring that to $7,000. Routing can cut the remainder to roughly $2,100. Caching can take it to about $1,260. Batch APIs can push the total near $1,008. The point is not that every team lands on those exact numbers. The point is that improvements stack.
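The compounding can be written out directly. Each cut applies to what is left after the previous one, so the percentages multiply rather than add (the cuts below are the illustrative ones from this section):

```python
# Stacked savings: each optimization applies to the remaining spend.
baseline = 10_000.0
cuts = {
    "prompt cleanup": 0.30,
    "model routing": 0.70,
    "semantic caching": 0.40,
    "batch APIs": 0.20,
}
spend = baseline
for name, cut in cuts.items():
    spend *= 1 - cut
    print(f"after {name}: ${spend:,.0f}/month")
# after prompt cleanup: $7,000/month
# after model routing: $2,100/month
# after semantic caching: $1,260/month
# after batch APIs: $1,008/month
```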

* Self-hosted costs are approximate and depend heavily on GPU utilization, throughput, and operational overhead.

About us

We are AI practitioners from academia and industry working on cost, reliability, and overall agent performance for teams deploying real systems.