The Real Cost of Running LLMs in Production
Tokens, latency, infra, observability, and evals—what actually shows up on the bill and how to stay in control.
"Just call the API" is cheap in a demo. In production, cost comes from more than raw tokens. Here's the breakdown.
Tokens
Input + output tokens drive the bulk of model cost. Context is the killer: long system prompts, RAG chunks, and conversation history multiply token count fast. One user session can be 10x the tokens of a single prompt. You need per-request or per-feature visibility and caps (e.g. max context length, or fallback to a smaller model for simple turns).
Latency
Latency isn't a direct line item, but it drives infrastructure cost and UX. If you're blocking on the model, you need more workers or longer timeouts. Streaming reduces perceived latency by letting the UI render tokens as they arrive, before the full answer is done. Edge routing (e.g. OpenRouter) can cut latency; so can picking a faster model when the task doesn't need the biggest one.
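A tiny simulation makes the streaming point concrete: time-to-first-token is a fraction of total generation time, and that first token is what the user perceives. The fake stream and its delay are stand-ins for a real streaming API.

```python
# Sketch: why streaming improves perceived latency. We simulate a model
# that emits tokens incrementally; the user sees the first token long
# before the full answer is done. Delays are illustrative, not real.

import time
from typing import Iterator

def fake_stream(tokens: list[str], delay: float = 0.005) -> Iterator[str]:
    """Stand-in for a streaming model API: yields tokens one at a time."""
    for tok in tokens:
        time.sleep(delay)
        yield tok

def first_token_latency(stream: Iterator[str]) -> tuple[float, str]:
    """Measure time until the first token arrives (what the user feels)."""
    start = time.monotonic()
    first = next(stream)
    return time.monotonic() - start, first
```

With 40 tokens at 5 ms each, the full answer takes ~200 ms but the first token arrives in ~5 ms; a UI that renders as tokens arrive feels 40x faster than one that blocks.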
Infra
Your own servers, serverless invocations, queues, vector DBs, and caches all scale with usage. If every request hits a vector DB and then an LLM, your infra cost scales linearly with traffic. Cache aggressively, batch where you can, and keep the critical path lean.
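"Cache aggressively" can be as simple as memoizing the expensive path on a normalized query, so repeat traffic skips both the vector DB and the model. A minimal sketch, where `answer_uncached` is a hypothetical stand-in for retrieval plus an LLM call:

```python
# Sketch: cache the expensive path (retrieval + LLM) keyed on a
# normalized query. `answer_uncached` is a hypothetical stand-in for
# the real vector DB lookup and model call.

import functools

calls = {"uncached": 0}  # counter to show cache hits in this demo

@functools.lru_cache(maxsize=10_000)
def answer(query_normalized: str) -> str:
    calls["uncached"] += 1
    return answer_uncached(query_normalized)

def answer_uncached(query: str) -> str:
    # In production: vector DB lookup + LLM call.
    return f"answer for {query}"

def normalize(query: str) -> str:
    """Normalize to raise the hit rate: lowercase, collapse whitespace."""
    return " ".join(query.lower().split())
```

Normalization matters: "What is RAG?" and "  what IS rag? " should be one cache entry, not two. For semantically-similar-but-not-identical queries you would need embedding-based caching, which is a bigger investment.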
Observability
Traces, logs, and metrics. You can self-host (OTel, Langfuse self-hosted) or use a managed product. Either way there's a cost—storage, ingestion, or seat-based. Budget for it. Without observability you'll pay in debugging time and surprise bills; with it you can see cost per flow and optimize.
Evals
Running evals (deterministic checks, model-as-judge) costs tokens and compute. In CI you might run hundreds of cases per commit. That's intentional—evals prevent regressions—but it's not free. Use a subset of cases for fast feedback; run the full suite on a schedule or before release. Balance coverage with cost.
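One way to pick the fast-feedback subset is deterministic hash-based sampling: the same ~10% of cases run on every commit, so results are comparable across runs, while the full suite runs on a schedule. The function names and fraction here are illustrative.

```python
# Sketch: run a stable subset of eval cases per commit and the full
# suite on a schedule. Hashing the case ID keeps the subset identical
# across runs, so per-commit results are comparable.

import hashlib

def in_fast_subset(case_id: str, fraction: float = 0.1) -> bool:
    """Deterministically select ~`fraction` of cases by hashing the ID."""
    h = int(hashlib.sha256(case_id.encode()).hexdigest(), 16)
    return (h % 1000) < fraction * 1000

def select_cases(case_ids: list[str], full: bool = False) -> list[str]:
    """Full suite on schedule/release; stable subset everywhere else."""
    if full:
        return case_ids
    return [c for c in case_ids if in_fast_subset(c)]
```

Random sampling per run would also cut cost, but a regression could flicker in and out of the sample; hashing trades a little coverage diversity for reproducible diffs.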
How to stay in control
- Instrument everything: per-request token counts and cost.
- Set budgets and alerts (e.g. daily spend by feature).
- Prefer smaller/cheaper models where quality is good enough.
- Stream and cache to reduce perceived latency and sometimes total compute.
- Treat observability and evals as fixed costs; skimping on them is false economy.
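The budgets-and-alerts bullet reduces to a comparison you can run on a cron: spend so far against a per-feature cap. A sketch, with hypothetical feature names and thresholds:

```python
# Sketch: flag features whose spend exceeds a configured daily budget.
# Feature names and dollar caps are illustrative assumptions.

DAILY_BUDGET_USD = {"chat": 50.0, "summarize": 10.0}

def over_budget(spend_today: dict[str, float]) -> list[str]:
    """Return features whose spend exceeds their daily budget.
    Unknown features have a budget of 0, so any spend trips the alert."""
    return [feature for feature, spent in spend_today.items()
            if spent > DAILY_BUDGET_USD.get(feature, 0.0)]
```

Wire the output to whatever paging or Slack channel you already use; the hard part is the per-feature spend data, which the instrumentation bullet above provides.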
The real cost of running LLMs in production is tokens + latency-driven infra + observability + evals. Plan for all of it.