What Actually Breaks When You Ship an LLM Feature to Production

Latency variance, guardrails, retries, cost explosions, and prompt drift—the things nobody tells you until you're on fire at 2am.

Anyone can get an LLM to say "hello" in a notebook. Shipping it to production is where the real work starts—and where most teams get surprised.

Here's what actually breaks.

Latency variance

Same prompt, same model, wildly different response times. A request that took 800ms in staging can hit 4s in prod. Users don't tolerate that. You need timeouts, streaming so the first token lands fast, and fallbacks (e.g. a faster model or a cached path) when the primary path is slow. Without that, your "AI feature" feels broken half the time.
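The timeout-plus-fallback part can be sketched with the standard library. This is a minimal sketch, not a production client: `slow_primary` and `fast_fallback` are hypothetical stand-ins for your real provider SDK calls, and a real version would also handle streaming and cancellation.

```python
import concurrent.futures
import time

def complete_with_timeout(prompt, primary, fallback, timeout_s=2.0):
    """Race the primary model against a latency budget; fall back on timeout."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, prompt)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The slow call keeps running in its thread; don't block on it.
        return fallback(prompt)
    finally:
        pool.shutdown(wait=False, cancel_futures=True)

# Hypothetical stand-ins for real provider calls.
def slow_primary(prompt):
    time.sleep(1.0)   # simulates a tail-latency request
    return "primary answer"

def fast_fallback(prompt):
    return "fallback answer"
```

The key design choice is that the latency budget belongs to the caller, not the provider SDK: the user-facing deadline fires whether or not the upstream request ever returns.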

Guardrails

Inputs and outputs both need guardrails. Prompt injection, jailbreaks, PII leakage, off-brand tone—all of it shows up in prod. You need input validation (length, content filters) and output checks (format, safety, policy). Most teams bolt this on after the first incident. Do it before.
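A bare-bones shape for those checks, assuming naive regex heuristics purely for illustration (real guardrails use classifiers and policy engines, and the patterns below are far too simple to rely on):

```python
import re

MAX_INPUT_CHARS = 4000

# Toy injection heuristic -- real systems use trained classifiers.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
]

# Toy PII check for the sketch: US-style SSNs only.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def check_input(user_text: str) -> bool:
    """Reject oversized or obviously adversarial input before it hits the model."""
    if len(user_text) > MAX_INPUT_CHARS:
        return False
    return not any(p.search(user_text) for p in INJECTION_PATTERNS)

def check_output(model_text: str) -> bool:
    """Block responses that leak PII before they reach the user."""
    return not SSN.search(model_text)
```

The point is the placement, not the patterns: one gate before the model call and one after, both cheap enough to run on every request.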

Retries and fallbacks

Models and providers fail. Rate limits, timeouts, bad responses. You need retry logic with backoff, and ideally a fallback model or provider so the feature degrades instead of dying. "Call OpenAI once and hope" is not a production strategy.
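A minimal retry-with-backoff-and-fallback loop, assuming a hypothetical `ProviderError` that your client raises on rate limits and timeouts:

```python
import random
import time

class ProviderError(Exception):
    """Stand-in for whatever your provider SDK raises on 429s/timeouts."""

def call_with_retries(call, max_attempts=4, base_delay=0.5, fallback=None):
    """Exponential backoff with jitter; degrade to a fallback instead of dying."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ProviderError:
            if attempt == max_attempts - 1:
                break
            # Jitter spreads retries out so a fleet doesn't retry in lockstep.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    if fallback is not None:
        return fallback()
    raise ProviderError("all attempts and fallback exhausted")
```

In practice you'd retry only on retryable errors (429, 5xx, timeouts) and fail fast on things like invalid-request errors, where retrying just burns quota.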

Hallucination mitigation

The model will make things up. For high-stakes flows you need: structured outputs so you can validate shape, optional fact-check or citation steps, and clear UX when the model says "I don't know." Treat hallucination as the default; design around it.
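The "validate shape" step is the easiest to show concretely. A sketch, assuming the model was asked to reply in JSON with hypothetical `answer`/`confidence`/`sources` fields; anything that fails validation is treated as a non-answer rather than shown to the user:

```python
import json

# Hypothetical schema for this sketch -- use your own field names.
REQUIRED_FIELDS = {"answer": str, "confidence": float, "sources": list}

def parse_structured(raw: str):
    """Validate the model's JSON output; return None if anything is off."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # model ignored the format -- treat as "I don't know"
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), ftype):
            return None
    return data
```

A `None` here should route to your "I don't know" UX, not to a retry loop that begs the model until it fabricates something well-formed.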

Cost explosions

Token usage scales with users and context. A feature that cost $50/month in beta can be $5k/month at scale. You need per-request or per-feature cost visibility, budgets, and sometimes hard limits (e.g. cap context length or model tier). Observability and eval pipelines help you see cost before it becomes a crisis.
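Per-request cost visibility plus a hard limit can be as small as this. The prices below are made up for the sketch; pull real numbers from your provider's pricing page:

```python
# Hypothetical per-1k-token prices (USD) -- check your provider's pricing.
PRICE_PER_1K = {"small": 0.0005, "large": 0.01}

class CostTracker:
    """Tracks spend per billing period and enforces a hard budget."""

    def __init__(self, budget_usd: float):
        self.budget = budget_usd
        self.spent = 0.0

    def record(self, model: str, tokens: int) -> float:
        """Log the cost of a completed request; returns that request's cost."""
        cost = PRICE_PER_1K[model] * tokens / 1000
        self.spent += cost
        return cost

    def allow(self, model: str, est_tokens: int) -> bool:
        """Hard limit: refuse calls that would blow the budget."""
        return self.spent + PRICE_PER_1K[model] * est_tokens / 1000 <= self.budget
```

Tagging each `record` call with a feature name (not shown) is what turns "the bill doubled" into "the summarize endpoint doubled."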

Prompt drift

Prompts change. So do models. A prompt that worked last week can degrade after a model update or when someone "improves" the copy. You need versioning, A/B or canary rollouts, and evals that run on every change. Otherwise you're debugging in the dark.
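A minimal version of "versioning plus evals on every change": named prompt versions, a fingerprint you can pin in logs, and a tiny eval harness. The prompt texts and `model_fn` here are placeholders, not a recommendation of any particular eval framework:

```python
import hashlib

# Versioned prompt registry -- texts are illustrative placeholders.
PROMPTS = {
    "summarize_v1": "Summarize the following text in one sentence:\n{text}",
    "summarize_v2": "You are a concise editor. Summarize in one sentence:\n{text}",
}

def prompt_fingerprint(name: str) -> str:
    """Stable hash so logs and eval runs pin the exact prompt text used."""
    return hashlib.sha256(PROMPTS[name].encode()).hexdigest()[:12]

def run_evals(model_fn, cases):
    """cases: list of (input, predicate). Returns the pass rate in [0, 1]."""
    passed = sum(1 for text, ok in cases if ok(model_fn(text)))
    return passed / len(cases)
```

Gate deploys on the pass rate: if `summarize_v2` scores below `summarize_v1` on the same cases, the "improvement" doesn't ship.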

The stack that doesn't break

None of this is optional if you care about reliability. The teams that ship and sleep well use: guardrails → prompt assembly → model routing (with fallbacks) → streaming → post-processing → observability and evals. Build that pipeline once; then new LLM features plug into it instead of re-inventing failure modes.
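The pipeline shape above reduces to a list of stages that each take the payload and return it (or raise to abort). Every stage body here is a stub standing in for the real implementations discussed earlier:

```python
# Each stage is a stub -- the real versions are the pieces covered above.
def guard_input(text):
    if len(text) > 4000:
        raise ValueError("input too long")
    return text

def assemble_prompt(text):
    return f"System: be helpful.\nUser: {text}"

def route_model(prompt):
    # Real routing would pick a provider/model, stream, and fall back.
    return f"model-output({prompt})"

def postprocess(raw):
    return raw.strip()

PIPELINE = [guard_input, assemble_prompt, route_model, postprocess]

def handle_request(text):
    """Run the payload through every stage in order."""
    payload = text
    for stage in PIPELINE:
        payload = stage(payload)
    return payload
```

New features then mean adding a prompt and maybe a stage, not rebuilding timeout, retry, and guardrail logic from scratch.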