The Missing Developer Tool: An LLM Local Debugger

Why debugging LLM systems sucks, what exists today, and what a real debugger would look like.

Debugging a normal app: set a breakpoint, step through, inspect state. Debugging an LLM app: add logs, re-run the whole flow, stare at a trace in a dashboard that may or may not be in sync with your local code. It's miserable.

Why debugging LLM systems sucks

LLM calls are non-deterministic, async, and often buried inside layers of SDKs and frameworks. You don't get a call stack that says "the model returned garbage because the prompt was truncated at step 3." You get a trace after the fact—if you remembered to send it somewhere. Local iteration is slow: run, wait, scroll through logs, guess, repeat. There's no "pause here and see what the model actually received."

What exists today

Tools like Langfuse and Helicone give you traces in the cloud: request/response pairs, token counts, latency, sometimes the full prompt and completion. That's observability, and it's valuable. But it's not a debugger. You're not stepping through execution. You're not setting breakpoints before the LLM call and inspecting the exact payload. You're not replaying a single request with a tweaked prompt without re-running the whole app. The feedback loop is still "ship to staging, look at traces, iterate."
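To make the distinction concrete, here's roughly what trace-based observability amounts to: a wrapper that records the request, response, and latency after the call has already happened. This is a minimal sketch with hypothetical names (`traced_call`, `fake_model`, the sink structure), not any vendor's actual SDK.

```python
import json
import time

def traced_call(client_fn, payload, sink):
    """Record one LLM call's request, response, and latency into a
    trace sink. Note: by the time you can inspect anything, the call
    has already gone out -- there is no pausing or editing."""
    start = time.monotonic()
    response = client_fn(payload)
    sink.append({
        "request": payload,
        "response": response,
        "latency_s": round(time.monotonic() - start, 3),
    })
    return response

# Stub model call so the sketch is self-contained.
def fake_model(payload):
    return {"text": f"echo: {payload['prompt']}"}

trace = []
traced_call(fake_model, {"model": "stub", "prompt": "hello"}, trace)
print(json.dumps(trace[0], indent=2))
```

Useful for spotting that something went wrong yesterday; useless for stopping it from going wrong right now.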

What a real LLM debugger would look like

  • Breakpoints around LLM calls — Pause before the request; inspect the full prompt, tool definitions, and context. Step over; see the raw response and the parsed output.
  • Replay with edits — Take a captured request, change the prompt or the model, re-run just that call. No redeploy, no full flow.
  • Structured view of the pipeline — See the chain: user input → guardrails → prompt assembly → model → post-processing. Click a step to see inputs and outputs.
  • Local-first — Works in your IDE or a local UI. No "send everything to our cloud." Optional export to your observability stack.
  • Eval in the loop — Run your evals (deterministic checks, model-as-judge) on the current request or a batch. Fail the "debug session" if a regression is detected.
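The first two items above can be sketched in a few dozen lines. This is a toy, with every name hypothetical (`LLMDebugger`, `on_request`, `replay`), assuming the model client is just a function from payload to response — but it shows the shape: a hook that pauses on the exact outgoing payload, and a captured-request log you can replay with edits.

```python
import copy

class LLMDebugger:
    """Sketch of breakpoint-and-replay around LLM calls.

    on_request acts as the 'breakpoint': it receives a copy of the
    exact payload before the call and may return an edited version.
    Every call is captured so a single request can be replayed with
    edits, without re-running the whole flow.
    """

    def __init__(self, client_fn, on_request=None):
        self.client_fn = client_fn
        self.on_request = on_request
        self.captured = []  # request/response pairs, for later replay

    def call(self, payload):
        if self.on_request:  # pause point: inspect or edit before sending
            payload = self.on_request(copy.deepcopy(payload)) or payload
        response = self.client_fn(payload)
        self.captured.append({"request": payload, "response": response})
        return response

    def replay(self, index, **edits):
        """Re-run one captured request with fields overridden."""
        payload = {**self.captured[index]["request"], **edits}
        return self.client_fn(payload)

# Stub model so the sketch runs without a real API.
def fake_model(payload):
    return {"text": f"{payload['model']}: {payload['prompt'][:20]}"}

dbg = LLMDebugger(fake_model,
                  on_request=lambda p: print("about to send:", p) or p)
dbg.call({"model": "stub-a", "prompt": "summarize the incident report"})
fixed = dbg.replay(0, model="stub-b")  # same request, different model
print(fixed["text"])
```

A real tool would hook this into the SDK layer, surface it in an IDE or local UI, and attach evals to `replay` — but the core loop of pause, inspect, edit, re-run is this small.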

We don't have that yet. The moment someone builds it well, every serious LLM team will use it. Until then, we're all adding print statements and praying the trace made it to the dashboard.