
Build Your Own LLM Eval Harness vs. Off-the-Shelf Observability

The real tradeoff between rolling your own eval pipeline and using a platform—and when each one actually pays off.

Most teams treat LLM quality the same way: ship the prompt, then bolt on whatever observability or evals they can find.

That works until it doesn't. The question isn't whether to measure—it's who owns the pipeline.

The problem

You need to know if your LLM output is good enough to ship.

"Good enough" is product-specific. For a narrative game it's continuity and choice quality. For a support bot it's accuracy and tone. For a code assistant it's correctness and style. Off-the-shelf platforms give you dashboards, traces, and sometimes generic "quality" scores. They rarely give you the right signal for your product.

So you're stuck between building something that fits but takes time, or buying something that's fast but wrong.

Why "just use a platform" falls short

Observability platforms are great at:

  • Tracing calls and latency
  • Logging tokens and cost
  • Surfacing errors and fallbacks

They're weak at:

  • Defining what "good" means for your use case
  • Running deterministic checks (e.g. "last event must be a choice")
  • Multi-turn or stateful evals (e.g. narrative continuity across 5 scenes)
  • Tight feedback loops: change prompt → run evals → read report in your repo

Most platforms assume you'll add custom evals or "scores" later. In practice, that means you're still building—just inside their UI and their taxonomy instead of in code you own.
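A deterministic check like the "last event must be a choice" rule above is a few lines of code, not a platform feature. A minimal sketch, assuming a hypothetical output shape where a scene is a list of event dicts with a `type` field (your parsed schema will differ):

```python
# Deterministic scorer: the scene must end on a choice.
# Assumes a hypothetical output shape: a list of event dicts with a "type" key.

def last_event_is_choice(events: list[dict]) -> bool:
    """Pass only if the final event in the scene is a choice."""
    return bool(events) and events[-1].get("type") == "choice"
```

Checks like this run in microseconds, are fully reproducible, and live next to the prompt they guard — exactly the kind of signal a generic "quality" score can't give you.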

Why "build everything yourself" is a trap

The other extreme is a full custom eval engine: dataset format, runner, scorers, reporting, CI integration.

That's a product. If you're not selling evals, you're maintaining a second codebase. Scorers drift. Datasets get stale. The harness that worked at 10 cases breaks at 500. Most teams underestimate how much ongoing care a good eval pipeline needs.

The real problem isn't build vs. buy. It's who defines "good" and where the feedback loop lives.

What actually works

Own the definition of good. Borrow the rest.

  • You define: What to measure (e.g. "every scene ends with a choice", "continuity across turns"). Write these as code: scorers, judges, or small scripts that consume LLM output and return pass/fail or a score.
  • You own: The eval dataset (cases, inputs, expected shape) and the runner that calls your LLM and runs your checks. That can be a simple script (e.g. load cases, call model, run scorers, print summary). No need for a fancy platform to do that.
  • You can still use: A platform for traces, cost, and errors. Use it for ops and debugging, not as the source of truth for "is this output good enough to ship."

So: custom evals + minimal runner in your repo, optional observability for everything else.
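"Own the definition of good" in code can be as small as a list of scorer functions plus an optional judge hook. A sketch under the same assumed event-list output shape; the rule names are illustrative, and `judge_fn` stands in for whatever LLM-as-judge call you choose to plug in:

```python
from typing import Callable, Optional

# Deterministic scorers: pure functions over parsed output (names illustrative).
def no_back_to_back_dialogue(events: list[dict]) -> bool:
    """Fail if two consecutive events are both dialogue."""
    return all(
        not (a.get("type") == b.get("type") == "dialogue")
        for a, b in zip(events, events[1:])
    )

SCORERS: list[Callable[[list[dict]], bool]] = [no_back_to_back_dialogue]

def evaluate(events: list[dict],
             judge_fn: Optional[Callable[[list[dict]], bool]] = None) -> dict[str, bool]:
    """Run every deterministic scorer, then an optional LLM-as-judge."""
    results = {s.__name__: s(events) for s in SCORERS}
    if judge_fn is not None:
        results["judge_continuity"] = judge_fn(events)
    return results
```

The judge is just another callable, so you can stub it in tests and swap the model behind it without touching the harness.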

Concrete setup

We do this for a narrative AI:

  • Dataset: JSONL of cases (input + optional expected shape). One line per case. Easy to version and grow.
  • Runner: A script that loads cases, calls the BAML function (e.g. narrative scene), runs deterministic scorers (schema, rules like "no back-to-back dialogue") and optionally an LLM-as-judge for continuity. Writes a short report to stdout or a file.
  • Simulation run: A separate script runs a multi-turn playthrough (same flow as prod) and writes a human-readable report to a file. We (or the AI) run it, read the output, and refine prompts. No copy-paste.
  • CI: Run the eval script on a subset of cases; fail the build if something regresses. No dashboard required for that.
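The pieces above fit in one small script. A minimal runner sketch, assuming a JSONL dataset where each line is a case like `{"id": ..., "input": ...}` and a `call_model` function you supply (in our setup it would wrap the BAML call; everything here is illustrative):

```python
import json

def ends_with_choice(events: list[dict]) -> bool:
    """Deterministic scorer: the scene must end on a choice event."""
    return bool(events) and events[-1].get("type") == "choice"

SCORERS = [ends_with_choice]

def run(dataset_path: str, call_model) -> int:
    """Load JSONL cases, call the model, run scorers, print a summary.

    Returns the number of failing cases so CI can turn it into an exit code.
    """
    with open(dataset_path) as f:
        cases = [json.loads(line) for line in f if line.strip()]
    failures = 0
    for case in cases:
        events = call_model(case["input"])  # swap in your real LLM call
        results = {s.__name__: s(events) for s in SCORERS}
        if not all(results.values()):
            failures += 1
            print(f"FAIL {case.get('id', '?')}: {results}")
    print(f"{len(cases) - failures}/{len(cases)} cases passed")
    return failures
```

In CI, something like `sys.exit(min(run(path, call_model), 1))` is enough to fail the build on any regression — no dashboard involved.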

We don't use a platform to decide if a scene is valid. We use code. We might use a platform later for cost and latency; the eval harness doesn't depend on it.

Takeaways

  • Platforms are for ops and visibility, not for product-specific quality. Use them for traces, cost, errors—not as your eval source of truth.
  • You need a small eval harness anyway. Dataset + runner + your own scorers. That's the only way to get signal that matches your product.
  • Keep the feedback loop in the repo. Scripts, reports, CI. So you can change a prompt and re-run evals without leaving your stack.
  • Build the minimum runner. A few hundred lines that load cases, call the LLM, run checks, print results. You can add a platform later; you can't retrofit product-specific quality into a generic one.

Build the eval logic. Borrow the rest.