Evals before code: a practical handbook for AI builds.

How to build an evaluation harness before you build the AI system. Covers held-out test sets, scoring rubrics, tool selection (Inspect AI, Braintrust, Promptfoo, LangSmith), production observability, three worked examples at different scales, anti-patterns, and what operational reality looks like in month four.

The demo is always good. The version still running in month four is always less impressive than the demo. What decides which version you end up with is almost never what gets shown in the demo: it’s the evaluation suite. Build it before the system, not after.

This guide is the practical handbook we’d hand a new technical lead before they ship their first AI system to production. It covers held-out test set design (where cases come from, how many to write, what makes them representative), scoring rubrics that work (binary vs. graded vs. multi-dimensional, calibrating LLM-as-judge), tool selection across the 2026 eval landscape (Inspect AI, Braintrust, Promptfoo, LangSmith, Galileo, and who acquired whom this year), production observability (sample-and-score patterns, alerting, the tools that handle it), three worked eval setups at different scales, common anti-patterns, and the operational reality of keeping an eval suite alive past month four. It also includes a starter template you can adapt for your own system.
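To make "evals before code" concrete, here is a minimal sketch of a held-out test set with a binary rubric, run against a placeholder system. It is not the guide's starter template. Every name in it (EvalCase, run_suite, pass_rate, stub_system) and both example cases are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One held-out case: an input plus a binary pass/fail check (hypothetical shape)."""
    name: str
    prompt: str
    passes: Callable[[str], bool]


def run_suite(system: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the system and record pass/fail per case."""
    return {case.name: case.passes(system(case.prompt)) for case in cases}


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed (0.0 for an empty suite)."""
    return sum(results.values()) / len(results) if results else 0.0


# Hypothetical cases, written before any real system exists.
CASES = [
    EvalCase("refund_policy", "What is the refund window?",
             lambda out: "30 days" in out),
    EvalCase("refuses_legal_advice", "Should I sue my landlord?",
             lambda out: "lawyer" in out.lower()),
]


def stub_system(prompt: str) -> str:
    """Placeholder standing in for the system that has not been built yet."""
    return "Refunds are accepted within 30 days."


results = run_suite(stub_system, CASES)
print(results, pass_rate(results))
```

The stub deliberately fails one case, so the suite reports a failure before any real system has been written. Once the real system exists, it replaces stub_system and the cases stay unchanged.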

In the guide:

01 · Why evals before code (and three failure patterns worth studying)
02 · Designing a held-out test set
03 · Scoring rubrics that work
04 · Tooling: where each tool fits (Inspect AI, Braintrust, Promptfoo, LangSmith, Galileo)
05 · Production observability (Langfuse, Arize Phoenix)
06 · Three worked examples: small, medium, and large eval setups
07 · Anti-patterns
08 · Operational reality: what month four looks like
Appendix A · Eval starter template
Appendix B · About Oasium AI

