The unglamorous half of every AI engagement

The demo is always good.

We have sat through enough demos, on both sides of the table, to know that the demo is always good. The agent does the thing. The model writes the report. The dashboard updates. Everyone leaves the meeting impressed. Six months later the system is either working or it is not, and what decided that was almost never the part of the work that anyone showed in the demo.

Roughly half of every AI engagement we run is unglamorous: the work nobody shows in a demo. It is the half that decides whether the system survives.

There are four parts. None of them get a screenshot in the proposal. All of them are in the runbook by the end.

Eval suites built on the work, not on benchmarks

Public benchmarks tell you about the model. MMLU-Pro, GPQA Diamond, and SWE-bench tell you whether the model can answer about twelve thousand multiple-choice questions across academic disciplines, answer a couple hundred graduate-level science questions, or fix a couple thousand real issues pulled from open-source repositories. They don’t tell you whether the system you are building works on your work.

What works on your work has to be measured against your work. That means a held-out set of real cases, pulled from real history rather than invented in a planning meeting, with checkable answers and a scoring rubric written by someone who knows what right looks like in the domain. The set doesn’t need to be large. Twenty to fifty cases is usually enough to catch a regression big enough to matter, as long as the cases are representative, dated, and re-runnable.
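As a minimal sketch of what such a set can look like in code, assuming exact-match scoring is an acceptable rubric for the domain (often it is not, and a domain expert's rubric replaces it): the names EvalCase, exact_match, run_suite, and pass_rate are ours, invented for illustration, and system stands in for whatever callable wraps the agent or model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One held-out case pulled from real history (hypothetical schema)."""
    case_id: str
    source_date: date  # when the real case happened, so the set stays dated
    prompt: str        # the input the system receives
    expected: str      # the checkable answer


def exact_match(output: str, expected: str) -> float:
    """Simplest possible rubric: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_suite(
    cases: list[EvalCase],
    system: Callable[[str], str],
    scorer: Callable[[str, str], float] = exact_match,
) -> dict[str, float]:
    """Run every case through the system and return a score per case id."""
    return {c.case_id: scorer(system(c.prompt), c.expected) for c in cases}


def pass_rate(scores: dict[str, float]) -> float:
    """Mean score across the suite."""
    if not scores:
        raise ValueError("no scores to aggregate")
    return sum(scores.values()) / len(scores)
```

Keeping per-case scores, rather than only the aggregate, is a deliberate choice here: when a run gets worse, the first question is which cases moved.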

We build the eval suite first: before the agent, before the workflow, before the model is selected. An eval suite written after the system biases toward what the system already does; written before, it describes what the system needs to do. The finished set lives with the client, who re-runs it whenever a new model release crosses their bar.

The open-source Inspect framework, maintained by the UK AI Security Institute, is a clean way to formalize this kind of work. Define the inputs, the expected behavior, and the scoring rule, then run the same suite against any model release. Commercial platforms like Braintrust and LangSmith do the same job with a hosted UI, version control, and dashboards built for teams running several eval pipelines in parallel. The tooling matters less than the question it answers: when the model behind your system changes underneath you, what is the alert?
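Whatever tool produces the scores, the alert itself can be a small comparison. The sketch below is framework-agnostic and does not use the Inspect, Braintrust, or LangSmith APIs; regression_report and the max_drop threshold of 0.05 are hypothetical names and values chosen for illustration. It assumes each run yields a score per case id.

```python
def regression_report(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.05,
) -> dict:
    """Compare two runs of the same suite, case by case.

    baseline and candidate map case ids to scores. Only cases present in
    both runs are compared. max_drop is an assumed tolerance on the mean.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs share no case ids")

    base_mean = sum(baseline[k] for k in shared) / len(shared)
    cand_mean = sum(candidate[k] for k in shared) / len(shared)

    return {
        "baseline_mean": base_mean,
        "candidate_mean": cand_mean,
        # Cases that got worse, so a reviewer can read them directly.
        "regressed_cases": [k for k in shared if candidate[k] < baseline[k]],
        # The alert: the mean fell by more than the tolerance.
        "alert": (base_mean - cand_mean) > max_drop,
    }
```

A call such as regression_report(old_scores, new_scores) could run in CI whenever the model identifier in the configuration changes, with the alert field deciding whether the change needs a human review.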

Observability that survives a model swap

Production AI systems run on models that change. Sonnet 4.5 became Sonnet 4.6 last week. GPT-5.1 replaced GPT-5 as ChatGPT’s default in November, the same month Gemini 3 arrived in preview. Whatever model your system is running, the version it ends up on this time next year will not be the version it started on.

This is the failure mode that takes most production teams by surprise. The prompt looks the same. The API contract looks the same. The system runs without errors. The output is subtly wrong, and nobody notices until a customer does.

The fix is observability that quantifies output behavior, not just system uptime. Sample some fraction of production traffic. Score the sample with the same rubric the eval suite uses. Track the distribution of scores over time. When the distribution shifts, alert. Tools like Helicone, Langfuse, and Arize Phoenix work at exactly this layer. They capture every model call your system makes, let you sample, score, and compare outputs over time, and earn their cost the first time the underlying model shifts in a way you would otherwise have missed. The point is for the alert to reach you before the customer ticket does.
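The sample, score, track, alert loop can be sketched in a few lines. This is a simplified illustration, not how any of the tools above work internally: ScoreDriftMonitor is a hypothetical class, it watches only a drop in the rolling mean rather than a full distribution shift, and the window, tolerance, and sample rate are assumed values to tune per system.

```python
import random
from collections import deque
from statistics import mean
from typing import Optional


class ScoreDriftMonitor:
    """Rolling-window check on production scores against an eval baseline."""

    def __init__(
        self,
        baseline_mean: float,
        window: int = 50,
        tolerance: float = 0.1,
        sample_rate: float = 0.1,
        seed: Optional[int] = None,
    ) -> None:
        self.baseline_mean = baseline_mean
        self.tolerance = tolerance
        self.sample_rate = sample_rate
        self.scores: deque = deque(maxlen=window)  # oldest scores fall off
        self._rng = random.Random(seed)

    def should_sample(self) -> bool:
        """Decide whether this production call gets scored."""
        return self._rng.random() < self.sample_rate

    def record(self, score: float) -> bool:
        """Store a score; return True when the alert should fire.

        No alert fires until the window is full, to avoid paging on
        the first few samples.
        """
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False
        return (self.baseline_mean - mean(self.scores)) > self.tolerance
```

In use, each production call would pass through should_sample(), and the sampled outputs would be scored and fed to record(). A drop in the mean is the crudest signal available; a production setup would likely also watch variance and per-category scores.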

Context management as engineering, not vibes

Frontier model context windows have grown. GPT-5 accepts 400,000 tokens; Sonnet 4.6 and Gemini 3 Pro stretch to a million in their long-context modes. The intuitive response to a larger context is to send more of it. The empirical response is more careful.

Long sessions accumulate low-value history. The agent’s tenth tool call has access to the failure log of the first nine. Some of that is useful: the model can recognize that a tool has been failing and avoid it. Some of it is harmful: the model overweights the early framing of the task and underweights the most recent state. The literature on this is real and growing; the term context rot gets used for the late-session degradation pattern, and the Anthropic engineering team has written directly about it.

What this means for production work is that context management has to be deliberate. Most starter frameworks accumulate the full session by default and lean on the context window to bail them out. We strip that on day one: define what state the agent needs at each step, summarize or drop the rest, re-introduce earlier facts only when the current decision needs them. There is nothing novel in any of that. It is just unevenly applied, and the engagements that treat context management as engineering, with explicit rules and tested boundaries, are the ones whose long-running agents don’t degrade in week three.
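One way to make "explicit rules" concrete, as a sketch under our own assumptions: keep the last few steps verbatim, collapse everything older into a per-tool failure count so the model can still see that a tool has been failing, and drop the rest. Step, build_context, and the keep_recent default of 3 are hypothetical and not taken from any particular agent framework.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One tool call in the session (hypothetical record shape)."""
    tool: str
    ok: bool
    detail: str


def build_context(task: str, history: list, keep_recent: int = 3) -> str:
    """Build the text sent to the model for the next decision.

    Rule 1: the task statement always comes first.
    Rule 2: older steps are reduced to failure counts per tool.
    Rule 3: only the most recent steps keep their full detail.
    """
    cut = max(len(history) - keep_recent, 0)
    older, recent = history[:cut], history[cut:]

    lines = [f"Task: {task}"]
    if older:
        failures = Counter(s.tool for s in older if not s.ok)
        summary = ", ".join(
            f"{tool} failed {n}x" for tool, n in sorted(failures.items())
        ) or "no failures"
        lines.append(f"Earlier steps ({len(older)}): {summary}")
    for s in recent:
        status = "ok" if s.ok else "FAILED"
        lines.append(f"{s.tool} [{status}]: {s.detail}")
    return "\n".join(lines)
```

Because the rules live in one function, they can be tested like any other boundary: a unit test can assert that a twenty-step session never produces more than a fixed number of lines.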

Documented handoff

The fourth part is the part most engagements skip and most clients regret skipping.

The handoff is the document that describes how to operate, debug, evaluate, and extend the system without us. Usually it is a runbook; sometimes it is a small internal site. “Without us” is the operative phrase. The point of the handoff is that the engagement ends on a date, and the system has to keep working.

A good handoff includes: how to run the eval suite, what the expected baseline looks like, what triggers a re-eval, what the model-swap procedure is, where the prompts live, what the rollback path is, and who to call when it breaks. The bad version of this is a slide deck. The good version is a markdown file in the same repo as the system, written so that the operator can use it on day 30 without our help, and on day 120 without remembering that we ever wrote it.
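Since the good version is a markdown file in the repo, the checklist above can be enforced mechanically. The sketch below assumes each item gets its own markdown heading; the section titles in REQUIRED_SECTIONS are our own wording for the items listed, and missing_sections is a hypothetical helper, not part of any tool.

```python
# Assumed heading titles, one per item a good handoff covers.
REQUIRED_SECTIONS = (
    "Running the eval suite",
    "Expected baseline",
    "Re-eval triggers",
    "Model-swap procedure",
    "Prompt locations",
    "Rollback path",
    "Escalation contacts",
)


def missing_sections(runbook_text: str, required=REQUIRED_SECTIONS) -> list:
    """Return the required sections with no matching markdown heading.

    A heading is any line starting with '#'. Matching ignores case and
    heading level, and requires the full title to match.
    """
    headings = {
        line.lstrip("#").strip().lower()
        for line in runbook_text.splitlines()
        if line.startswith("#")
    }
    return [s for s in required if s.lower() not in headings]
```

Run against the runbook in CI, a non-empty result fails the build. This checks only that the sections exist, not that what they say is still true; that part remains a review task.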

Why this is the half that matters

No part of this makes a great demo. Together, the four decide whether the system is still working once the novelty wears off. The teams that succeed made peace with that early. The teams that struggle are still hoping the unglamorous half is optional.

We say this as plainly as we can in proposals: roughly half the engagement budget should sit on this side of the line. If a proposal you are reading spends 90% of the timeline on the build and 10% on what comes after, you are reading a proposal for a demo. Demos do not survive month four. Runbooks do.