The demo is always good.
We have sat through enough demos — as buyers ourselves, and as the consultants called in to clean up after them — to know that the demo is always good. The agent does the thing. The model writes the report. The dashboard updates. Everyone leaves the meeting impressed. Six months later the system is either working or it is not, and what decided that was almost never the part of the work that anyone showed in the demo.
Roughly half of every AI engagement we run is unglamorous — the work nobody shows in a demo. It is the half that decides whether the system survives.
There are four parts. None of them get a screenshot in the proposal. All of them are in the runbook by the end.
Eval suites built on the work, not on benchmarks
Public benchmarks tell you about the model. MMLU-Pro, GPQA Diamond, SWE-bench — they tell you whether the model can answer hundreds of multiple-choice graduate-level science questions, or fix a few thousand issues from open-source repositories. They do not tell you whether the system you are building works on your work.
What works on your work has to be measured against your work. That means a held-out set of real cases — pulled from real history, not invented in a planning meeting — with checkable answers and a scoring rubric written by someone who knows what right looks like in the domain. The set does not need to be large. Twenty to fifty cases is usually enough to detect a regression. It needs to be representative, dated, and re-runnable.
We build the eval suite first. Before the agent. Before the workflow. Before the model is selected. The reason to build it first is that an eval suite written after the system biases toward what the system already does. An eval suite written before the system tells you what the system needs to do. The set lives with the client. They re-run it whenever a new model release crosses their bar.
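As a rough illustration of what "representative, dated, and re-runnable" can look like, here is a minimal sketch of a suite runner. Everything in it is an assumption made for illustration: the `EvalCase` fields, the `exact_match` rubric, and the `stub_system` stand-in are hypothetical, and the three cases are placeholders where a real suite would hold cases pulled from the client's history.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    collected_on: str  # ISO date the case was pulled from real history
    prompt: str
    expected: str      # the checkable answer agreed with a domain expert


def exact_match(expected: str, actual: str) -> float:
    """Simplest possible rubric: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_suite(
    cases: List[EvalCase],
    system: Callable[[str], str],
    rubric: Callable[[str, str], float] = exact_match,
) -> Dict[str, float]:
    """Run every case through `system` and return scores keyed by case_id."""
    return {c.case_id: rubric(c.expected, system(c.prompt)) for c in cases}


def pass_rate(scores: Dict[str, float], pass_mark: float = 1.0) -> float:
    """Fraction of cases whose score reaches `pass_mark`."""
    return sum(s >= pass_mark for s in scores.values()) / len(scores)


# Placeholder cases; a real suite would use cases taken from actual history.
CASES = [
    EvalCase("case-001", "2024-01-15", "What is 2 + 2?", "4"),
    EvalCase("case-002", "2024-02-03", "Capital of France?", "Paris"),
    EvalCase("case-003", "2024-02-20", "Opposite of hot?", "cold"),
]


def stub_system(prompt: str) -> str:
    """Hypothetical stand-in for the system under test."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": " paris "}
    return canned.get(prompt, "unknown")


scores = run_suite(CASES, stub_system)
print(scores, pass_rate(scores))
```

The per-case dictionary is the useful output. A single pass rate hides which cases moved, and it is the per-case record that makes a later comparison between model releases possible.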
The open-source Inspect AI framework, maintained by the UK AI Security Institute (formerly the AI Safety Institute), is a clean way to formalize this kind of work: define the inputs, the expected behavior, and the scoring rule, then run the same suite against any model release. Commercial platforms like Braintrust and LangSmith do the same job with a hosted UI, version control, and dashboards built for teams running several eval pipelines in parallel. None of them matter as much as the question they all answer: when the model behind your system changes underneath you, what is the alert?
Observability that survives a model swap
Production AI systems run on models that change. Sonnet 4.5 became Sonnet 4.6 just last week. GPT-5 became GPT-5 Pro in October. Gemini 3 Pro entered preview earlier this year. Whatever model your system is running, the version it ends up on this time next year will not be the version it started on.
This is the failure mode that takes most production teams by surprise. The prompt looks the same. The API contract looks the same. The system runs without errors. The output is subtly wrong, and nobody notices until a customer does.
The fix is observability that quantifies output behavior, not just system uptime. Sample some fraction of production traffic. Re-score it against the eval suite. Track the distribution of scores over time. When the distribution shifts, alert. Tools that solve this at the LLM-output layer — Helicone, Langfuse, Arize Phoenix — capture every model call your system makes and let you sample, score, and compare outputs over time. They earn their cost the first time the underlying model has a behavior change you would otherwise have missed. The alert reaches you before the customer ticket does, which is the only time the difference between a working alert and a stale dashboard ever actually matters.
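The sample, re-score, track, alert loop can be sketched in a few lines. This is a deliberately crude mean-shift check, not a full distribution test and not what any of the tools named above do internally. The function names, the 3.0 threshold, and the score lists are assumptions made for illustration.

```python
import random
import statistics
from typing import List, Optional, Sequence


def sample_traffic(records: Sequence, fraction: float,
                   seed: Optional[int] = None) -> List:
    """Keep roughly `fraction` of production records for re-scoring."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


def shift_alert(baseline_scores: Sequence[float],
                recent_scores: Sequence[float],
                z_threshold: float = 3.0) -> bool:
    """Return True when the recent mean score sits more than `z_threshold`
    standard errors from the baseline mean.

    Needs at least two baseline scores (statistics.stdev raises otherwise).
    """
    base_mean = statistics.mean(baseline_scores)
    base_sd = statistics.stdev(baseline_scores)
    std_err = base_sd / (len(recent_scores) ** 0.5)
    recent_mean = statistics.mean(recent_scores)
    if std_err == 0:
        # A perfectly flat baseline: any movement at all counts as a shift.
        return recent_mean != base_mean
    return abs(recent_mean - base_mean) / std_err > z_threshold


# Illustrative rubric scores, not real measurements.
baseline = [0.9, 0.8, 1.0, 0.9, 0.85, 0.95, 0.9, 0.8, 1.0, 0.9]
steady = [0.9, 0.85, 0.95, 0.9]
degraded = [0.6, 0.7, 0.65, 0.6]

print(shift_alert(baseline, steady), shift_alert(baseline, degraded))
```

A mean comparison will miss shifts that leave the average unchanged, such as a widening spread. Teams that need more sensitivity usually move to a proper two-sample test, but the shape of the loop stays the same.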
Context management as engineering, not vibes
Frontier model context windows have grown. GPT-5 supports several hundred thousand tokens; Sonnet 4.6 and the latest Gemini variants are comparable. The intuitive response to a larger context is to send more context. The empirical response is more careful.
Long sessions accumulate low-value history. The agent’s tenth tool call has access to the failure log of the first nine. Some of that is useful — the model can recognize that a tool has been failing and avoid it. Some of it is harmful — the model overweights the early framing of the task and underweights the most recent state. The literature on this is real and growing; the term context rot gets used informally for the late-session degradation pattern, and the Anthropic engineering team has written directly about it.
What this means for production work is that context management has to be deliberate. Most starter frameworks ship with the wrong defaults: they accumulate the full session and rely on context-window size to bail them out. We strip that on day one. We define what state the agent needs at each step, summarize or drop the rest, and re-introduce earlier facts only when they are needed for the current decision. None of this is novel. All of it is unevenly applied. The engagements where context management is treated as engineering (with explicit rules, tested boundaries, and a documented strategy) are the ones whose long-running agents do not degrade in week three.
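One explicit rule of this kind might look like the sketch below: keep the task statement, keep as many of the most recent turns as fit a budget, and collapse whatever was dropped into a one-line note about tools that failed. The `Turn` shape, the word-count token proxy, and the digest wording are all assumptions for illustration. A real system would use the model's own tokenizer and a summarizer suited to the domain.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    tool: str
    ok: bool
    text: str


def rough_tokens(text: str) -> int:
    """Crude proxy: whitespace-separated words. Swap in a real tokenizer."""
    return len(text.split())


def build_context(task: str, turns: List[Turn], budget: int) -> List[str]:
    """Return the task, a digest of failing tools among dropped turns, and
    the most recent turns that fit within `budget` rough tokens.

    The digest line itself is not counted against the budget in this sketch.
    """
    kept: List[str] = []
    used = rough_tokens(task)
    # Walk backwards so the most recent state is kept first.
    for turn in reversed(turns):
        cost = rough_tokens(turn.text)
        if used + cost > budget:
            break
        kept.append(turn.text)
        used += cost
    kept.reverse()

    dropped = turns[: len(turns) - len(kept)]
    failing = sorted({t.tool for t in dropped if not t.ok})
    digest = (
        [f"Earlier failures (dropped from history): {', '.join(failing)}"]
        if failing
        else []
    )
    return [task, *digest, *kept]


# Hypothetical session history.
history = [
    Turn("search", False, "search failed: timeout"),
    Turn("search", False, "search failed again: timeout"),
    Turn("db", True, "db returned 3 rows"),
    Turn("calc", True, "total is 42"),
]
context = build_context("Reconcile the March ledger", history, budget=12)
print(context)
```

The digest preserves the one useful signal from the dropped turns, that a tool has been failing, without carrying the full failure log forward.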
Documented handoff
The fourth part is the part most engagements skip and most clients regret skipping.
The handoff is the document — usually a runbook, sometimes a small internal site — that describes how to operate, debug, evaluate, and extend the system without us. Without us is the operative phrase. The point of the handoff is that the engagement ends on a date, and the system has to keep working.
A good handoff includes: how to run the eval suite, what the expected baseline looks like, what triggers a re-eval, what the model-swap procedure is, where the prompts live, what the rollback path is, and who to call when it breaks. The bad version of this is a slide deck. The good version is a markdown file in the same repo as the system, written so that the operator can use it on day 30 without our help, and on day 120 without remembering that we ever wrote it.
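The model-swap procedure in a runbook often reduces to a gate the operator can run unaided. A minimal sketch, assuming per-case scores keyed by case ID have already been recorded for the current baseline and for the candidate model; the function name, the zero-regression default, and the example scores are hypothetical.

```python
from typing import Dict, List, Tuple


def swap_decision(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_regressions: int = 0,
    pass_mark: float = 1.0,
) -> Tuple[str, List[str]]:
    """Compare per-case scores and return ("promote" | "rollback", regressed).

    A case regresses when it passed on the baseline and fails on the
    candidate. A case missing from the candidate run counts as a failure.
    """
    regressed = sorted(
        case_id
        for case_id, base_score in baseline.items()
        if base_score >= pass_mark and candidate.get(case_id, 0.0) < pass_mark
    )
    decision = "promote" if len(regressed) <= max_regressions else "rollback"
    return decision, regressed


# Illustrative scores, not real results.
baseline_scores = {"case-001": 1.0, "case-002": 1.0, "case-003": 0.0}
candidate_scores = {"case-001": 1.0, "case-002": 0.0, "case-003": 1.0}

decision, regressed = swap_decision(baseline_scores, candidate_scores)
print(decision, regressed)
```

Note that the candidate here has the same overall pass rate as the baseline and is still rejected, because a previously passing case now fails. Gating on named regressions rather than on the aggregate is a design choice; a team with a noisier rubric might prefer a nonzero `max_regressions`.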
Why this is the half that matters
None of the four parts make a great demo. All four of them decide whether the system is still working in month four. The teams who succeed are the ones who made peace with this half being unglamorous. The teams who struggle are the ones still hoping it is not.
We say this as honestly as we can in proposals: roughly half the engagement budget should sit on this side of the line. If the proposal you are reading from a competing firm spends 90% of the timeline on the build and 10% on what comes after, you are reading a proposal for a demo. Demos do not survive month four. Runbooks do.