What's changed in our recommendations this quarter

Three days ago, OpenAI shipped GPT-5.5. The day after, DeepSeek pushed out a preview of V4. Anthropic shipped Sonnet 4.6 in February and Opus 4.7 ten days ago. This has been the densest quarter on the model release calendar since GPT-5 itself shipped last August.

What follows is one read of what shifted, what stayed, and what it changes for anyone making AI decisions right now. The point of writing this analysis quarterly is that recommendations decay. Saying “Claude” or “GPT” once and never revisiting is not a recommendation; it’s a stale habit.

There are four workload categories where the choice of model still meaningfully changes outcomes: long-context summarization, code generation and agent loops, cheap classification at volume, and on-prem or sovereign deployments. Below, what shifted in each.

Long-context summarization

Anthropic shipped Sonnet 4.6 on February 17 and Opus 4.7 on April 16, the latter with a 1M-token context window. For long-document work where the failure mode is “lost in the middle” (missing the one paragraph that matters in a 200-page filing or a year of meeting transcripts), Opus 4.7 is visibly better than Sonnet 4.6 in the cases where the cost of being wrong is high. And the Opus premium is no longer what it was: $5 versus $3 per million input tokens, about 1.7 times Sonnet’s price, where the old Opus tier ran five times Sonnet’s price. For most long-context work, Sonnet 4.6 remains the right answer. For regulated filings, legal review, and internal evidence work, Opus 4.7 now earns the difference.
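To put the premium in per-document terms, here is a minimal sketch of the input-side arithmetic. It uses only the two input prices quoted above. The 150,000-token document size is a placeholder assumption, not a measured figure, and output-token pricing is left out because it is not discussed here.

```python
# Input prices in USD per million tokens, taken from the paragraph above.
SONNET_INPUT_PER_M = 3.00
OPUS_INPUT_PER_M = 5.00
# The old Opus tier "ran five times Sonnet's price", per the text.
OLD_OPUS_INPUT_PER_M = 5 * SONNET_INPUT_PER_M


def input_cost(tokens: int, price_per_million: float) -> float:
    """USD cost of sending `tokens` input tokens at the given price."""
    return tokens * price_per_million / 1_000_000


def opus_premium(tokens: int) -> dict:
    """Compare Sonnet and Opus input cost for one document of `tokens` tokens.

    `tokens` must be greater than zero, since the ratio divides by the
    Sonnet cost.
    """
    sonnet = input_cost(tokens, SONNET_INPUT_PER_M)
    opus = input_cost(tokens, OPUS_INPUT_PER_M)
    return {
        "sonnet_usd": sonnet,
        "opus_usd": opus,
        "premium_usd": opus - sonnet,
        "ratio": opus / sonnet,
    }


if __name__ == "__main__":
    # ASSUMPTION: 150,000 tokens is a placeholder size for a long filing.
    # Substitute a real token count from your own documents.
    for key, value in opus_premium(150_000).items():
        print(f"{key}: {value:.2f}")
```

Under that assumption the Opus premium comes to roughly thirty cents of input per document, which is the number to weigh against the cost of a missed paragraph.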

Code generation and agent loops

OpenAI shipped GPT-5.5 on April 23, with API access the day after. The headline is the 1M-token API context window (400K in Codex), but the bigger news for agent-loop work is the cached-input pricing: $0.50 per million tokens for cached input versus $5.00 uncached, a tenfold discount. In agent loops where the same system prompt and tool schemas get sent on every step, that is a step change in unit economics.

The implication: a cost-per-run that was a blocker on previous-generation models is likely no longer one. Anyone scoping a new agent loop right now should re-run the cost calculation before settling on a vendor. The model that was too expensive for production volume in February may be the right answer in May.
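A rough way to re-run that calculation, as a sketch rather than a billing model: the two prices are the cached and uncached input rates quoted above, and everything else is an assumption. The function name, the token counts, and the caching behavior (static prefix billed uncached on the first step and cached on every later step) are all illustrative. Output tokens and growing conversation history are ignored.

```python
# Prices in USD per million input tokens, as quoted in the text above.
UNCACHED_INPUT_PER_M = 5.00
CACHED_INPUT_PER_M = 0.50


def input_cost_per_run(prefix_tokens: int, fresh_tokens_per_step: int,
                       steps: int, use_cache: bool = True) -> float:
    """Estimate input-token cost (USD) for one agent run.

    ASSUMPTIONS (simplifications, not vendor billing rules):
    - The static prefix (system prompt + tool schemas) is billed at the
      uncached rate on the first step and the cached rate afterwards.
    - Tokens that are new on each step are always billed uncached.
    - Output tokens and accumulated history are not counted.
    """
    total = 0.0
    for step in range(steps):
        cached = use_cache and step > 0
        prefix_rate = CACHED_INPUT_PER_M if cached else UNCACHED_INPUT_PER_M
        total += prefix_tokens * prefix_rate / 1_000_000
        total += fresh_tokens_per_step * UNCACHED_INPUT_PER_M / 1_000_000
    return total


if __name__ == "__main__":
    # ASSUMPTION: hypothetical run shape; replace with measured values.
    prefix, fresh, steps = 8_000, 1_000, 20
    without = input_cost_per_run(prefix, fresh, steps, use_cache=False)
    with_cache = input_cost_per_run(prefix, fresh, steps, use_cache=True)
    print(f"uncached: ${without:.3f} per run")
    print(f"cached:   ${with_cache:.3f} per run")
```

With those placeholder numbers the input cost per run drops from about $0.90 to about $0.22. The saving scales with how much of each request is the repeated prefix, so the numbers worth plugging in are the ones measured from a real loop.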

Cheap classification and simple agents

Haiku 4.5 has been out since October and continues to be our default for high-volume, low-stakes work: classification, light summarization, the first-pass triage step in multi-stage pipelines. Google’s Gemini 3 Flash shipped in December and is the first Flash-class model in a while worth evaluating against it head-to-head. We haven’t run that comparison at production volume yet, so the honest answer is: undecided. Next quarter should settle it.

Open-weight and on-prem

DeepSeek pushed out a preview of V4 this week: 1M-token context, two sizes, a stable release promised later this year. Two days in, treat it as a direction rather than a verdict. The direction, though, is that the open-weight general-purpose crown keeps changing hands, from Llama 4 Maverick to DeepSeek’s V3.2 to whatever V4 stabilizes into, while the reasoning end (DeepSeek’s Speciale variant, Moonshot’s Kimi K2 Thinking, Qwen’s thinking models) stays contested. The fuller open-weight picture, for anyone weighing on-prem, is in last month’s local-AI piece.

The hardware story moved with it. NVIDIA’s DGX Spark (128GB unified memory, around 1 petaFLOP at FP4) has been available since October 2025 and remains the cleanest single-box option for fine-tuning in the 30B–70B range. The price went from $3,999 to $4,699 in February when memory supply tightened. Last month, Dell shipped the Pro Max with NVIDIA’s GB300 Superchip, with 748GB of coherent memory and 20 petaFLOPS at FP4, substantially more headroom for inference and fine-tuning at the high end. Mac Studio with M3 Ultra is the alternative on Apple Silicon, running 70B-class inference under MLX.

What did not move

Not everything changes every quarter. Two things stayed put.

The case for local AI in sensitive-data environments has not changed in shape. Frontier API quality is improving, but the constraint on those workloads is regulatory, not technical. The regulation has not moved.

The case for “agent platforms,” the wrappers that put a UI around a model API and call themselves a product, has also not changed. They have not outperformed small custom systems on the production work that matters, and they have not addressed the integration and evaluation problems that decide whether a system still works in month four. Anthropic’s MCP, donated in December to the Linux Foundation’s new Agentic AI Foundation and adopted across OpenAI, Google, and Microsoft, has done more for tool integration than any wrapper has.

Why publish this

Recommendations decay. The test of a recommendation is whether it can turn out to be wrong. One that cannot be wrong is a hedge.

Posting this quarterly, dated, with reasoning, makes the analysis falsifiable. If a model release later this year reverses something said here, the reversal will be explicit. The position is intentionally exposed.

Worth revisiting around the second week of August. By then it should be clearer whether GPT-5.5’s cost shift holds up on real agent loops, whether Gemini 3 Flash has displaced Haiku for classification at volume, and whether DeepSeek’s stable V4 lands where the preview points.