Evaluation · Apr 26, 2026 · 5 min read

What's changed in our recommendations this quarter

April 2026 has been a denser stretch on the model release calendar than any quarter since GPT-5 itself shipped last August. One read of what shifted, what stayed, and what it changes for anyone making AI decisions right now.

Two days ago, OpenAI shipped GPT-5.5. Two months ago, Anthropic shipped Sonnet 4.6. In between, Opus 4.7 launched, DeepSeek shipped V4, and Google rolled out Gemini 3 Flash. April 2026 has been a denser stretch on the model release calendar than any single quarter since GPT-5 itself shipped last August.

What follows is one read of what shifted, what stayed, and what it changes for anyone making AI decisions right now. The point of writing this kind of analysis quarterly — dated, with reasoning, falsifiable later — is that recommendations decay. Saying “Claude” or “GPT” once and never revisiting is not a recommendation; it’s a stale habit.

There are five workload categories where the choice of model still meaningfully changes outcomes: long-context summarization, code generation and agent loops, cheap classification at volume, on-prem and sovereign deployments, and frontier reasoning. Below, what shifted in the first four.

Long-context summarization

Anthropic shipped Sonnet 4.6 on February 17 and Opus 4.7 in April, the latter with a 1M-token context window. For long-document work where the failure mode is “lost in the middle” (pulling the right paragraph out of a 200-page filing or a year of meeting transcripts), Opus 4.7 is visibly better than Sonnet 4.6 in the cases where being wrong is costly. It is also roughly five times more expensive per task. For most long-context work, that premium is not worth paying and Sonnet 4.6 remains the right answer. For regulated filings, legal review, and internal evidence work, where one missed paragraph costs more than the price difference, Opus is now the right answer.
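One rough way to locate the line between the two tiers is an expected-cost comparison. The sketch below is illustrative only: the function name, the per-task price, and both error rates are hypothetical placeholders, and only the roughly fivefold price ratio comes from the paragraph above.

```python
def break_even_error_cost(base_cost, cost_ratio, base_error_rate, premium_error_rate):
    """Cost of a single error above which the premium tier is cheaper in expectation.

    Expected cost per task = model cost + error rate * cost of an error.
    Setting the two tiers equal and solving for the cost of an error gives
    base_cost * (cost_ratio - 1) / (base_error_rate - premium_error_rate).
    """
    error_rate_gain = base_error_rate - premium_error_rate
    if error_rate_gain <= 0:
        # The premium tier makes no fewer mistakes, so it never pays for itself.
        return float("inf")
    return base_cost * (cost_ratio - 1) / error_rate_gain


# Hypothetical inputs: $0.20 per task on the cheaper tier, 5x on the premium
# tier, and assumed miss rates of 4% and 2%. None of these are measurements.
threshold = break_even_error_cost(0.20, 5.0, 0.04, 0.02)
print(f"premium tier pays off when one miss costs more than ${threshold:.2f}")
```

With these made-up inputs the threshold lands at roughly $40 per miss. The useful part is the shape of the result: the smaller the measured accuracy gap on your own documents, the higher the stakes need to be before the premium tier is justified.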

Code generation and agent loops

OpenAI shipped GPT-5.5 on April 24. The headline is the 1M-token API context window (400K in Codex), but the bigger news for agent-loop work is the cached-input pricing: $0.50 per million tokens for cached input vs. $5.00 uncached. In agent loops where the same system prompt and tool schemas get sent on every step, that cuts the cost of the repeated prefix tenfold, a step change in unit economics.

The implication: a cost-per-run that was a blocker on previous-generation models is likely no longer one. Anyone scoping a new agent loop right now should re-run the cost calculation before settling on a vendor. The model that was too expensive for production volume in February may be the right answer in May.
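A minimal sketch of that cost calculation, using the two input prices quoted above. Everything else is an assumption made for illustration: the function name, the workload numbers, and the simplified billing model, in which the repeated prefix is billed uncached on the first step and cached on every later step. Output tokens, growing conversation history, and cache expiry are ignored, so treat the result as a lower bound on input-side spend, not a quote.

```python
# Input prices quoted above, in dollars per million tokens.
UNCACHED_PER_MTOK = 5.00
CACHED_PER_MTOK = 0.50


def input_cost_per_run(steps, prefix_tokens, fresh_tokens_per_step, cache_prefix=True):
    """Estimate input-token cost in dollars for one agent run.

    Simplifying assumptions (not taken from vendor documentation): the prefix
    (system prompt plus tool schemas) is billed uncached on the first step and
    cached afterwards; fresh tokens are always billed uncached.
    """
    total = 0.0
    for step in range(steps):
        cached = cache_prefix and step > 0
        prefix_rate = CACHED_PER_MTOK if cached else UNCACHED_PER_MTOK
        total += prefix_tokens * prefix_rate / 1_000_000
        total += fresh_tokens_per_step * UNCACHED_PER_MTOK / 1_000_000
    return total


# Hypothetical workload: 30 steps, 8,000-token prefix, 1,500 new tokens per step.
without_cache = input_cost_per_run(30, 8_000, 1_500, cache_prefix=False)
with_cache = input_cost_per_run(30, 8_000, 1_500, cache_prefix=True)
print(f"uncached: ${without_cache:.3f}  cached: ${with_cache:.3f}")
```

For this hypothetical workload the input side drops from about $1.4 to about $0.4 per run. The saving scales with how much of each step is repeated prefix, so loops with large tool schemas and short per-step additions benefit most.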

Cheap classification and simple agents

Haiku 4.5 has been out since October 2025 and continues to be the best default for high-volume, low-stakes work: classification, light summarization, and the first-pass triage step in multi-stage pipelines. Google’s Gemini 3 Flash shipped this quarter and is the first Flash-class model in a while worth evaluating seriously against Haiku. The honest answer for now: too early to call. The next quarter will tell.
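An evaluation of that kind can be as simple as running both candidates over the same labeled sample and counting where they differ. The sketch below assumes each model is wrapped in a plain function that takes text and returns a label; the two keyword-based stand-ins and the four examples are invented for illustration and are not real model calls or real data.

```python
def compare_classifiers(examples, labels, model_a, model_b):
    """Score two classifiers on the same labeled examples.

    Returns each model's accuracy plus the number of examples that only one
    of the two got right, which shows whether they fail on different inputs.
    """
    if not labels or len(examples) != len(labels):
        raise ValueError("need a non-empty, equal-length set of examples and labels")
    a_correct = b_correct = a_only = b_only = 0
    for text, gold in zip(examples, labels):
        a_ok = model_a(text) == gold
        b_ok = model_b(text) == gold
        a_correct += a_ok
        b_correct += b_ok
        a_only += a_ok and not b_ok
        b_only += b_ok and not a_ok
    n = len(labels)
    return {
        "accuracy_a": a_correct / n,
        "accuracy_b": b_correct / n,
        "a_only": a_only,
        "b_only": b_only,
    }


# Hypothetical stand-ins for two model wrappers; swap in real API calls.
def model_a(text):
    return "urgent" if "outage" in text.lower() else "routine"


def model_b(text):
    lowered = text.lower()
    return "urgent" if "urgent" in lowered or "outage" in lowered else "routine"


examples = [
    "Production outage in EU region",
    "URGENT: invoice overdue",
    "Weekly newsletter",
    "Urgent sale ends tonight",
]
labels = ["urgent", "urgent", "routine", "routine"]
print(compare_classifiers(examples, labels, model_a, model_b))
```

In this toy run both stand-ins score the same accuracy while missing different examples, which is the case a single headline accuracy number hides. On a real workload the sample should be large enough, and drawn from production traffic, for the disagreement counts to mean something.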

Open-weight and on-prem

DeepSeek introduced V4 earlier this month, a meaningful architectural step over V3.2 with usable ultra-long context. For on-prem and sovereign deployments (research labs, healthcare, government, regulated environments), V4 is now the strongest open-weight option for general-purpose work. It sits ahead of Llama 4 Maverick, which is still strong on raw MMLU at 85.5%, and Qwen 3 235B, which is still the better pick when the workload is reasoning-heavy (math, complex scheduling, multi-step legal analysis) and leads on GPQA Diamond at 77.2% and AIME ’24 at 85.7%.

The hardware story moved with it. NVIDIA’s DGX Spark, with 128GB of unified memory and around 1 petaFLOP at FP4, has been available since October 2025 and remains the cleanest single-box option for fine-tuning in the 30B–70B range. Its price rose from $3,999 to $4,699 in February when memory supply tightened. Last month, Dell shipped the Pro Max with NVIDIA’s GB300 Superchip, which offers 748GB of coherent memory and 20 petaFLOPS at FP4: substantially more headroom for inference and fine-tuning at the high end. On Apple Silicon, the alternative is a Mac Studio with M3 Ultra, running 70B-class inference under MLX.
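A back-of-the-envelope check on what fits in those memory sizes: weight memory is roughly parameter count times bytes per parameter. The sketch below uses the 128GB and 748GB figures from the paragraph above. The helper names and the 80% usable-memory fraction are assumptions chosen for illustration. Fine-tuning also needs room for gradients, optimizer state, and activations, so this is only a lower bound.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_memory_gb(params_billions, precision):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * BYTES_PER_PARAM[precision]


def weights_fit(params_billions, precision, memory_gb, usable_fraction=0.8):
    """True if the weights fit in an assumed usable share of memory.

    usable_fraction is a hypothetical allowance for the KV cache, activations,
    and the operating system; it is not a measured figure.
    """
    return weight_memory_gb(params_billions, precision) <= memory_gb * usable_fraction


for precision in ("fp16", "fp8", "fp4"):
    size = weight_memory_gb(70, precision)
    print(
        f"70B @ {precision}: {size:.0f} GB of weights, "
        f"fits 128GB: {weights_fit(70, precision, 128)}, "
        f"fits 748GB: {weights_fit(70, precision, 748)}"
    )
```

Under these assumptions a 70B model at 16-bit precision needs about 140 GB for weights alone and does not fit in 128GB, while 8-bit and 4-bit versions do. That is why quantization decides what a single box can host.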

What did not move

Not everything changes every quarter. Two things are worth noting.

The case for local AI in sensitive-data environments has not changed in shape. Frontier API quality is improving, but the constraint on those workloads is regulatory, not technical. The regulation has not moved.

The case for “agent platforms” — the wrappers that put a UI around a model API and call themselves a product — has also not changed. They have not outperformed small custom systems on the production work that matters, and they have not addressed the integration and evaluation problems that decide whether a system still works in month four. Anthropic’s MCP — donated to the Linux Foundation in December, now adopted across OpenAI, Google, and Microsoft — has done more for tool integration than any wrapper has.

Why publish this

Recommendations decay. The interesting question is whether you can be wrong about one. A recommendation you cannot be wrong about is not a recommendation; it is a hedge.

Posting this quarterly, dated, with reasoning, makes the analysis falsifiable. If a model release later this year reverses something said here, the reversal will be explicit. The position is intentionally exposed.

Worth revisiting around the second week of August. By then it should be clearer whether GPT-5.5’s cost shift holds up on real agent loops, whether Gemini 3 Flash has displaced Haiku for classification at volume, and what the next DeepSeek release does to the open-weight stack.

— Oasium AI · Applied AI consulting