Why local AI is real now (and where it isn't)

A year ago, if someone asked whether AI could run on a machine they owned, a laptop or a Mac Studio or a dedicated workstation, the honest answer was: yes, but you would be running last year’s model and you would know it. The productivity gap to a frontier API was wide enough that the question usually answered itself.

That answer has changed. Not for everything (the gap is still real on some workloads), but the line moved meaningfully through 2025 and into 2026. This post covers where the line sits now, the two different things people mean by “local AI,” and who actually benefits from each.

In short: local AI is now a real production option for a specific set of workloads, and it is still not the right choice for everything.

What “local” actually means here

Two different things often get folded under “local AI.” They solve different problems and run on different hardware. The distinction matters before any other decision gets made.

The first is local model weights: the actual neural network running on hardware that the user or organization controls. Open-weight models like Llama 4, Qwen 3, or DeepSeek V3.2, downloaded and then served on a Mac Studio, an NVIDIA workstation, or a dedicated server. No data leaves the building. This post is mostly about this version.

The second is a local agent that uses hosted models. The agent runs on the user’s computer, where it reads files, controls a browser, and executes tools. The underlying intelligence, though, is a frontier API call to Claude, GPT, or Gemini. Claude Code, OpenAI’s Codex app, the open-source OpenCUA project, and NVIDIA’s recently announced NemoClaw / OpenShell stack all sit here. The agent is local; the model usually is not.

Both are valid, and both are sometimes the right answer. Most “should we go local?” confusion comes from conflating them.

What changed on the model side

Open-weight models have closed most of the quality gap with frontier APIs on the workloads that matter most for on-prem deployments. Three releases stand out.

Llama 4 Maverick, Meta’s April 2025 release, uses a sparse mixture-of-experts design: only 17B parameters are active on any forward pass, even though roughly 400B total weights have to sit in memory. The result is strong general performance at the per-token compute cost of a much smaller model, on hardware no bigger than a single high-spec workstation with enough memory to hold the weights. On MMLU-Pro, the harder successor to the standard multiple-choice knowledge benchmark, the instruction-tuned Maverick scores 80.5. On most general-purpose tasks, the gap to a closed-API frontier model is small enough not to be the binding constraint.
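The split between total and active parameters is easier to see as arithmetic. The sketch below is a back-of-envelope estimate, not a figure from Meta: it assumes weights dominate memory use (ignoring the KV cache and activations) and uses the common rule of thumb of about 2 FLOPs per active parameter per token. The function names are made up for illustration.

```python
# Rough sizing for a sparse mixture-of-experts model.
# Assumptions (not from Meta): weights dominate memory, and per-token
# compute is about 2 FLOPs per active parameter.

def weight_memory_gb(total_params_b: float, bits_per_weight: int) -> float:
    """Memory needed to hold all weights, in GB (1 GB = 1e9 bytes)."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

def flops_per_token(active_params_b: float) -> float:
    """Approximate forward-pass FLOPs per generated token."""
    return 2 * active_params_b * 1e9

TOTAL_B = 400   # roughly 400B total weights, all resident in memory
ACTIVE_B = 17   # 17B parameters active on any forward pass

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(TOTAL_B, bits):.0f} GB")

# How much more compute a dense model of the same total size would need.
dense_ratio = flops_per_token(TOTAL_B) / flops_per_token(ACTIVE_B)
print(f"dense 400B vs. 17B active: about {dense_ratio:.1f}x the compute per token")
```

Under these assumptions, memory scales with the 400B total while compute scales with the 17B active, which is why the memory capacity of the machine, more than its raw FLOPS, decides whether the model fits.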

Qwen 3 235B, Alibaba’s flagship, set the open-weight bar for reasoning through the back half of 2025, strongest in its July “Thinking” refresh. On GPQA Diamond, a graduate-level science benchmark designed to be Google-proof, it scores 81.1; on the AIME 2025 math suite, 92.3. For on-prem reasoning workloads like internal scheduling, multi-step legal analysis, and technical Q&A over engineering documentation, the Qwen line is a default candidate, though no longer the only one.

DeepSeek V3.2, released in December, made the architectural step toward usable ultra-long context on the open side. DeepSeek reports gold-medal-level results on the 2025 IMO and IOI for the high-compute Speciale variant, and positions it against Gemini 3 Pro on reasoning contests. For document-heavy on-prem workloads like research libraries, legal archives, and compliance histories, V3.2 is the open-weight default.

The list keeps moving. Moonshot’s Kimi K2 Thinking took the open-weight lead on several reasoning benchmarks in November, and Alibaba answered with Qwen3.5 in February. The point is not that any of these match the closed frontier on every task. They do not. The point is that for the workloads that come up most often in on-prem work, the open-weight options are now good enough that quality is no longer the binding constraint.

What changed on the hardware side

NVIDIA’s DGX Spark shipped in October 2025 at $3,999. It is a desktop AI workstation built around the Grace Blackwell GB10 superchip, with 128GB of unified memory (packaged on the same module as the chip, so the model gets much faster access than it would from separate RAM) and roughly 1 petaFLOP of compute at FP4, the 4-bit precision that lets large models fit into less memory without much loss in quality. By NVIDIA’s own numbers, that is enough to fine-tune models in the 30B–70B parameter range and run inference on models up to roughly 200B. The price went up to $4,699 in February when memory supply tightened.

This month, Dell shipped the Pro Max with NVIDIA’s GB300 Grace Blackwell Ultra Superchip, the first OEM desktop with the new chip. 748GB of coherent memory, 20 petaFLOPS at FP4. That is roughly six times the memory and twenty times the FP4 throughput of the DGX Spark. It targets a different workload class: jobs that would otherwise need a small server room. It also ships ready to run NVIDIA’s NemoClaw and OpenShell, an open-source stack for always-on, sandboxed local AI agents, which puts a real agent-runtime story on a real piece of hardware that fits under a desk.

Apple’s Mac Studio with M3 Ultra is the alternative on the Apple side. Configurable with up to 512GB of unified memory, it runs 70B-class models on Apple Silicon at usable speeds under MLX (Apple’s machine-learning framework, closer in role to PyTorch than to CUDA itself, with Metal as the GPU layer underneath). The unified-memory architecture matters more than raw FLOPS for most inference patterns. Supply-chain reporting points to an M5-generation refresh later in 2026, though Apple has announced nothing, and the current generation is already enough for most on-prem inference today.

For shops standing up multi-user setups rather than single-operator workstations, a quantized 70B model served with vLLM on a pair of consumer-grade NVIDIA GPUs (RTX 4090 / 5090 class) handles most production workloads. Quantization, the technique that compresses model weights into smaller bit-widths, used to be a painful tradeoff with noticeable quality loss. Recent work on 4-bit quantization has narrowed that penalty to within the noise on most tasks.
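To make “compressing weights into smaller bit-widths” concrete, here is a toy round trip through symmetric 4-bit quantization. It is an illustration only, not the scheme any particular library uses: production methods typically quantize per group or per channel and often calibrate against sample data. The weight values are made up.

```python
# Toy symmetric 4-bit quantization of one small group of weights.
# Illustration only: real schemes work per group or per channel
# and are considerably more careful about outliers.

def quantize_4bit(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integer codes in [-7, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.12, -0.53, 0.98, -0.05, 0.44, -0.91, 0.30, 0.02]  # made-up values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the error at half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print(f"max round-trip error: {max_error:.3f} (bound: {scale / 2:.3f})")
```

Each weight now needs 4 bits plus a share of one scale value instead of 16 bits, at the price of a bounded rounding error. Whether that error matters for a given task is an empirical question the sketch does not answer.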

What runs locally, what doesn’t, what’s borderline

What follows is a working decision matrix for what fits on locally hosted weights right now. The lines move every six months; this is the current state.

Yes, run it locally. Long-document summarization on a defined corpus. Structured extraction from semi-structured inputs. Internal Q&A over a known set of documents. Code completion in private repositories. First-pass classification on high-volume queues. These are the workloads where a 70B-class open-weight model on a single workstation now performs well enough that the data-control upside outweighs any marginal quality difference.

Maybe local, case by case. Bounded agentic loops with well-defined tools and short context. Fine-tuned domain Q&A where there is enough labeled data to make the fine-tune worthwhile. Anything where the workload is steady-state and the cost-per-call calculation favors amortizing hardware over per-token API spend.

Probably not local, though a local agent might still be the right answer. Frontier reasoning that benefits from the latest closed-model capabilities: the kind of work where the most recent Claude, GPT, or Gemini release is meaningfully better than the best open-weight option. Multi-hour goal-driven agents. Multimodal video at frontier quality. For these workloads, running open weights locally is still a downgrade. But the agent that orchestrates the work (Claude Code, the Codex app, an OpenCUA agent, a NemoClaw / OpenShell setup) can still run on hardware the user controls, calling out to hosted APIs only for the parts that need them. That is a meaningfully different kind of “local,” and increasingly the right shape for users who want hands-on control without giving up frontier quality.
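For teams that want to keep the matrix somewhere more durable than a blog post, it can be written down as a small lookup table. This is only a sketch: the workload keys are hypothetical labels for the items above, the third bucket is simplified to “local agent, hosted model,” and the rule that unlisted workloads fall back to a case-by-case review is an assumption, not something the matrix itself specifies.

```python
from enum import Enum

class Placement(Enum):
    LOCAL_WEIGHTS = "open weights on local hardware"
    CASE_BY_CASE = "evaluate case by case"
    LOCAL_AGENT_HOSTED_MODEL = "local agent calling a hosted model"

# Keys are hypothetical labels for the workloads named in the matrix.
DECISION_MATRIX = {
    "long_document_summarization": Placement.LOCAL_WEIGHTS,
    "structured_extraction": Placement.LOCAL_WEIGHTS,
    "internal_document_qa": Placement.LOCAL_WEIGHTS,
    "private_repo_code_completion": Placement.LOCAL_WEIGHTS,
    "first_pass_classification": Placement.LOCAL_WEIGHTS,
    "bounded_agentic_loop": Placement.CASE_BY_CASE,
    "fine_tuned_domain_qa": Placement.CASE_BY_CASE,
    "steady_state_high_volume": Placement.CASE_BY_CASE,
    "frontier_reasoning": Placement.LOCAL_AGENT_HOSTED_MODEL,
    "multi_hour_agent": Placement.LOCAL_AGENT_HOSTED_MODEL,
    "frontier_multimodal_video": Placement.LOCAL_AGENT_HOSTED_MODEL,
}

def place(workload: str) -> Placement:
    """Look up a workload; unlisted ones default to case by case (an assumption)."""
    return DECISION_MATRIX.get(workload, Placement.CASE_BY_CASE)

for name in ("structured_extraction", "frontier_reasoning", "something_new"):
    print(f"{name}: {place(name).value}")
```

Keeping the table in version control makes the six-month drift visible: when a workload moves between buckets, the change shows up as a one-line diff.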

The matrix changes. It will look different by summer.

Who actually benefits

Local AI is rarely the right answer because it’s cheaper. It’s the right answer because of control. Several distinct groups make that calculation differently.

Sensitive-data environments. Research labs working with subject data. Healthcare providers with patient records. Government with classified material. The regulatory line is often not negotiable. Local is the only way to get useful AI inside it.

Sovereignty-driven procurement. Some regions and industries treat data sovereignty as a default rather than an exception. The Gulf is one example, where local-first AI is increasingly part of the procurement conversation; large parts of European public-sector and defense procurement are similar. The question for these buyers is rarely “is local as good as cloud?” It is “what works, given that cloud is not an option?”

Offline-by-design contexts. Drone operations in regions without reliable connectivity. Autonomous mobility with onboard inference. Vessels, rigs, vehicles, and field deployments without high-bandwidth uplinks. The workload shape simply requires local.

Solo operators and small businesses. A single owner running their own books, a small consultancy automating internal workflows, an independent professional building an AI assistant against their own files. For this group, cost is often the deciding factor: running a 70B-class model on a Mac Studio or DGX Spark amortizes over months, while equivalent API spend at the same query volume can be substantial. Privacy is a secondary motivator.

Personal AI builders. People who want an AI assistant on their own machine: running while offline, learning from their own files, executing tools without going through a third party. The hardware to run a usable system at home is now broadly available; the software stack improves each month. For this group, local AI is less about cost or compliance and more about owning the workflow.
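The amortization argument for solo operators and small businesses comes down to a break-even calculation. The sketch below uses the DGX Spark price quoted earlier in this post; the monthly API spend and the local running cost are placeholder numbers, not measurements, and the function name is hypothetical. It also ignores setup and maintenance time, which is often the larger cost.

```python
def breakeven_months(hardware_cost: float,
                     monthly_api_spend: float,
                     monthly_local_running_cost: float = 0.0) -> float:
    """Months until owned hardware pays for itself versus API spend."""
    monthly_saving = monthly_api_spend - monthly_local_running_cost
    if monthly_saving <= 0:
        # At this volume the API is cheaper; the hardware never pays back.
        return float("inf")
    return hardware_cost / monthly_saving

# Hardware price is from this post; the other two figures are placeholders.
months = breakeven_months(hardware_cost=4699,
                          monthly_api_spend=500,
                          monthly_local_running_cost=30)
print(f"break-even after about {months:.1f} months")
```

At low query volume the function returns infinity, which is the arithmetic behind the claim that local is rarely the cheaper option by default: the case only works when usage is steady and high enough to outrun the hardware price.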

What this changes

For most use cases, a frontier API is still the simplest and best choice. Local AI is not a default.

It is the right answer for the workloads that fit one of the patterns above, and where the people deploying it have the engineering discipline to maintain their own infrastructure. For that set, it is no longer a downgrade. The line is in a different place this year, and every six months the line moves further toward the local side.

We will be wrong about parts of this in six months. The hardware roadmap moves. The model roadmap moves. The pattern, though, is durable: the set of workloads where local is a real production option gets larger, not smaller.