Local AI · Mar 26, 2026 · 9 min read

Why local AI is real now (and where it isn't)

The honest answer about local AI changed sometime in the past year. Here is what runs locally now, what doesn't, what 'local' really means, and who actually benefits.

A year ago, if someone asked whether AI could run on hardware they owned — a laptop, a desktop, a Mac Studio, a dedicated workstation — the honest answer was: yes, but you would be running last year’s model and you would know it. The productivity gap to a frontier API was wide enough that the question usually answered itself.

That answer has changed. Not for everything — the gap is still real on some workloads — but the line moved meaningfully through 2025 and into 2026. The point of this post is to say where the line is now, what the two different things people mean by “local AI” really are, and who actually benefits from each.

Local AI is now a real production option for a specific set of workloads. It is still not the right choice for everything. The line moved.

What “local” actually means here

Two different things often get folded under “local AI.” They solve different problems and run on different hardware. The distinction matters before any other decision gets made.

The first is local model weights: the actual neural network running on hardware that the user or organization controls. Open-weight models such as Llama 4, Qwen 3, or DeepSeek V3.2/V4 are downloaded, then served on a Mac Studio, an NVIDIA workstation, or a dedicated server. No data leaves the building. This post is mostly about this version.

The second is a local agent that uses hosted models. The agent runs on the user’s computer (reading files, controlling a browser, executing tools), but the underlying intelligence is a frontier API call to Claude, GPT, or Gemini. Claude Code, OpenAI’s Codex Desktop, the open-source OpenCUA project, and NVIDIA’s recently announced NemoClaw / OpenShell stack all sit here. The agent is local; the model usually is not.

Both are valid. Both are sometimes the right answer. Conflating them is the source of most “should we go local?” confusion in procurement conversations.

What changed on the model side

Open-weight models have closed most of the quality gap with frontier APIs on the workloads that matter most for on-prem deployments. Three releases stand out.

Llama 4 Maverick, released by Meta in 2025, uses a sparse mixture-of-experts design in which 17B parameters are active on any forward pass, even though roughly 400B total weights sit in memory. The result is frontier-grade general performance at an inference cost that a single high-spec workstation can carry. On MMLU (the standard knowledge benchmark, a multiple-choice test covering 57 academic and professional subjects), Maverick scores 85.5%. On most general-purpose tasks, the gap to a closed-API frontier model is small enough not to be the binding constraint.
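The split between active and total parameters is easier to see with rough numbers. The sketch below is a back-of-envelope estimate, not a measurement: it assumes weights dominate memory, ignores KV cache and runtime overhead, and uses hypothetical helper names. Only the 17B and 400B figures come from the paragraph above.

```python
# Back-of-envelope sketch: memory is set by TOTAL parameters,
# per-token compute by ACTIVE parameters. Helper names are hypothetical.

def weight_memory_gb(total_params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return total_params_billion * bits_per_weight / 8


def active_fraction(active_params_billion: float, total_params_billion: float) -> float:
    """Share of the weights that take part in a single forward pass."""
    return active_params_billion / total_params_billion


TOTAL_B, ACTIVE_B = 400, 17  # figures quoted for Llama 4 Maverick above

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(TOTAL_B, bits):.0f} GB")
print(f"active per forward pass: {active_fraction(ACTIVE_B, TOTAL_B):.1%}")
```

Under these assumptions, all 400B weights must stay resident (about 200 GB at 4 bits per weight), while only about 4% of them are exercised on each pass. That is why a mixture-of-experts model is sized by memory first and by compute second.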

Qwen 3 235B — Alibaba’s release — leads on reasoning-heavy work. On GPQA Diamond, a graduate-level science benchmark designed to be Google-proof, it scores 77.2%. On the AIME math suite, 85.7%. For on-prem reasoning workloads — internal scheduling, multi-step legal analysis, technical Q&A over engineering documentation — Qwen 3 is the strongest open-weight option.

DeepSeek V3.2, released in late 2025, made the architectural step toward usable ultra-long context on the open side. The Speciale variant matches Gemini 3 Pro on math and coding contests and earned gold-medal results in the 2025 IMO and IOI. For document-heavy on-prem workloads (research libraries, legal archives, compliance histories), V3.2 is the open-weight default.

The point is not that any of these match the closed frontier on every task. They do not. The point is that for the specific workloads encountered most in on-prem work, the open-weight options are now good enough that quality is no longer the binding constraint.

What changed on the hardware side

NVIDIA’s DGX Spark shipped in October 2025 at $3,999. It is a desktop AI workstation built around the Grace Blackwell GB10 superchip, with 128GB of unified LPDDR5x memory (a low-power form of DDR5 memory packaged on the same module as the chip, which gives the model much faster access than separate RAM would) and around 1,000 TOPS (trillions of operations per second) of inference, which is the same figure as roughly 1 petaFLOP at FP4 precision. FP4 is a 4-bit floating-point format that lets large models fit into less memory without much loss in quality. That is enough to fine-tune in the 30B–70B parameter range and run inference on models up to roughly 200B. NVIDIA raised the price to $4,699 in February 2026 when memory supply tightened.

This month, Dell shipped the Pro Max with NVIDIA’s GB300 Grace Blackwell Ultra Superchip — the first OEM desktop with the new chip. 748GB of coherent memory, 20 petaFLOPS at FP4. That is roughly six times the memory and twenty times the throughput of the DGX Spark. It targets a different workload class: the place where the alternative would otherwise be a small server room. It also ships ready to run NVIDIA’s NemoClaw and OpenShell — an open-source stack for always-on, sandboxed local AI agents — putting a real agent-runtime story on a real piece of hardware that fits under a desk.
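The memory figures for these two machines can be turned into a rough model-size ceiling. This is a minimal sketch under stated assumptions: the 20% reserve for KV cache and runtime is a placeholder rather than a vendor figure, the helper name is hypothetical, and only the 128GB and 748GB capacities come from the text.

```python
# Sketch: largest model (in billions of parameters) whose weights fit in a
# given memory pool. RESERVE_FRACTION is an assumed allowance, not a spec.

RESERVE_FRACTION = 0.2  # assumption: 20% held back for KV cache and runtime


def max_params_billion(memory_gb: float, bits_per_weight: int,
                       reserve_fraction: float = RESERVE_FRACTION) -> float:
    """Billions of parameters that fit after setting the reserve aside."""
    usable_gb = memory_gb * (1 - reserve_fraction)
    return usable_gb / (bits_per_weight / 8)


machines_gb = {"DGX Spark": 128, "Dell Pro Max (GB300)": 748}  # from the text

for name, memory_gb in machines_gb.items():
    for bits in (16, 4):
        ceiling = max_params_billion(memory_gb, bits)
        print(f"{name}: ~{ceiling:.0f}B params at {bits}-bit")

memory_ratio = machines_gb["Dell Pro Max (GB300)"] / machines_gb["DGX Spark"]
print(f"memory ratio: {memory_ratio:.2f}x")
```

With a 20% reserve, 128GB leaves room for roughly 200B parameters at 4 bits, which lines up with the inference ceiling quoted for the DGX Spark. The memory ratio between the two machines comes out a little under six.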

Apple’s Mac Studio with M3 Ultra is the alternative on the Apple side. Configurable with up to 512GB of unified memory, it runs 70B-class models on Apple Silicon under MLX (Apple’s machine-learning framework, which plays a role roughly analogous to NVIDIA’s CUDA stack) at usable speeds. The unified-memory architecture matters more than raw FLOPS for most inference patterns. Apple has signaled an M5 Ultra Mac Studio later in 2026; the current generation is already enough for most on-prem inference today.

For shops standing up multi-user setups rather than single-operator workstations, vLLM serving a quantized 70B model on consumer-grade NVIDIA hardware (RTX 4090 / 5090 class GPUs) handles most production workloads. Quantization, the technique that compresses model weights into smaller bit-widths, used to be a painful tradeoff with noticeable quality loss. Recent work on 4-bit quantization has narrowed that penalty to within the noise on most tasks.
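To make "compressing weights into smaller bit-widths" concrete, here is a minimal sketch of symmetric round-to-nearest quantization over one small group of weights. It is the simplest possible scheme, shown for illustration only; it is not the recent 4-bit work mentioned above, which picks scales and rounding far more carefully. The function names and the sample weights are made up.

```python
# Sketch of symmetric round-to-nearest quantization over one group of weights.
# Simplest scheme only; production 4-bit methods are more sophisticated.

def quantize_group(weights, bits=4):
    """Map floats to signed integer codes plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest step, then clamp to the signed range [-8, 7].
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate floats from integer codes and the shared scale."""
    return [c * scale for c in codes]


group = [0.42, -0.17, 0.03, -0.61, 0.25, 0.0, -0.08, 0.55]  # made-up weights
codes, scale = quantize_group(group)
restored = dequantize_group(codes, scale)
max_error = max(abs(a - b) for a, b in zip(group, restored))

print("codes:", codes)
print(f"scale: {scale:.4f}, max round-trip error: {max_error:.4f}")
```

Each weight is stored as a 4-bit code plus one scale shared by the group, and the round-trip error on any weight is bounded by half a quantization step. The quality question is whether errors of that size change the model's outputs, which is what better quantization methods work to minimize.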

What runs locally, what doesn’t, what’s borderline

What follows is a working decision matrix for what fits on locally hosted weights right now. The lines move every six months; this is the current state.

Yes, run it locally. Long-document summarization on a defined corpus. Structured extraction from semi-structured inputs. Internal Q&A over a known set of documents. Code completion in private repositories. First-pass classification on high-volume queues. These are the workloads where a 70B-class open-weight model on a single workstation now performs well enough that the data-control upside outweighs any marginal quality difference.

Maybe local — case by case. Bounded agentic loops with well-defined tools and short context. Fine-tuned domain Q&A where there is enough labeled data to make the fine-tune worthwhile. Anything where the workload is steady-state and the cost-per-call calculation favors amortizing hardware over per-token API spend.

Probably not local — but a local agent might still be the right answer. Frontier reasoning that benefits from the latest closed-model capabilities — the kind of work where the most recent Claude, GPT, or Gemini release is meaningfully better than the best open-weight option. Multi-hour goal-driven agents. Multimodal video at frontier quality. For these workloads, running open weights locally is still a downgrade. But the agent that orchestrates the work — Claude Code, Codex Desktop, an OpenCUA agent, a NemoClaw / OpenShell setup — can still run on hardware the user controls, calling out to hosted APIs only for the parts that need them. That is a meaningfully different kind of “local,” and increasingly the right shape for users who want hands-on control without giving up frontier quality.
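The three rows above can be written down as a lookup that an orchestrating agent might consult before deciding where a task runs. This is only a sketch: the category labels, the workload keys, the `route` helper, and the on-prem override flag are all hypothetical, and a real deployment would classify workloads with far more care.

```python
# Sketch: the decision matrix as a lookup a local agent could consult.
# Labels, keys, and the route() helper are hypothetical.

LOCAL = "local weights"
CASE_BY_CASE = "case by case"
HOSTED = "hosted model via local agent"

MATRIX = {
    "long-document summarization": LOCAL,
    "structured extraction": LOCAL,
    "internal document Q&A": LOCAL,
    "private-repo code completion": LOCAL,
    "first-pass classification": LOCAL,
    "bounded agentic loop": CASE_BY_CASE,
    "fine-tuned domain Q&A": CASE_BY_CASE,
    "frontier reasoning": HOSTED,
    "multi-hour agent": HOSTED,
    "frontier multimodal video": HOSTED,
}


def route(workload: str, data_must_stay_on_prem: bool = False) -> str:
    """Pick where a workload runs; a hard data constraint overrides the matrix."""
    if data_must_stay_on_prem:
        return LOCAL  # assumption: a compliance constraint beats quality preference
    return MATRIX.get(workload, CASE_BY_CASE)  # unknown workloads need review


print(route("structured extraction"))
print(route("frontier reasoning"))
print(route("frontier reasoning", data_must_stay_on_prem=True))
```

Keeping the matrix as data rather than logic fits how it is described here: the entries are expected to change every few months, while the routing code around them stays the same.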

The matrix changes. It will look different by summer.

Who actually benefits

Local AI is rarely the right answer because it’s cheaper. It’s the right answer because of control. Several distinct groups make that calculation differently.

Sensitive-data environments. Research labs working with subject data. Healthcare providers with patient records. Government with classified material. The regulatory line is often not negotiable. Local is the only way to get useful AI inside it.

Sovereignty-driven procurement. Some regions and industries treat data sovereignty as a default rather than an exception. The Gulf is one example, where local-first AI is increasingly part of the procurement conversation; large parts of European public-sector and defense procurement are similar. The question for these buyers is rarely “is local as good as cloud?” — it is “what works given that cloud is not an option?”

Offline-by-design contexts. Drone operations in regions without reliable connectivity. Autonomous mobility with onboard inference. Vessels, rigs, vehicles, and field deployments without high-bandwidth uplinks. The workload shape simply requires local.

Solo operators and small businesses. A single owner running their own books, a small consultancy automating internal workflows, an independent professional building an AI assistant against their own files. For this group, cost is often the deciding factor — running a 70B-class model on a Mac Studio or DGX Spark amortizes over months, while equivalent API spend at the same query volume can be substantial. Privacy is a secondary motivator.

Personal AI builders. People who want an AI assistant on their own machine — running while offline, learning from their own files, executing tools without going through a third party. The hardware to run a usable system at home is now broadly available; the software stack improves each month. For this group, local AI is less about cost or compliance and more about owning the workflow.
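The amortization argument for solo operators and small businesses reduces to a break-even calculation. The sketch below shows the shape of that arithmetic under loud assumptions: the monthly API spend and the monthly local running cost are hypothetical placeholders, the helper name is invented, and only the $4,699 hardware price comes from this post.

```python
# Sketch: months until owned hardware costs less than continued API spend.
# All monthly figures are hypothetical placeholders.

from typing import Optional


def breakeven_months(hardware_cost: float, monthly_api_spend: float,
                     monthly_local_cost: float) -> Optional[float]:
    """Return months to break even, or None if local never pays back."""
    monthly_saving = monthly_api_spend - monthly_local_cost
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


HARDWARE_COST = 4699      # DGX Spark price quoted earlier in the post
MONTHLY_API_SPEND = 600   # hypothetical
MONTHLY_LOCAL_COST = 100  # hypothetical: power and maintenance time

months = breakeven_months(HARDWARE_COST, MONTHLY_API_SPEND, MONTHLY_LOCAL_COST)
print(f"break-even after ~{months:.1f} months")
```

With these placeholder numbers the hardware pays for itself in a little over nine months. The result is sensitive to the saving per month: at low or bursty query volume the function returns `None` or a very long payback, which is the case where a hosted API stays the cheaper option.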

What this changes

For most use cases, a frontier API is still the simplest and best choice. Local AI is not a default.

It is the right answer for the workloads that fit one of the patterns above — and where the people deploying it have the engineering discipline to maintain their own infrastructure. For that set, it is no longer a downgrade. The line is in a different place this year, and every six months the line moves further toward the local side.

We will be wrong about parts of this in six months. The hardware roadmap moves. The model roadmap moves. The pattern, though, is durable: the set of workloads where local is a real production option gets larger, not smaller.

— Oasium AI · Applied AI consulting