Ollama taps Apple’s MLX framework to make local AI models faster on Macs

Support for NVIDIA’s NVFP4 format also allows larger models to run under tighter hardware constraints.

Mar 31st, 2026 8:00am by Paul Sawers

Running large language models (LLMs) locally has often meant accepting slower speeds and tighter memory limits. Ollama’s latest update, built on Apple’s MLX framework, goes some way toward easing those constraints – especially for developers running AI agents directly on their machines.

In tandem, the release also introduces support for NVIDIA’s NVFP4 format, which targets memory efficiency for larger models.

For context, Ollama is an open-core runtime for running LLMs locally. It offers a growing catalogue of open-weight models from major AI labs such as Meta, Google, Mistral, and Alibaba, which developers can download and run on their own machines or private infrastructure. It also integrates with coding agents, assistants, and developer tools, allowing those tools to run on locally hosted models instead of relying solely on external APIs.
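To make the "local model instead of an external API" idea concrete, here is a minimal sketch of how a tool might call a locally running Ollama server over its HTTP API. It assumes Ollama is listening on its default port (11434) and that a model has already been pulled; the helper names `build_request` and `generate` are hypothetical, and the model name passed in is a placeholder rather than a real tag.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local model and return the generated text."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read())
    # With "stream": False the server replies with a single JSON object
    # whose "response" field holds the full completion.
    return body["response"]
```

Calling `generate("your-local-model", "Explain unified memory in one sentence.")` would then return the completion as a string, with the prompt never leaving the machine. Swap the placeholder for whichever model tag is installed locally.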

Local speed gains

News emerged in early 2025 that Ollama was developing support for MLX, an open source machine learning framework Apple introduced in 2023 to run models efficiently on Apple Silicon. Its core feature — and that of Apple’s modern hardware — is a shared memory model that allows CPU and GPU workloads to operate on the same data without the usual transfer overhead, reducing latency and improving throughput during inference.

Ollama is now officially plugging directly into that architecture with its latest release. In its announcement on Monday, the company points to improvements in both responsiveness and generation speed, particularly for coding-focused models.

***MLX boosts responsiveness and generation speed***

The update also introduces changes such as more efficient caching and support for newer quantization formats, which help reduce latency during interactive use.

These improvements make local models more responsive during everyday use. Running models locally avoids sending data to external services and gives developers tighter control over how systems are deployed. And by improving how those models run on Apple hardware, Ollama is making that setup more viable for everyday development work.

Right now, MLX support is limited to the new Qwen3.5-35B-A3B model, though others are likely to follow.

**Local agent runtimes available in Ollama’s CLI**

OpenClaw and the shift toward local agents and models

The timing of the MLX update aligns with a surge of interest in agent-style systems that operate on a user’s machine. OpenClaw is probably the most notable example of late, climbing GitHub’s rankings and passing long-established open source projects in star count within a matter of months.

OpenClaw serves as a local AI assistant that can interact with messaging platforms, files, and external tools, executing tasks directly on a user’s machine. Its growth reflects demand for systems that do more than generate text, instead carrying out tasks across different environments. And while OpenClaw can use remote models, many users prefer to run models locally, which tends to be cheaper but significantly slower than calling a remote model over an API.

The project’s rapid growth has also brought scrutiny. Security researchers have identified real risks tied to how agent systems operate: making decisions at runtime, chaining tools together, and interacting across multiple services and permission layers. This creates exposure to issues such as data leakage and prompt injection, particularly where controls are limited or poorly defined.

Still, there’s no denying the appeal. A local agent can act across tools without relying on external APIs, giving users direct control over how tasks are executed and where data is processed. And with Ollama now integrating MLX, that setup with a local model becomes faster and more responsive on Apple hardware.

The NVIDIA factor

Alongside this, Ollama has added support for NVIDIA’s proprietary NVFP4 format, a “low-precision inference” format designed to reduce memory usage and bandwidth while maintaining model accuracy.

NVFP4 stores model weights in far fewer bits than formats such as FP16, allowing larger models to run under tighter hardware constraints. Models optimized in NVFP4 can produce outputs closer to those used in production systems, while still running on a developer’s own machine.
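A rough back-of-the-envelope calculation shows why the format matters for memory. The sketch below assumes a 35-billion-parameter model (taken from the Qwen3.5-35B-A3B name) and assumes NVFP4 stores each weight as a 4-bit value plus one 8-bit scale shared by every block of 16 weights, as NVIDIA has described the format. It ignores the per-tensor scale, any layers left unquantized, and runtime memory such as the KV cache, so the figures are estimates for the weights only. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 35e9  # assumed total parameter count

FP16_BITS = 16.0
# Assumption: 4-bit value plus an 8-bit scale amortized over 16 weights.
NVFP4_BITS = 4.0 + 8.0 / 16

fp16_gb = weight_memory_gb(PARAMS, FP16_BITS)
nvfp4_gb = weight_memory_gb(PARAMS, NVFP4_BITS)

print(f"FP16 weights:  {fp16_gb:.1f} GB")
print(f"NVFP4 weights: {nvfp4_gb:.1f} GB")
print(f"Reduction:     {fp16_gb / nvfp4_gb:.2f}x")
```

Under these assumptions the weights shrink from roughly 70 GB to just under 20 GB, which is the difference between needing server-class hardware and fitting on a well-equipped workstation.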

Together, these changes point to a shift in how and where AI systems are run. MLX improves performance on Apple hardware, while NVFP4 reduces the cost of running larger models. Ollama packages both into a single runtime, with tools like OpenClaw sitting on top to automate real-world tasks.

The result is a local-first stack that is becoming easier to run and closer to production-grade usage, particularly where control over data and execution is imperative.

Paul is an experienced technology journalist covering some of the biggest stories from Europe and beyond, most recently at TechCrunch where he covered startups, enterprise, Big Tech, infrastructure, open source, AI, regulation, and more. Based in London, these days Paul...