PyTorch MLX Delegate: 3-6× faster generative AI on Apple Silicon, 90 ATen ops, Zubnet AI News

PyTorch published the ExecuTorch MLX Delegate on May 18: a new ExecuTorch backend that compiles and runs PyTorch models on Apple Silicon GPUs via Apple's MLX framework. The announcement reports 3-6× higher throughput on generative AI workloads compared with existing ExecuTorch delegates on macOS. The toolchain is standard: export with `torch.export`, lower with `to_edge_transform_and_lower` using the `MLXPartitioner`, then run the resulting `.pte` file with the ExecuTorch runtime. Supported models: Llama 3.2 1B, Qwen 3 (0.6B, 1.7B, 4B), Phi-4 mini (3.8B), Gemma 3 (1B, 4B), the Qwen 3.5 35B-A3B Mixture-of-Experts model, plus the speech models Whisper, NVIDIA Parakeet TDT, and Mistral Voxtral, with offline and real-time streaming. The delegate is experimental and under active development. Source: github.com/pytorch/executorch.
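The three-step flow described above (export, lower, serialize) can be sketched roughly as follows. This is a minimal sketch rather than the delegate's documented recipe: `export_to_pte` is a hypothetical helper name, and the partitioner is passed in by the caller because the article does not give the import location or constructor arguments of `MLXPartitioner`. The calls to `torch.export.export`, `to_edge_transform_and_lower`, and `to_executorch` follow the general ExecuTorch export pattern; check the repository for the MLX-specific details.

```python
from pathlib import Path


def export_to_pte(model, example_inputs, partitioner, out_path):
    """Export `model`, lower it with `partitioner`, and write a .pte file.

    `example_inputs` is a tuple of sample tensors for tracing.
    `partitioner` is supplied by the caller; for the MLX path it would be an
    MLXPartitioner instance (its import path is not given in the article, so
    it is deliberately not hard-coded here).
    """
    # Imported lazily so this module still loads on machines without
    # torch or executorch installed.
    import torch
    from executorch.exir import to_edge_transform_and_lower

    # 1. Capture the model as an ExportedProgram.
    exported = torch.export.export(model, example_inputs)

    # 2. Lower to the Edge dialect; the partitioner claims the subgraphs
    #    that the delegate can run.
    edge = to_edge_transform_and_lower(exported, partitioner=[partitioner])

    # 3. Serialize to the ExecuTorch program format and write it to disk.
    program = edge.to_executorch()
    out = Path(out_path)
    out.write_bytes(program.buffer)
    return out
```

Taking the partitioner as an argument also makes it easy to lower the same model with a different delegate and compare the two `.pte` files on the same machine.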

The architectural significance is the bridge between PyTorch's export stack and Apple's native ML runtime. Before this, running PyTorch models on a Mac meant choosing one of four paths: PyTorch's MPS backend (Metal, decent but not best-in-class), conversion to CoreML (Apple-native but requires the conversion pipeline), llama.cpp or Ollama (a separate runtime outside the PyTorch ecosystem), or MLX directly (Apple's framework, but it requires rewriting the model). The MLX Delegate lets you stay in PyTorch land (same `torch.export`, same TorchAO quantization, same ExecuTorch runtime) and get Apple-native GPU performance through MLX's Metal kernels. The set of 90 ATen ops the delegate currently supports is the gating constraint: anything that decomposes to those ops runs on the delegate; custom ops or unsupported decompositions fall back to other paths or fail.
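To make the gating constraint concrete, here is a small illustrative sketch of the coverage check it implies. Everything in it is an assumption for illustration: `split_by_support` is a hypothetical helper, the supported set is a made-up four-op subset (the real 90-op list is not in the article), and `my_ops.fused_rope.default` is an invented custom op. A real partitioner also works on subgraphs rather than on a flat list of op names.

```python
def split_by_support(graph_ops, supported_ops):
    """Split op names into delegated vs. fallback and report coverage."""
    delegated = [op for op in graph_ops if op in supported_ops]
    fallback = [op for op in graph_ops if op not in supported_ops]
    # An empty graph counts as fully covered.
    coverage = len(delegated) / len(graph_ops) if graph_ops else 1.0
    return delegated, fallback, coverage


# Hypothetical subset of a delegate's supported ATen ops.
SUPPORTED = frozenset({
    "aten.mm.default",
    "aten.add.Tensor",
    "aten.mul.Tensor",
    "aten._softmax.default",
})

# Op names as they might appear in an exported graph; the custom op is invented.
graph = [
    "aten.mm.default",
    "aten.add.Tensor",
    "my_ops.fused_rope.default",
    "aten._softmax.default",
]

delegated, fallback, coverage = split_by_support(graph, SUPPORTED)
print(fallback)           # ['my_ops.fused_rope.default']
print(f"{coverage:.0%}")  # 75%
```

In this toy graph one op out of four sits outside the supported set, so the best case is partial offload, with the custom op handled by another path.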

Position this in the on-device AI infrastructure stack. Apple's Foundation Models and CoreML cover Apple-native inference; llama.cpp and Ollama dominate quantized-LLM execution on consumer hardware; MLX is Apple's array framework. The MLX Delegate makes PyTorch a first-class citizen on Mac for generative AI, with the same toolchain Linux and server users have. The 3-6× number is measured against existing ExecuTorch macOS delegates specifically: not against MPS, not against CoreML, not against llama.cpp. The honest comparison would be the MLX Delegate versus Ollama for the same model on the same Mac; that benchmark isn't in the writeup. What is concrete: the MoE coverage (Qwen 3.5 35B-A3B) is rare for on-device runtimes, and the real-time speech streaming support (Voxtral) is non-trivial to engineer.

Monday: if you ship PyTorch models that need to run on Macs, whether consumer devices or developer machines, the MLX Delegate is the export path to try. Start with one of the supported model families (Llama, Qwen, Phi, Gemma) and benchmark against your current MPS or CoreML path. If you maintain custom ops that decompose to ATen primitives, check whether the decomposition fits within the 90-op support set; if it doesn't, you'll get partial offload at best. The experimental tag matters: APIs and supported features will change, so don't build the MLX Delegate into a load-bearing production path yet. The longer-term question is whether MLX becomes the default GPU backend for PyTorch on Mac, which depends on the delegate's stability trajectory and on whether Apple contributes more deeply to the upstream PyTorch repo. Watch the ExecuTorch GitHub repository for promotion from experimental to stable over the next 2-3 ExecuTorch releases.
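For the benchmarking step, a backend-agnostic harness along these lines may be enough to get a first number. It is a sketch under stated assumptions: `tokens_per_second` and `compare` are hypothetical helpers, and each `generate` callable is something you write yourself to wrap one runtime (your current MPS or CoreML path, or the MLX Delegate) and return the number of tokens it produced.

```python
import time


def tokens_per_second(generate, prompt, runs=3):
    """Return the best tokens/sec over `runs` timed calls to `generate(prompt)`.

    `generate` must return the number of tokens it produced.
    """
    generate(prompt)  # warm-up call, excluded from timing
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        # Guard against a zero interval on very fast stub functions.
        elapsed = max(time.perf_counter() - start, 1e-9)
        best = max(best, n_tokens / elapsed)
    return best


def compare(baseline, candidate, prompt, runs=3):
    """Measure both callables on the same prompt and report the speedup."""
    base = tokens_per_second(baseline, prompt, runs)
    cand = tokens_per_second(candidate, prompt, runs)
    return {"baseline_tps": base, "candidate_tps": cand, "speedup": cand / base}
```

Use the same model, quantization, prompt, and output length for both callables; otherwise the speedup figure compares configurations rather than backends.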
