mKernel: UCCL fuses NVLink, RDMA and compute into one persistent CUDA kernel

UC Berkeley's UCCL project has released mKernel, an MIT-licensed library that fuses intra-node NVLink communication, inter-node RDMA, and dense compute into single persistent CUDA kernels, running them simultaneously instead of as sequential stages. The motivating number is the one frontier-training teams know: communication consumes 43.6% of the forward pass and 32% of end-to-end training time, rising to as much as 47% of total execution for MoE models, where expert-parallel all-to-all dominates. If up to half of your training time is the network waiting on compute or compute waiting on the network, the kernel boundary between them is where the waste lives, and mKernel's bet is to remove that boundary.

Two mechanisms drive the gain. First, CPU-side overhead has not shrunk as GPUs have gotten faster: every separate kernel launch and synchronization check costs microseconds, which now show up as measurable pipeline delays on H100/H200-class hardware. A persistent fused kernel pays that cost once instead of once per stage. Second, fusion enables fine-grained overlap inside the kernel at tile/chunk granularity: instead of finishing all communication and then starting compute (or vice versa) at coarse kernel boundaries, mKernel interleaves them inside one kernel, so a tile that has arrived over RDMA feeds the GEMM while the next tile is still in flight. The library ships five fused kernels: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Attention, and GEMM+ReduceScatter. Together they cover collective-plus-GEMM patterns, token routing for expert parallelism, and sequence-parallel attention. The kernels were tested on 2-node H200 clusters over AWS EFA or ConnectX-7 InfiniBand.
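To make the tile-level overlap concrete, here is a toy timing model in Python. It is a sketch under stated assumptions, not mKernel's scheduler: the tile count and per-tile costs are made-up illustrative numbers, and the function names are hypothetical. It compares a coarse boundary (all communication, then all compute) with a pipeline in which each tile is computed as soon as it has arrived.

```python
def sequential_time(comm, compute):
    """Coarse kernel boundary: transfer every tile, then compute every tile."""
    return sum(comm) + sum(compute)


def pipelined_time(comm, compute):
    """Tile-granular overlap on one link and one compute stream.

    Transfers are issued back to back. Tile i starts computing once it has
    arrived AND tile i-1 has finished computing.
    """
    arrived = 0.0  # time at which the current tile has fully arrived
    done = 0.0     # time at which the previous tile finished computing
    for c, g in zip(comm, compute):
        arrived += c
        done = max(done, arrived) + g
    return done


# Illustrative assumption: 8 tiles, 2 time units to transfer and 3 to compute each.
tiles = 8
comm = [2.0] * tiles
compute = [3.0] * tiles

seq = sequential_time(comm, compute)
pipe = pipelined_time(comm, compute)
print(f"sequential={seq} pipelined={pipe}")  # sequential=40.0 pipelined=26.0
```

In this model only the first tile's transfer is exposed; every later transfer hides behind the previous tile's compute. Real kernels also contend for SMs, memory bandwidth, and link bandwidth, so measured gains will be smaller than this idealized figure.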

The ecosystem read: communication-compute fusion is the next efficiency frontier for multi-node training now that single-GPU kernel optimization is mature. NCCL and NVSHMEM treat communication as a separate primitive from compute; the persistent-kernel fusion approach is what closes the kernel-boundary overlap gap that those libraries structurally cannot. For MoE specifically, where communication is the single largest time sink at up to 47%, this is the highest-leverage place to optimize, which is why MoE Dispatch+GEMM is one of the five shipped kernels. The structural signal is that this came from academia under an MIT license, not from a hardware vendor. DeepSeek's DeepEP and NVIDIA's NVSHMEM are the closest comparisons, and an open MIT-licensed alternative changes who can build on comm-compute fusion without vendor lock-in.

The honest caveats: the writeup gives no head-to-head speedup numbers against NCCL or DeepEP, the testing is 2-node H200 only (behavior at larger node counts is the open question), and persistent fused kernels are notoriously hard to debug and tune. If you train MoE or other large models across multiple nodes, mKernel is worth a benchmark on your own fabric, especially if communication is your measured bottleneck. But profile your communication fraction first, reproduce on your node count, and treat the absence of published NCCL comparisons as the thing to verify before betting a training run on it.
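Before benchmarking, it helps to know the ceiling. The sketch below is back-of-the-envelope arithmetic, not a measured mKernel result: it assumes communication and compute are currently fully serialized and could be perfectly overlapped, so the step time drops from their sum to the larger of the two. The function name is hypothetical; the fractions are the ones quoted earlier in this article.

```python
def max_overlap_speedup(comm_fraction):
    """Upper bound on speedup if communication overlapped perfectly with compute.

    comm_fraction is the share of step time spent on communication today,
    assuming no overlap at present. With perfect overlap the step takes
    max(comm, compute) instead of comm + compute.
    """
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be between 0 and 1")
    return 1.0 / max(comm_fraction, 1.0 - comm_fraction)


# Fractions quoted in the article: 32% end-to-end, up to 47% for MoE.
for f in (0.32, 0.47):
    print(f"comm={f:.0%} -> at most {max_overlap_speedup(f):.2f}x")
```

Under these assumptions the ceiling is roughly 1.47x at a 32% communication fraction and roughly 1.89x at 47%. If your profiled fraction is much lower, fusion has correspondingly less to recover.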
