Zubnet AI Learning Wiki › Mixture of Experts

Mixture of Experts

Also known as: MoE
An architecture in which the model contains multiple "expert" subnetworks, but only a few activate for each input. A router network decides which experts are relevant for a given token. This means the model's total parameter count can exceed 100B while any single forward pass uses only around 20B.

Why It Matters

MoE is how models like Mixtral and (reportedly) GPT-4 get the quality of a huge model at the speed of a smaller one. The trade-off is higher memory usage (all experts must be loaded) even though compute is cheaper.

Deep Dive

In a standard Transformer, every token passes through the same feedforward network (FFN) in each layer. In an MoE Transformer, that single FFN is replaced with multiple parallel FFNs — the "experts" — plus a small routing network (often called a gate) that decides which experts process each token. Typically, the gate selects the top-k experts (usually 2) and blends their outputs using the gate's softmax weights. The key insight is that total parameter count can be massive (giving the model enormous capacity to memorize and generalize), while per-token compute stays manageable because most experts are idle for any given input. Mixtral 8x7B, for example, has roughly 47B total parameters but activates only about 13B per token.
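The mechanics above can be sketched in a few dozen lines. The following toy layer (class and parameter names are my own, not taken from any real model's code) routes each token to its top-k experts and blends their outputs with softmax weights over the selected gate logits, which is the scheme Mixtral describes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoEFFN:
    """Toy top-k MoE feedforward layer. Sizes and names are illustrative."""
    def __init__(self, d_model=16, d_hidden=32, n_experts=8, k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k
        self.gate = rng.normal(0.0, 0.02, (d_model, n_experts))         # routing network
        self.w1 = rng.normal(0.0, 0.02, (n_experts, d_model, d_hidden))  # expert FFNs
        self.w2 = rng.normal(0.0, 0.02, (n_experts, d_hidden, d_model))

    def forward(self, x):                       # x: (n_tokens, d_model)
        logits = x @ self.gate                  # (n_tokens, n_experts)
        topk = np.argsort(logits, axis=-1)[:, -self.k:]               # top-k expert ids
        weights = softmax(np.take_along_axis(logits, topk, axis=-1))  # blend weights
        out = np.zeros_like(x)
        for t in range(x.shape[0]):             # per-token loop for clarity, not speed
            for j in range(self.k):
                e = topk[t, j]
                h = np.maximum(x[t] @ self.w1[e], 0.0)   # ReLU FFN stand-in
                out[t] += weights[t, j] * (h @ self.w2[e])
        return out
```

Note that `w1` and `w2` hold all eight experts' weights even though each token only touches two of them — the memory-versus-compute asymmetry discussed below falls directly out of this structure.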

The Routing Problem

The routing mechanism is where most of the engineering complexity lives. A naive router might send all tokens to the same few experts, leaving others unused — a problem called expert collapse. To prevent this, MoE models use auxiliary load-balancing losses that penalize uneven expert utilization during training. The original Switch Transformer from Google used top-1 routing (one expert per token) and achieved impressive scaling, but most modern MoE models prefer top-2 routing for stability. Some newer approaches like DeepSeekMoE add shared experts that always activate alongside the routed ones, ensuring a baseline level of processing for every token regardless of routing decisions.
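The Switch Transformer's load-balancing loss can be sketched as follows: with f_i the fraction of tokens dispatched to expert i and P_i the mean router probability for expert i, the loss N · Σ f_i · P_i is minimized at 1.0 when routing is uniform and grows as routing collapses. A minimal version (function name is my own; this follows the paper's formula, not its code):

```python
import numpy as np

def load_balancing_loss(router_logits, expert_index, n_experts):
    """Auxiliary loss penalizing uneven expert utilization (top-1 routing).

    router_logits: (n_tokens, n_experts) raw gate scores
    expert_index:  (n_tokens,) expert chosen for each token
    """
    e = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    # f_i: fraction of tokens actually dispatched to expert i
    f = np.bincount(expert_index, minlength=n_experts) / len(expert_index)
    # P_i: mean router probability assigned to expert i
    P = probs.mean(axis=0)
    # Uniform routing gives exactly 1.0; collapse pushes this toward n_experts.
    return n_experts * float(np.dot(f, P))
```

Because f depends on hard dispatch counts and P on soft probabilities, gradients flow through P and pull the router toward spreading probability mass evenly.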

Memory Versus Compute

The trade-off that defines MoE deployment is memory versus compute. Even though only a fraction of experts are active per token, all of them must be loaded into memory. An 8x7B MoE model needs roughly the same memory as a dense 47B model, even though it runs at roughly the speed of a dense 13B model. This makes MoE models awkward for consumer hardware — if you can only fit 13B parameters in your GPU VRAM, you would get the same inference speed from a dense 13B model without the MoE overhead. MoE really shines when you have enough memory to hold the full model and want maximum quality per FLOP. This is why it is a natural fit for cloud serving: providers like OpenAI and Mistral can provision enough memory on their clusters, and per-request compute cost is what drives their margins.
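A back-of-envelope calculation makes the asymmetry concrete. Assuming fp16 weights (2 bytes per parameter) and the rough rule of thumb of ~2 FLOPs per active parameter per token:

```python
# Memory vs compute for a Mixtral-like 8x7B MoE (illustrative numbers).
total_params = 47e9            # every expert must be resident in memory
active_params = 13e9           # parameters actually used per token (top-2 of 8)
bytes_per_param = 2            # fp16

weight_mem_gb = total_params * bytes_per_param / 1e9
flops_per_token = 2 * active_params   # ~2 FLOPs per active parameter per token

print(f"weights in VRAM: ~{weight_mem_gb:.0f} GB  (the memory of a dense 47B)")
print(f"compute/token:   ~{flops_per_token:.1e} FLOPs  (the compute of a dense 13B)")
```

So you pay for ~94 GB of weights to get 13B-class per-token compute — a good deal only if that memory is available and well utilized.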

Expert parallelism is a deployment pattern specific to MoE. In a multi-GPU setup, you can place different experts on different GPUs so each device only holds and computes a subset of experts. Tokens get routed across GPUs based on which experts they need, processed, and results are gathered back. This introduces all-to-all communication overhead but allows models far too large for a single device to run efficiently. Google's GShard and Switch Transformer papers demonstrated this at scale, and it is how the largest MoE models are served in production today.
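The dispatch step can be simulated in a few lines. This toy setup (device counts and the round-robin placement are made up for illustration) shows why routing creates load imbalance across GPUs:

```python
import numpy as np

n_experts, n_devices, n_tokens = 8, 4, 1000
expert_to_device = np.arange(n_experts) % n_devices    # 2 experts per "GPU"

rng = np.random.default_rng(0)
token_expert = rng.integers(0, n_experts, size=n_tokens)   # router's top-1 choice

# The all-to-all: each device receives the tokens destined for its experts.
recv_counts = np.bincount(expert_to_device[token_expert], minlength=n_devices)
print(recv_counts)   # uneven counts mean some GPUs wait on others (stragglers)
```

Even with random routing the per-device counts fluctuate; with a collapsed router the imbalance becomes severe, which is another reason the load-balancing losses above matter in practice.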

Operational Gotchas

One nuance practitioners run into: MoE models can behave unpredictably during fine-tuning. If you fine-tune on a narrow domain, the router might start funneling all tokens to a small subset of experts, effectively wasting the capacity of the rest. Some teams freeze the router during fine-tuning; others add extra regularization. Quantization is another gotcha — experts that activate rarely have fewer calibration samples, so naive post-training quantization can degrade them disproportionately. The field is actively working through these operational challenges, but MoE is clearly the direction things are heading. Grok, DBRX, Mixtral, and almost certainly GPT-4 all use it, and the efficiency argument only gets stronger as model sizes grow.
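Freezing the router usually amounts to filtering parameters by name. A framework-agnostic sketch (the keyword list is illustrative, and the `requires_grad` attribute mirrors PyTorch conventions; inspect your checkpoint's actual parameter names before relying on substring matches):

```python
def freeze_router_params(named_params, keywords=("gate", "router")):
    """Disable gradients for routing parameters, matched by name substring.

    named_params: iterable of (name, param) pairs, where each param has a
    mutable `requires_grad` attribute (as in PyTorch's named_parameters()).
    Returns the names that were frozen, for logging/sanity-checking.
    """
    frozen = []
    for name, param in named_params:
        if any(kw in name for kw in keywords):
            param.requires_grad = False   # torch equivalent: param.requires_grad_(False)
            frozen.append(name)
    return frozen
```

Logging the returned list is worth the extra line: a silently empty match (because the checkpoint uses a different naming scheme) means the router trains anyway and the mitigation does nothing.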
