
Mixture of Experts

Also known as: MoE
An architecture in which a model contains multiple "expert" subnetworks but activates only a few of them per input. A router network decides which experts are relevant for a given token. This means a model's total parameter count can exceed 100B while any single forward pass uses only around 20B.

Why It Matters

MoE is how models like Mixtral and (reportedly) GPT-4 get the quality of a huge model at the speed of a smaller one. The trade-off is higher memory usage (all experts must be loaded) even though compute is cheaper.

Deep Dive

In a standard Transformer, every token passes through the same feedforward network (FFN) in each layer. In an MoE Transformer, that single FFN is replaced with multiple parallel FFNs — the "experts" — plus a small routing network (often called a gate) that decides which experts process each token. Typically, the gate selects the top-k experts (usually 2) and blends their outputs using the gate's softmax weights. The key insight is that total parameter count can be massive (giving the model enormous capacity to memorize and generalize), while per-token compute stays manageable because most experts are idle for any given input. Mixtral 8x7B, for example, has roughly 47B total parameters but activates only about 13B per token.
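The top-k selection and softmax blending described above can be sketched in a few lines of plain Python. Expert networks are reduced to simple functions here, and all names and values are illustrative, not any particular model's implementation:

```python
import math

def top2_gate(logits, k=2):
    """Toy top-k gate: pick the k highest-scoring experts for one token
    and renormalize their scores with a softmax over just those k."""
    # Indices of the k largest gate logits.
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over only the selected experts' logits.
    exps = [math.exp(logits[i]) for i in topk]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(topk, exps)]

def moe_layer(x, experts, gate_logits):
    """Blend the selected experts' outputs, weighted by the gate."""
    return sum(w * experts[i](x) for i, w in top2_gate(gate_logits))
```

With three toy "experts" that just scale their input, `moe_layer(10.0, [lambda x: x, lambda x: 2*x, lambda x: 3*x], [0.1, 2.0, 1.0])` routes to experts 1 and 2 and blends their outputs; the other expert does no work at all, which is the whole point.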

The Routing Problem

The routing mechanism is where most of the engineering complexity lives. A naive router might send all tokens to the same few experts, leaving others unused — a problem called expert collapse. To prevent this, MoE models use auxiliary load-balancing losses that penalize uneven expert utilization during training. The original Switch Transformer from Google used top-1 routing (one expert per token) and achieved impressive scaling, but most modern MoE models prefer top-2 routing for stability. Some newer approaches like DeepSeekMoE add shared experts that always activate alongside the routed ones, ensuring a baseline level of processing for every token regardless of routing decisions.
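A minimal sketch of such a load-balancing loss, following the Switch Transformer's formulation (N · Σᵢ fᵢ·Pᵢ, where fᵢ is the fraction of tokens whose top-1 expert is i and Pᵢ is the mean gate probability for expert i). Pure Python, illustrative only:

```python
import math

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]

def load_balance_loss(all_logits, num_experts):
    """Auxiliary loss N * sum_i f_i * P_i over a batch of per-token
    gate logits. It is minimized when routing is spread evenly, and
    grows as tokens pile onto a few experts."""
    probs = [softmax(logits) for logits in all_logits]
    T = len(all_logits)
    f = [0.0] * num_experts  # fraction of tokens argmax-routed to expert i
    P = [0.0] * num_experts  # mean gate probability for expert i
    for p in probs:
        f[max(range(num_experts), key=lambda i: p[i])] += 1.0 / T
        for i in range(num_experts):
            P[i] += p[i] / T
    return num_experts * sum(fi * Pi for fi, Pi in zip(f, P))
```

Comparing a balanced batch against a collapsed one (every token routed to expert 0) shows the collapsed batch incurring a strictly higher loss, which is the gradient signal that keeps the router spreading load.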

Memory Versus Compute

The trade-off that defines MoE deployment is memory versus compute. Even though only a fraction of experts are active per token, all of them must be loaded into memory. An 8x7B MoE model needs roughly the same memory as a dense 47B model, even though it runs at roughly the speed of a 13B dense model. This makes MoE models awkward for consumer hardware — if you can only fit 13B parameters in your GPU VRAM, you would get the same inference speed from a dense 13B model without the MoE overhead. MoE really shines when you have enough memory to hold the full model and want maximum quality per FLOP. This is why it is a natural fit for cloud serving: providers like OpenAI and Mistral can provision enough memory on their clusters, and per-request compute cost is what drives their margins.
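The sizing arithmetic behind that trade-off is simple: memory scales with total parameters, per-token compute with active ones. The split below between shared and per-expert parameters is an assumption chosen so the totals come out near Mixtral 8x7B's reported ~47B/~13B figures, not an official breakdown:

```python
def moe_footprint(n_experts, expert_params_b, shared_params_b, top_k):
    """Rough MoE sizing (billions of parameters, illustrative numbers):
    memory is driven by the total, latency by the active subset."""
    total = shared_params_b + n_experts * expert_params_b   # must fit in VRAM
    active = shared_params_b + top_k * expert_params_b      # runs per token
    return total, active

# Assumed split: ~5.7B per expert FFN, ~1.4B shared (attention, embeddings).
total, active = moe_footprint(n_experts=8, expert_params_b=5.7,
                              shared_params_b=1.4, top_k=2)
```

With these assumed figures, `total` lands at 47.0B and `active` at 12.8B: you pay memory for all eight experts but compute for only two.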

Expert parallelism is a deployment pattern specific to MoE. In a multi-GPU setup, you can place different experts on different GPUs so each device only holds and computes a subset of experts. Tokens get routed across GPUs based on which experts they need, processed, and results are gathered back. This introduces all-to-all communication overhead but allows models far too large for a single device to run efficiently. Google's GShard and Switch Transformer papers demonstrated this at scale, and it is how the largest MoE models are served in production today.
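The dispatch step can be modeled as a simple bucketing of tokens by host device. Real systems implement this with an all-to-all collective across GPUs; the function and variable names here are purely illustrative:

```python
def dispatch(tokens, expert_assignments, expert_to_gpu):
    """Toy model of the all-to-all dispatch in expert parallelism:
    group each token under the GPU that hosts its assigned expert.
    The return step (gathering results back) would be the inverse."""
    buckets = {}
    for tok, expert in zip(tokens, expert_assignments):
        buckets.setdefault(expert_to_gpu[expert], []).append((tok, expert))
    return buckets

# Four experts split across two GPUs; three tokens routed by a gate.
placement = {0: 0, 1: 0, 2: 1, 3: 1}
buckets = dispatch(["t0", "t1", "t2"], [0, 3, 0], placement)
```

Here GPU 0 receives the two tokens bound for expert 0 while GPU 1 receives the token bound for expert 3 — each device only ever sees the tokens its resident experts need.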

Operational Gotchas

One nuance practitioners run into: MoE models can behave unpredictably during fine-tuning. If you fine-tune on a narrow domain, the router might start funneling all tokens to a small subset of experts, effectively wasting the capacity of the rest. Some teams freeze the router during fine-tuning; others add extra regularization. Quantization is another gotcha — experts that activate rarely have fewer calibration samples, so naive post-training quantization can degrade them disproportionately. The field is actively working through these operational challenges, but MoE is clearly the direction things are heading. Grok, DBRX, Mixtral, and almost certainly GPT-4 all use it, and the efficiency argument only gets stronger as model sizes grow.
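The router-freezing mitigation can be sketched abstractly: select parameters by name and mark them non-trainable. A plain dict stands in for a real framework's parameter objects here (in PyTorch you would set `requires_grad = False` on the matching tensors); the keyword list and parameter names are assumptions — match them to your model's actual naming:

```python
def freeze_router(named_params, keywords=("gate", "router")):
    """Return an updated name -> trainable map with any parameter whose
    name mentions the router/gate marked frozen, leaving experts and
    the rest of the model trainable for fine-tuning."""
    return {
        name: trainable and not any(k in name for k in keywords)
        for name, trainable in named_params.items()
    }
```

Usage: pass the full parameter map before building the optimizer, e.g. `freeze_router({"blocks.0.moe.gate.weight": True, "blocks.0.moe.experts.0.w1": True})` leaves only the expert weight trainable.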
