
Mixture of Experts

Also known as: MoE
An architecture in which the model contains multiple "expert" sub-networks but activates only a few of them for each input. A router network decides which experts are relevant for a given token. This means a model can have 100B+ total parameters yet use only around 20B for any single forward pass.

Why it matters

MoE is how models like Mixtral and (reportedly) GPT-4 get the quality of a huge model at the speed of a smaller one. The trade-off is higher memory usage (all experts must be loaded) even though the compute per token is cheaper.

Deep Dive

In a standard Transformer, every token passes through the same feedforward network (FFN) in each layer. In an MoE Transformer, that single FFN is replaced with multiple parallel FFNs — the "experts" — plus a small routing network (often called a gate) that decides which experts process each token. Typically, the gate selects the top-k experts (usually 2) and blends their outputs using the gate's softmax weights. The key insight is that total parameter count can be massive (giving the model enormous capacity to memorize and generalize), while per-token compute stays manageable because most experts are idle for any given input. Mixtral 8x7B, for example, has roughly 47B total parameters but activates only about 13B per token.
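The top-k gating described above can be sketched in a few lines. This is a toy single-token, single-layer version with one-matrix "experts" (real FFN experts have two layers and a nonlinearity); all shapes and names here are illustrative, not any particular model's implementation.

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, k=2):
    """Route one token through the top-k of n experts.

    x: (d,) token activation; gate_w: (d, n) router weights;
    expert_ws: list of n (d, d) toy expert matrices.
    """
    logits = x @ gate_w                       # (n,) router scores
    top = np.argsort(logits)[-k:]             # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over the selected k only
    # Blend the chosen experts' outputs; the other n-k experts do no work.
    return sum(w * (x @ expert_ws[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n = 8, 4
y = moe_forward(rng.normal(size=d), rng.normal(size=(d, n)),
                [rng.normal(size=(d, d)) for _ in range(n)])
print(y.shape)  # (8,)
```

Note that the softmax is taken over only the k selected experts, so their blend weights sum to 1 while the remaining experts contribute nothing.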

The Routing Problem

The routing mechanism is where most of the engineering complexity lives. A naive router might send all tokens to the same few experts, leaving others unused — a problem called expert collapse. To prevent this, MoE models use auxiliary load-balancing losses that penalize uneven expert utilization during training. The original Switch Transformer from Google used top-1 routing (one expert per token) and achieved impressive scaling, but most modern MoE models prefer top-2 routing for stability. Some newer approaches like DeepSeekMoE add shared experts that always activate alongside the routed ones, ensuring a baseline level of processing for every token regardless of routing decisions.
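The Switch Transformer's auxiliary loss can be written down concretely. The sketch below follows the paper's formulation, loss = n * sum_i(f_i * P_i), where f_i is the fraction of tokens whose top-1 pick is expert i and P_i is the mean router probability for expert i; it is minimized near 1 when routing is uniform and grows as experts collapse. The function name and shapes are illustrative.

```python
import numpy as np

def load_balancing_loss(router_logits):
    """Switch-Transformer-style auxiliary load-balancing loss (a sketch).

    router_logits: (tokens, n_experts) raw router scores.
    """
    # Numerically stable softmax per token.
    probs = np.exp(router_logits - router_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    n = probs.shape[1]
    top1 = probs.argmax(axis=1)
    f = np.bincount(top1, minlength=n) / len(top1)   # fraction of tokens per expert
    p = probs.mean(axis=0)                           # mean gate probability per expert
    return n * float((f * p).sum())

rng = np.random.default_rng(0)
balanced = load_balancing_loss(rng.normal(size=(1000, 4)))       # ~1.0
collapsed = load_balancing_loss(np.tile([10., 0, 0, 0], (1000, 1)))  # ~4.0
print(balanced, collapsed)
```

Adding this term (scaled by a small coefficient) to the training loss penalizes the router whenever it concentrates traffic on a few experts.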

Memory Versus Compute

The trade-off that defines MoE deployment is memory versus compute. Even though only a fraction of experts are active per token, all of them must be loaded into memory. An 8x7B MoE model needs roughly the same memory as a dense 47B model, even though it runs at roughly the speed of a 13B dense model. This makes MoE models awkward for consumer hardware — if you can only fit 13B parameters in your GPU VRAM, you would get the same inference speed from a dense 13B model without the MoE overhead. MoE really shines when you have enough memory to hold the full model and want maximum quality per FLOP. This is why it is a natural fit for cloud serving: providers like OpenAI and Mistral can provision enough memory on their clusters, and per-request compute cost is what drives their margins.
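The asymmetry is easy to quantify. A back-of-envelope calculation, using the ~47B total / ~13B active figures cited above and the common approximation of ~2 FLOPs per active parameter per token:

```python
# Memory vs. compute for an 8x7B-style MoE (fp16/bf16, 2 bytes per parameter).
# Figures are approximate, matching the ~47B total / ~13B active cited above.
total_params  = 47e9   # every expert must sit in memory
active_params = 13e9   # parameters actually touched per token (top-2 of 8)

bytes_per_param = 2
memory_gb = total_params * bytes_per_param / 1e9
flops_per_token = 2 * active_params   # ~2 FLOPs per active param per token

print(f"weights in memory: ~{memory_gb:.0f} GB")        # ~94 GB
print(f"compute per token: ~{flops_per_token/1e9:.0f} GFLOPs")  # ~26 GFLOPs
```

So the model pays the memory bill of a 47B dense model but only the compute bill of a ~13B one, which is exactly the regime where well-provisioned cloud serving wins.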

Expert parallelism is a deployment pattern specific to MoE. In a multi-GPU setup, you can place different experts on different GPUs so each device only holds and computes a subset of experts. Tokens get routed across GPUs based on which experts they need, processed, and results are gathered back. This introduces all-to-all communication overhead but allows models far too large for a single device to run efficiently. Google's GShard and Switch Transformer papers demonstrated this at scale, and it is how the largest MoE models are served in production today.
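The dispatch phase of expert parallelism amounts to grouping tokens by the device that hosts their chosen expert. The sketch below shows only that bookkeeping step, under the hypothetical layout where expert e lives on device e // experts_per_device; in a real multi-GPU system this shuffle is an all-to-all collective, not a Python dict.

```python
def dispatch(expert_ids, n_devices, experts_per_device):
    """Group token indices by destination device (the 'dispatch' phase).

    expert_ids: chosen expert per token (top-1 routing for simplicity).
    Returns {device: [token indices]} — the grouping that a real system
    realizes as an all-to-all communication across GPUs.
    """
    buckets = {d: [] for d in range(n_devices)}
    for i, e in enumerate(expert_ids):
        buckets[e // experts_per_device].append(i)
    return buckets

# 8 experts spread across 4 devices, 2 experts each.
buckets = dispatch([0, 3, 5, 1, 7, 2], n_devices=4, experts_per_device=2)
print(buckets)  # {0: [0, 3], 1: [1, 5], 2: [2], 3: [4]}
```

After each device processes its bucket, a second all-to-all gathers the outputs back into the original token order.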

Operational Gotchas

One nuance practitioners run into: MoE models can behave unpredictably during fine-tuning. If you fine-tune on a narrow domain, the router might start funneling all tokens to a small subset of experts, effectively wasting the capacity of the rest. Some teams freeze the router during fine-tuning; others add extra regularization. Quantization is another gotcha — experts that activate rarely have fewer calibration samples, so naive post-training quantization can degrade them disproportionately. The field is actively working through these operational challenges, but MoE is clearly the direction things are heading. Grok, DBRX, Mixtral, and almost certainly GPT-4 all use it, and the efficiency argument only gets stronger as model sizes grow.
