An architecture where the model contains multiple "expert" sub-networks, but only activates a few of them for each input. A router network decides which experts are relevant for a given token. This means a model can have 100B+ total parameters but only use 20B for any single forward pass.
Why it matters
MoE is how models like Mixtral and (reportedly) GPT-4 get the quality of a huge model with the speed of a smaller one. The trade-off is higher memory usage: all experts must be loaded into memory, even though only a few of them run per token, so per-token computation stays cheap.
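The routing idea above can be sketched in a few lines. This is a minimal illustration, not any real model's implementation: the dimensions, the single-matrix "experts", and the plain softmax-over-top-k router are all toy assumptions chosen to show why only a fraction of the parameters are touched per token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is reduced to a single weight matrix; real experts are
# full feed-forward blocks.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts))  # router weights

def moe_forward(x):
    """Route one token vector x to its top-k experts and mix their outputs."""
    logits = x @ router                    # one relevance score per expert
    top = np.argsort(logits)[-top_k:]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts only
    # Only top_k of n_experts actually run: all parameters are resident in
    # memory, but compute per token is ~top_k/n_experts of a dense layer.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
out = moe_forward(token)
print(out.shape)
```

Here 2 of 4 experts fire per token, so half the expert parameters sit idle for this forward pass while still occupying memory, which is exactly the trade-off described above, scaled down from 100B+/20B to a toy size.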