An architecture where the model contains multiple "expert" sub-networks, but only activates a few of them for each input. A router network decides which experts are relevant for a given token. This means a model can have 100B+ total parameters but only use 20B for any single forward pass.
Why it matters
MoE is how models like Mixtral and (reportedly) GPT-4 get the quality of a huge model with the speed of a smaller one. The trade-off is higher memory usage: all experts must be loaded into memory, even though only a few of them run per token, so per-token computation stays cheap.
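The routing idea above can be sketched in a few lines. This is a minimal illustration, not any real model's implementation: the dimensions, the single-matrix "experts", and the plain softmax-over-top-k router are all toy assumptions chosen to show why only a fraction of the parameters are touched per token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is reduced to a single weight matrix; real experts are
# full feed-forward blocks.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts))  # router weights

def moe_forward(x):
    """Route one token vector x to its top-k experts and mix their outputs."""
    logits = x @ router                    # one relevance score per expert
    top = np.argsort(logits)[-top_k:]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts only
    # Only top_k of n_experts actually run: all parameters are resident in
    # memory, but compute per token is ~top_k/n_experts of a dense layer.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
out = moe_forward(token)
print(out.shape)
```

Here 2 of 4 experts fire per token, so half the expert parameters sit idle for this forward pass while still occupying memory, which is exactly the trade-off described above, scaled down from 100B+/20B to a toy size.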