MoE Sharding: Parallelism Strategies for Mixture-of-Experts Models
In Post I-00, we listed five ways that LLM inference differs from conventional model serving. The first four -- variable-length computation, two-phase resource profiles, growing memory requirements, and cache-aware routing -- have each received a full post in this track. The fifth was stated in a single sentence: "Some modern LLMs use a mixture-of-experts architecture, where different parts of the model activate for different inputs.…