#MoE - Tag | OptiVerse Technology Ltd.

A token walks past a panel of expert chefs; a small router consults a list and lights up just two experts, while the others stand idle for this dish.

MoE 分片：混合專家模型的平行化策略

在 I-00 篇中，我們列出了 LLM 推論與傳統模型服務的五項差異。前四項——可變長度運算、兩階段資源特徵、不斷增長的記憶體需求，以及快取感知路由——每一項都已經在本系列的專文中討論過了。第五項則只用一句話帶過：「部分現代 LLM 使用混合專家模型（MoE, mixture of experts）架構，不同的輸入會啟動模型的不同部分。將 MoE 模型分散部署到多張 GPU 上，所需的分片（sharding）策略與密集模型（dense model）截然不同。」

#MoE #Expert Parallelism #GPU Memory Management +1 更多

Huang Tzu Lin Apr 4, 2026

Inference Infrastructure

MoE Sharding: Parallelism Strategies for Mixture-of-Experts Models

In Post I-00, we listed five ways that LLM inference differs from conventional model serving. The first four -- variable-length computation, two-phase resource profiles, growing memory requirements, and cache-aware routing -- have each received a full post in this track. The fifth was stated in a...

#MoE #Expert Parallelism #GPU Memory Management +1 more

Huang Tzu Lin Apr 4, 2026