#GPU Memory Management

4 posts

A token walks past a panel of expert chefs; a small router consults a list and lights up just two experts, while the others stand idle for this dish.

MoE Sharding: Parallelism Strategies for Mixture-of-Experts Models

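In Post I-00, we listed five ways that LLM inference differs from conventional model serving. The first four -- variable-length computation, two-phase resource profiles, growing memory requirements, and cache-aware routing -- have each received a full post in this track. The fifth was stated in a single sentence: "Some modern LLMs use a mixture-of-experts (MoE) architecture, in which different inputs activate different parts of the model. Distributing an MoE model across multiple GPUs requires sharding strategies that differ sharply from those used for dense models."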

#MoE #Expert Parallelism #GPU Memory Management +1 more

Huang Tzu Lin Apr 4, 2026

GPU memory room divided into many small numbered shelves; one request's receipts spread across non-contiguous shelves with a directory pointing to each slot.

Inference Infrastructure

Paged KV Cache: GPU Memory Management for LLM Serving

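In Post I-00, we traced a single API call through the inference pipeline and introduced the KV cache (key-value cache): the data structure that stores attention key-value vectors so the model does not recompute them at every decode step. The KV cache grows with every generated token, and it must reside in GPU memory for the entire lifetime of the request. In Post I-01, we introduced continuous batching: it schedules at the iteration level instead of waiting for the slowest request in a batch to finish, which keeps more requests active at the same time.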

#PagedAttention #KV Cache #GPU Memory Management

Huang Tzu Lin Apr 4, 2026

Inference Infrastructure

MoE Sharding: Parallelism Strategies for Mixture-of-Experts Models

In Post I-00, we listed five ways that LLM inference differs from conventional model serving. The first four -- variable-length computation, two-phase resource profiles, growing memory requirements, and cache-aware routing -- have each received a full post in this track. The fifth was stated in a...

#MoE #Expert Parallelism #GPU Memory Management +1 more

Huang Tzu Lin Apr 4, 2026

Inference Infrastructure

Paged KV Cache: GPU Memory Management for LLM Serving

In Post I-00, we traced a single API call through the inference pipeline and introduced the KV cache: the data structure that stores attention key-value vectors so the model does not recompute them at every decode step. The KV cache grows with every generated token, and it must reside in GPU memo...

#PagedAttention #KV Cache #GPU Memory Management

Huang Tzu Lin Apr 4, 2026