Inference Infrastructure

What happens after you call the LLM API. Covers continuous batching, paged KV cache management, prefill-decode disaggregation, prefix-aware routing, and mixture-of-experts sharding -- the serving-layer optimizations that determine latency, throughput, and cost.

12 posts

A token walks past a panel of expert chefs; a small router consults a list and lights up just two experts, while the others stand idle for this dish.

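MoE Sharding: Parallelism Strategies for Mixture-of-Experts Models

In Post I-00, we listed five ways that LLM inference differs from conventional model serving. The first four -- variable-length computation, two-phase resource profiles, growing memory requirements, and cache-aware routing -- have each received a full post in this track. The fifth was stated in a single sentence: "Some modern LLMs use a mixture-of-experts (MoE) architecture, in which different inputs activate different parts of the model. Distributing an MoE model across multiple GPUs calls for a sharding strategy quite unlike the one used for a dense model."

#MoE #Expert Parallelism #GPU Memory Management +1 more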

Huang Tzu Lin Apr 4, 2026

Traffic dispatcher sends requests sharing the same long prompt prefix to one GPU that already remembers it, while a round-robin lane scatters identical prefixes to cold GPUs.

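Prefix-Aware Routing: Cache-Conscious Request Distribution

In Post I-02, we saw that PagedAttention enables different requests to share physical KV cache blocks on the same model replica. Two requests with the same system prompt can point to the same physical blocks rather than storing duplicate copies. That sharing mechanism is real and it works -- but only if both requests land on the same replica.

#Prefix Caching #Request Routing #KV Cache +1 more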

Huang Tzu Lin Apr 4, 2026

Two GPU kitchens side by side: a high-throughput prep station doing bulk chopping and a fast-fire line plating one dish at a time, exchanging prepared trays through a corridor.

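Prefill-Decode Disaggregation: Splitting the Two Stages of Inference

Post I-00 established that LLM inference has two phases with fundamentally different resource profiles. Prefill processes all input tokens in parallel and is compute-bound -- the GPU's arithmetic units are the bottleneck. Decode generates tokens one at a time and is memory-bandwidth-bound -- the bottleneck is how fast model weights and the KV cache can be read from GPU memory. Post I-01 introduced continuous batching, which keeps the GPU saturated by scheduling at the iteration level rather than the batch level. Post I-02 showed how PagedAttention eliminates memory waste so that more requests can be active at once.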

#Prefill-Decode #Disaggregation #Latency & Throughput

Huang Tzu Lin Apr 4, 2026

GPU memory room divided into many small numbered shelves; one request's receipts spread across non-contiguous shelves with a directory pointing to each slot.

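Paged KV Cache: GPU Memory Management for LLM Serving

In Post I-00, we traced a single API call through the inference pipeline and introduced the KV cache: the data structure that stores attention key-value vectors so the model does not recompute them at every decode step. The KV cache grows with every generated token, and it must reside in GPU memory for the lifetime of the request. In Post I-01, we met continuous batching: by scheduling at the iteration level instead of waiting for the slowest request in a batch to finish, it keeps more requests in flight at once.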

#PagedAttention #KV Cache #GPU Memory Management

Huang Tzu Lin Apr 4, 2026

GPU kitchen grill with multiple steaks of different sizes cooking simultaneously: short orders finishing fast, long orders still going, new orders sliding into the gaps.

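Continuous Batching: Serving Many Requests on One GPU

Post I-00 traced a single request through the inference pipeline: prefill processed all input tokens in parallel, decode generated output tokens one at a time, and the KV cache grew with every step. At the end of that trace, we noted that 49 other agents were submitting queries at roughly the same time. That observation was not a throwaway remark -- it points directly at the core operational problem of inference serving.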

#Continuous Batching #Latency & Throughput #LLM Serving

Huang Tzu Lin Apr 4, 2026

Hotel-search query routed from a phone through a server cluster into a GPU kitchen where chefs read tokens, plate a response, and stream it back.

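What Happens After You Call the API

You have built a travel copilot. A user types a query, your application sends it to an LLM provider's API, and a few seconds later a response streams back. From the application developer's perspective, that is one function call. From the infrastructure's perspective, that function call triggers an entire pipeline involving distinct compute phases, specialized memory structures, scheduling decisions, and hardware constraints. Together, these factors determine the latency, throughput, and cost the user actually experiences.

#Inference Pipeline #LLM Serving #KV Cache +1 more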

Huang Tzu Lin Apr 4, 2026

MoE Sharding: Parallelism Strategies for Mixture-of-Experts Models

In Post I-00, we listed five ways that LLM inference differs from conventional model serving. The first four -- variable-length computation, two-phase resource profiles, growing memory requirements, and cache-aware routing -- have each received a full post in this track. The fifth was stated in a...

#MoE #Expert Parallelism #GPU Memory Management +1 more

Huang Tzu Lin Apr 4, 2026

Prefix-Aware Routing: Cache-Conscious Request Distribution

In Post I-02, we saw that PagedAttention enables different requests to share physical KV cache blocks on the same replica. Two requests with the same system prompt can point to the same physical blocks rather than storing duplicate copies. That sharing mechanism is real and it works -- but only i...

#Prefix Caching #Request Routing #KV Cache +1 more

Huang Tzu Lin Apr 4, 2026

Prefill-Decode Disaggregation: Splitting the Two Stages of Inference

Post I-00 established that LLM inference has two phases with fundamentally different resource profiles. Prefill processes all input tokens in parallel and is compute-bound -- the GPU's arithmetic units are the bottleneck. Decode generates tokens one at a time and is memory-bandwidth-bound -- the ...

#Prefill-Decode #Disaggregation #Latency & Throughput

Huang Tzu Lin Apr 4, 2026

Paged KV Cache: GPU Memory Management for LLM Serving

In Post I-00, we traced a single API call through the inference pipeline and introduced the KV cache: the data structure that stores attention key-value vectors so the model does not recompute them at every decode step. The KV cache grows with every generated token, and it must reside in GPU memo...

#PagedAttention #KV Cache #GPU Memory Management

Huang Tzu Lin Apr 4, 2026

Continuous Batching: Serving Many Requests on One GPU

Post I-00 traced a single request through the inference pipeline: prefill processed all input tokens in parallel, decode generated output tokens one at a time, and the KV cache grew with every step. At the end of that trace, we noted that 49 other agents were submitting queries at roughly the sam...

#Continuous Batching #Latency & Throughput #LLM Serving

Huang Tzu Lin Apr 4, 2026

What Happens After You Call the API

You have built a travel copilot. A user types a query, your application sends it to an LLM provider's API, and a few seconds later a response streams back. From the application developer's perspective, that is one function call. From the infrastructure's perspective, that function call triggers a...

#Inference Pipeline #LLM Serving #KV Cache +1 more

Huang Tzu Lin Apr 4, 2026