#LLM Serving

6 posts

A token walks past a panel of expert chefs; a small router consults a list and lights up just two experts, while the others stand idle for this dish.

MoE Sharding: Parallelism Strategies for Mixture-of-Experts Models (MoE 分片：混合專家模型的平行化策略)

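In Post I-00, we listed five ways that LLM inference differs from conventional model serving. The first four -- variable-length computation, two-phase resource profiles, growing memory requirements, and cache-aware routing -- have each received a full post in this series. The fifth was covered in a single sentence: "Some modern LLMs use a mixture-of-experts (MoE) architecture, in which different inputs activate different parts of the model. Distributing an MoE model across multiple GPUs requires a sharding strategy entirely different from that of a dense model."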

#MoE #Expert Parallelism #GPU Memory Management +1 more

Huang Tzu Lin Apr 4, 2026

GPU kitchen grill with multiple steaks of different sizes cooking simultaneously: short orders finishing fast, long orders still going, new orders sliding into the gaps.

Inference Infrastructure

Continuous Batching: Serving Many Requests on One GPU (連續批次處理：用一張 GPU 服務大量請求)

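Post I-00 traced a single request through the inference pipeline: prefill processed all input tokens in parallel, decode generated output tokens one at a time, and the KV cache (key-value cache) grew with every step. At the end of that post, we noted that 49 other agents were submitting queries at roughly the same time. That observation was not a passing remark: it points directly at the core operational problem of inference serving.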

#Continuous Batching #Latency & Throughput #LLM Serving

Huang Tzu Lin Apr 4, 2026

Hotel-search query routed from a phone through a server cluster into a GPU kitchen where chefs read tokens, plate a response, and stream it back.

Inference Infrastructure

What Happens After You Call the API (呼叫 API 之後發生了什麼事)

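You have built a travel copilot. A user types a query, your application sends it to an LLM provider's API, and a few seconds later a response streams back. From the application developer's perspective, that is one function call. From the infrastructure's perspective, that one call triggers an entire pipeline involving distinct computation phases, specialized memory structures, scheduling decisions, and hardware constraints. Together, these factors determine the latency, throughput, and cost the user actually experiences.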

#Inference Pipeline #LLM Serving #KV Cache +1 more

Huang Tzu Lin Apr 4, 2026

Inference Infrastructure

MoE Sharding: Parallelism Strategies for Mixture-of-Experts Models

In Post I-00, we listed five ways that LLM inference differs from conventional model serving. The first four -- variable-length computation, two-phase resource profiles, growing memory requirements, and cache-aware routing -- have each received a full post in this track. The fifth was stated in a...

#MoE #Expert Parallelism #GPU Memory Management +1 more

Huang Tzu Lin Apr 4, 2026

Inference Infrastructure

Continuous Batching: Serving Many Requests on One GPU

Post I-00 traced a single request through the inference pipeline: prefill processed all input tokens in parallel, decode generated output tokens one at a time, and the KV cache grew with every step. At the end of that trace, we noted that 49 other agents were submitting queries at roughly the sam...

#Continuous Batching #Latency & Throughput #LLM Serving

Huang Tzu Lin Apr 4, 2026

Inference Infrastructure

What Happens After You Call the API

You have built a travel copilot. A user types a query, your application sends it to an LLM provider's API, and a few seconds later a response streams back. From the application developer's perspective, that is one function call. From the infrastructure's perspective, that function call triggers a...

#Inference Pipeline #LLM Serving #KV Cache +1 more

Huang Tzu Lin Apr 4, 2026