
Intelligent Systems Design Series

From LLM fundamentals to inference infrastructure — a practical guide to building and serving production AI systems.

Learning Paths

Reading roadmap

LLM Foundations

Follow the 5 posts in this series. Start with chapter 1 or jump to any chapter.

5 chapters, 1 phase

  1. 01 Foundations

    What Large Language Models Actually Do

    You type a sentence into an AI application. Seconds later, it returns several paragraphs of fluent, well-organized text that reads like it was written by a knowledgeable human. That experience is now routine. What is not routine — and what matters if you plan to build anything on top of these sys...

  2. 02 Foundations

    Prompts, Context Windows, and How You Talk to an LLM

    In the previous post, we sent a single line to an LLM — "Plan a trip to Helsinki" — and got back an itinerary full of specific-sounding details: restaurant names, transit directions, day-trip logistics. It was fluent and plausible, but several of those details turned out to be wrong. The model wa...

  3. 03 Foundations

    Why LLMs Need Help — Hallucinations, Grounding, and the Case for Systems

    Large language models produce fluent, confident text. That confidence is the problem. A model can sound authoritative about a property listing that no longer exists, a tax rate that changed last quarter, or a school rating from three years ago. It has no mechanism to check. It was not designed to...

  4. 04 Foundations

    AI Assistants, AI Agents, and Everything In Between

    A useful AI system is defined less by whether it is called an assistant or an agent than by how much control it has over the next step.

  5. 05 Foundations

    AI-Powered Customer Support — From Chatbot to Intelligent System

    Customer support is a useful capstone example because one message can require retrieval, tool use, memory, routing, and approval boundaries at the same time.

Reading roadmap

Building AI Systems

Follow the 7 posts in this series. Start with chapter 1 or jump to any chapter.

7 chapters, 2 phases

  1. 01 Architecture

    From Models to Compound AI Systems

    Most failures in real AI products do not come from the model suddenly becoming unintelligent. They come from asking a model to do work that actually belongs to a larger system: fetch the right data, interpret a messy document, check a schema, track state across steps, and show evidence for the an...

  2. 02 Architecture

    Reliable LLM Pipelines and Control Logic

    Useful AI systems usually fail for ordinary software reasons before they fail for exotic model reasons. A prototype looks impressive when a single prompt produces a plausible answer, but production systems do not consume plausibility. They consume records, decisions, and actions that need to be r...

  3. 03 Architecture

    Grounding with RAG: How AI Systems Retrieve Evidence Before They Answer

    Large language models are useful because they can synthesize, explain, and transform information in fluent language. They are unreliable when we ask them to know something current, something private, or something that needs verifiable support. A model may have seen similar material during trainin...

  4. 04 Control

    Memory, State, and Knowledge: Stop Calling Everything "Memory"

    A travel-planning copilot for a mid-size agency is asked a straightforward question: "Did this hotel fail an accessibility review before for wheelchair users?"

  5. 05 Control

    Assistants, Workflows, and Agents: Designing for the Right Level of Autonomy

    Agent has become one of the most overloaded terms in AI. Product teams use it to describe everything from a chat interface with retrieval to a long-running process that can plan, call tools, and take actions on its own. That vocabulary drift creates a practical problem: teams start arguing about ...

  6. 06 Control

    Agent Loops in Practice: ReAct, Tools, and Failure Modes

    Most engineering discussions about agents start too early with the word agent and too late with the operational loop. In practice, the important design question is simpler: once a model can take more than one step, how does the system decide what to do next, what tools it may call, what it is all...

  7. 07 Control

    When RAG Is Not Enough: CAG, Hybrid Retrieval, and Working Memory

    Basic retrieval-augmented generation, or RAG, is still the default grounding pattern for most production systems. If you have a large corpus, frequent updates, and a need to show where an answer came from, retrieval remains the cleanest starting point. But there is a practical limit case where si...

Reading roadmap

LLM Inference Infrastructure

Follow the 6 posts in this series. Start with chapter 1 or jump to any chapter.

6 chapters, 3 phases

  1. 01 Foundation

    What Happens After You Call the API

    You have built a travel copilot. A user types a query, your application sends it to an LLM provider's API, and a few seconds later a response streams back. From the application developer's perspective, that is one function call. From the infrastructure's perspective, that function call triggers a...

  2. 02 Core

    Continuous Batching: Serving Many Requests on One GPU

    Post I-00 traced a single request through the inference pipeline: prefill processed all input tokens in parallel, decode generated output tokens one at a time, and the KV cache grew with every step. At the end of that trace, we noted that 49 other agents were submitting queries at roughly the sam...

  3. 03 Core

    Paged KV Cache: GPU Memory Management for LLM Serving

    In Post I-00, we traced a single API call through the inference pipeline and introduced the KV cache: the data structure that stores attention key-value vectors so the model does not recompute them at every decode step. The KV cache grows with every generated token, and it must reside in GPU memo...

  4. 04 Core

    Prefill-Decode Disaggregation: Splitting the Two Stages of Inference

    Post I-00 established that LLM inference has two phases with fundamentally different resource profiles. Prefill processes all input tokens in parallel and is compute-bound -- the GPU's arithmetic units are the bottleneck. Decode generates tokens one at a time and is memory-bandwidth-bound -- the ...

  5. 05 Applied

    Prefix-Aware Routing: Cache-Conscious Request Distribution

    In Post I-02, we saw that PagedAttention enables different requests to share physical KV cache blocks on the same replica. Two requests with the same system prompt can point to the same physical blocks rather than storing duplicate copies. That sharing mechanism is real and it works -- but only i...

  6. 06 Applied

    MoE Sharding: Parallelism Strategies for Mixture-of-Experts Models

    In Post I-00, we listed five ways that LLM inference differs from conventional model serving. The first four -- variable-length computation, two-phase resource profiles, growing memory requirements, and cache-aware routing -- have each received a full post in this track. The fifth was stated in a...