Reading roadmap
LLM Foundations
Follow the 5 posts in this series. Start with chapter 1 or jump to any chapter.
From LLM fundamentals to inference infrastructure — a practical guide to building and serving production AI systems.
You type a sentence into an AI application. Seconds later, it returns several paragraphs of fluent, well-organized text that reads like it was written by a knowledgeable human. That experience is now routine. What is not routine — and what matters if you plan to build anything on top of these sys...
In the previous post, we sent a single line to an LLM — "Plan a trip to Helsinki" — and got back an itinerary full of specific-sounding details: restaurant names, transit directions, day-trip logistics. It was fluent and plausible, but several of those details turned out to be wrong. The model wa...
Large language models produce fluent, confident text. That confidence is the problem. A model can sound authoritative about a property listing that no longer exists, a tax rate that changed last quarter, or a school rating from three years ago. It has no mechanism to check. It was not designed to...
A useful AI system is defined less by whether it is called an assistant or an agent than by how much control it has over the next step.
Customer support is a useful capstone example because one message can require retrieval, tool use, memory, routing, and approval boundaries at the same time.
Follow the 7 posts in this series. Start with chapter 1 or jump to any chapter.
Most failures in real AI products do not come from the model suddenly becoming unintelligent. They come from asking a model to do work that actually belongs to a larger system: fetch the right data, interpret a messy document, check a schema, track state across steps, and show evidence for the an...
Useful AI systems usually fail for ordinary software reasons before they fail for exotic model reasons. A prototype looks impressive when a single prompt produces a plausible answer, but production systems do not consume plausibility. They consume records, decisions, and actions that need to be r...
Large language models are useful because they can synthesize, explain, and transform information in fluent language. They are unreliable when we ask them to know something current, something private, or something that needs verifiable support. A model may have seen similar material during trainin...
A travel-planning copilot for a mid-size agency is asked a straightforward question: "Did this hotel fail an accessibility review before for wheelchair users?"
Agent has become one of the most overloaded terms in AI. Product teams use it to describe everything from a chat interface with retrieval to a long-running process that can plan, call tools, and take actions on its own. That vocabulary drift creates a practical problem: teams start arguing about ...
Most engineering discussions about agents start too early with the word agent and too late with the operational loop. In practice, the important design question is simpler: once a model can take more than one step, how does the system decide what to do next, what tools it may call, what it is all...
Basic retrieval-augmented generation, or RAG, is still the default grounding pattern for most production systems. If you have a large corpus, frequent updates, and a need to show where an answer came from, retrieval remains the cleanest starting point. But there is a practical limit case where si...
Follow the 3 posts in this series. Start with chapter 1 or jump to any chapter.
Most teams first meet document processing through OCR. The problem seems straightforward: convert pages into text, index the text, and let retrieval or an LLM answer questions from it.
Text-only systems break as soon as the evidence stops being mostly text, which is exactly what happens in accessible travel planning when photos, floor plans, route maps, captions, and measurements all shape the answer.
By the time a team reaches an advanced AI travel copilot, the hard question is no longer "Which model should we use?" It is "What has to happen, in what order, with what evidence, with what state, and under whose approval before this system can be trusted in production?" That is an architectural ...
Follow the 6 posts in this series. Start with chapter 1 or jump to any chapter.
You have built a travel copilot. A user types a query, your application sends it to an LLM provider's API, and a few seconds later a response streams back. From the application developer's perspective, that is one function call. From the infrastructure's perspective, that function call triggers a...
Post I-00 traced a single request through the inference pipeline: prefill processed all input tokens in parallel, decode generated output tokens one at a time, and the KV cache grew with every step. At the end of that trace, we noted that 49 other agents were submitting queries at roughly the sam...
In Post I-00, we traced a single API call through the inference pipeline and introduced the KV cache: the data structure that stores attention key-value vectors so the model does not recompute them at every decode step. The KV cache grows with every generated token, and it must reside in GPU memo...
Post I-00 established that LLM inference has two phases with fundamentally different resource profiles. Prefill processes all input tokens in parallel and is compute-bound -- the GPU's arithmetic units are the bottleneck. Decode generates tokens one at a time and is memory-bandwidth-bound -- the ...
In Post I-02, we saw that PagedAttention enables different requests to share physical KV cache blocks on the same replica. Two requests with the same system prompt can point to the same physical blocks rather than storing duplicate copies. That sharing mechanism is real and it works -- but only i...
In Post I-00, we listed five ways that LLM inference differs from conventional model serving. The first four -- variable-length computation, two-phase resource profiles, growing memory requirements, and cache-aware routing -- have each received a full post in this track. The fifth was stated in a...