AI Assistants, AI Agents, and Everything In Between

A useful AI system is defined less by whether it is called an assistant or an agent than by how much control it has over the next step.

Consider a home buyer who sends a single message to a real estate research tool:

"Compare these three properties, check school zones for each, estimate monthly payments at current mortgage rates, flag any zoning issues, and recommend the best option for a family with two kids."

One request. Five distinct sub-tasks. Answering well requires multiple data lookups, a financial calculation, a comparison across several dimensions, and a judgment call that depends on the buyer's priorities. No single prompt-and-response cycle covers all of that. But a rigid, pre-scripted sequence might miss something important — like a pending rezone that changes the recommendation entirely.

This is the kind of request that forces a practical question: how much should the system do on its own?

The answer is not binary. It is not "assistant or agent." It is a spectrum of autonomy, and the right point on that spectrum depends on the task, the stakes, and the quality of the tools available. Understanding that spectrum is more useful than memorizing a label.

DiagramAssistant and agent are not a binary. The design question is who controls the next step and where human approval still blocks action.

The Same Task, Three Ways

The best way to see why the spectrum matters is to watch the same request handled at three different levels of system autonomy. The buyer's question stays identical. What changes is how much the system controls.

Version 1 — Assistive

The system waits for explicit direction at every step.

The buyer asks for a comparison. The system responds:

"Here are the three listings you selected. Would you like me to check school zones for the first property?"

The buyer says yes. The system looks up school zone data for property A and presents it. Then it waits again.

"Ready to check school zones for the second property?"

And so on. Each sub-task requires the buyer to approve, review, and direct the next step. The system never decides what to do next on its own. It fetches data when asked, presents results, and stops.

This is the assistive pattern. The human drives. The system helps. For simple, low-stakes lookups, this works fine. For a five-part comparison, it becomes tedious. The buyer has to manage the workflow mentally, remembering what has been checked, what remains, and how the pieces fit together.

Version 2 — Fixed Workflow

Now the system follows a predetermined pipeline. When the buyer submits the request, the system executes a fixed sequence:

For each of the three properties, retrieve the listing details.
For each property, look up the school zone.
For each property, calculate the estimated monthly payment.
Compare results across a standard set of dimensions.
Present a summary table.

This is more efficient. The buyer gets a structured comparison without manually driving each step. But the pipeline is rigid. It runs the same steps in the same order regardless of what it finds along the way.

Here is where the limitation shows. Suppose property B sits in a zone with a pending reclassification — a commercial rezone that could bring truck traffic to the residential street within two years. The fixed workflow has no step for checking zoning flags. It was not designed to handle surprises. The comparison table looks clean, but it is missing a material fact.

Version 3 — Bounded Agent

Now the system can plan. It receives the buyer's request, breaks it into sub-tasks, and begins executing — but within defined boundaries.

It retrieves all three listings. It checks school zones. While processing property B, it notices a flag in the county records: a pending rezone application. This was not in the original plan. The system decides to pull the zoning record, reads the application details, and discovers the reclassification could allow light commercial use on the adjacent parcel.

It adjusts the comparison. Property B's school zone is strong, but the zoning risk is material for a family prioritizing a stable residential environment. The system updates its analysis, adds the zoning finding, and presents the result — but it does not make the final recommendation unilaterally.

"Based on schools, monthly cost, and neighborhood stability, Property C appears to be the strongest match for a family with two kids. However, Property B has the highest-rated school zone. Note: Property B has a pending rezone application (case #2025-RZ-0482) that could allow light commercial use on the adjacent lot. I recommend reviewing the zoning details before deciding. Would you like me to pull the full application?"

The system planned, adapted, used tools it was not explicitly told to use for this step, and recognized when to escalate instead of act. That is bounded agency: autonomous enough to handle complexity, constrained enough to stop before high-stakes decisions.

The Autonomy Spectrum

Those three versions are not three different products. They are points on a spectrum. The useful framing is not "assistant versus agent" but five levels of increasing decision-making scope — how much latitude the system has to choose its own next step.

Level 1 — Fixed pipeline. The system follows a predetermined sequence of steps. The human defines the workflow in advance; the system executes it. Efficient for well-understood, repeatable tasks. Breaks when the task deviates from the expected pattern.

Level 2 — Routed workflow. The system can choose between predefined paths based on the input. A classifier or set of rules determines which pipeline to run. More flexible than a fixed pipeline, but the set of possible paths is still determined at design time.

Level 3 — Assistive. The system helps with retrieval, drafting, and analysis, but the user remains in charge of the sequence. The human controls each next move. Useful for research, analysis, and tasks where the user wants direct control over the process.

Level 4 — Bounded agent. The system can plan its own sequence of steps, select tools, and adapt when it encounters unexpected information. It operates within explicit boundaries: approved tool list, spending limits, escalation rules, and mandatory human approval for high-consequence actions.

Level 5 — Long-running autonomous system. The system pursues goals with minimal human oversight, making decisions and taking actions across extended time horizons. This level is largely aspirational for most real applications today. The gap between Level 4 and Level 5 is where most of the unsolved reliability and safety problems live.

The spectrum is not a maturity ladder. Level 5 is not inherently better than Level 2. A fixed pipeline that reliably processes mortgage applications is more valuable than an autonomous agent that occasionally submits wrong offers. The right level depends on task complexity, available tool quality, and the cost of errors.

Tool Use and APIs

At every level above Level 1, the system needs to interact with external services. In the real estate example, those services include an MLS listing database, a school district boundary service, a mortgage rate calculator, and a county zoning records system. The system does not contain this information internally. It reaches out to get it.

The mechanism for that interaction is an API — an application programming interface. An API is a structured way for one program to communicate with another: requesting data, sending instructions, or triggering actions. When the real estate system checks a school zone, it sends a structured request to the school district API with a property address and receives back a structured response containing the zone name, school ratings, and enrollment data.

Tool use in the context of AI systems means that a language model can decide when to call an external API and how to format the request. The model does not execute the API call itself — the surrounding system handles the actual network request. But the model determines that a tool call is needed, selects which tool to use, and specifies the inputs.

This is a critical capability, and it is also a critical constraint. Without the mortgage calculator tool, the system cannot estimate monthly payments — it can only guess based on patterns in its training data, which may reflect outdated rates. As of early 2026, average 30-year fixed mortgage rates in the United States are approximately 6-6.5%, but these fluctuate. A system that guesses "around 3%" because its training data skews toward 2020-2021 rates would produce seriously misleading estimates.

The principle generalizes: an AI system is only as reliable as the tools it can access. A bounded agent with high-quality, well-maintained tools can produce trustworthy results. The same agent architecture with broken or stale tools will produce confident-sounding errors. Tool quality is a system design responsibility, not something the model can compensate for.

Research on tool use in language models, including work by Schick et al. on Toolformer (2023), demonstrated that language models can be trained to recognize when their own outputs would be unreliable and to invoke external tools instead. This is not the model "wanting" to use a tool. It is a learned pattern: given this type of input, generating a tool call produces better downstream results than generating text directly.

Memory Is Not One Thing

When the buyer mentioned "two kids" ten messages ago, the system needs to retain that detail to make a relevant school zone comparison. When the system has checked schools for two of three properties and paused, it needs to know where it left off. When the system queries the MLS database, it is accessing an external knowledge store that exists independently of any conversation.

These are three different information responsibilities, and collapsing them into a single concept of "memory" creates confusion. For this post, the important distinction is between three types.

Conversation memory is information preserved across turns in a dialogue. The buyer's mention of "two kids" and "family" shapes the system's priorities for the rest of the session. If the system loses this context — because the conversation exceeded the model's context window, or because the session restarted — it might optimize for investment return instead of family suitability.

Task state is structured information about a workflow in progress. The system has retrieved listings, checked two of three school zones, and has not yet run the mortgage calculation. Task state tracks what has been done, what remains, and what intermediate results are available. This is closer to a progress tracker than to a memory in the colloquial sense.

Durable knowledge is information stored in external systems that outlives any single conversation or task. The MLS database, the school district records, the county zoning files — these are authoritative data stores. The system reads from them but should not be confused with them. The system does not "remember" that property B is zoned residential. It queries a database that contains that record.

These three types have different lifetimes, different update rules, and different authority levels. Conversation memory is ephemeral and user-specific. Task state is session-scoped and system-managed. Durable knowledge is persistent and governed by external processes. Designing memory is not about giving a system "better recall." It is about deciding which information belongs in which layer and how each layer is maintained.

A later post in the main series separates a fourth layer — working memory, the temporary context assembled for a single reasoning step — but for the conceptual framework here, these three are sufficient.

Not all outputs carry the same risk. A system that presents factual data is doing something fundamentally different from a system that submits a purchase offer. The most useful framework for reasoning about this is a three-level hierarchy of system actions.

Describe. The system presents information without judgment. "Property B is in the Lincoln Elementary school zone. Lincoln Elementary has a 7/10 rating from GreatSchools." The system is reporting what it found. The human interprets the significance.

Recommend. The system adds interpretation. "Based on school ratings, commute time, and monthly cost, Property C appears to be the best fit for your family. However, note the pending rezone on Property B's adjacent parcel." The system has made a judgment, but the human retains decision authority.

Act. The system takes an action with real-world consequences. "Submit an offer of $385,000 on Property C with standard contingencies." This changes the state of the world. It creates a legal obligation. It cannot be undone by rephrasing a prompt.

The boundary between these levels is the most important design decision in any system with real-world effects. Most production systems today operate in the describe-to-recommend range. Moving into the act range requires explicit human approval, and that approval boundary should be a first-class architectural component, not an afterthought.

This framework aligns with principles in the NIST AI Risk Management Framework (AI RMF 1.0, 2023), which emphasizes that AI systems operating with greater autonomy require proportionally stronger governance, transparency, and human oversight mechanisms. The describe/recommend/act hierarchy is one practical way to implement that proportionality.

The real estate agent in Version 3 above demonstrates this correctly. It describes the zoning finding, recommends Property C with a caveat, and stops before acting. If the system had submitted an offer without approval, it would have crossed an approval boundary that the design should not permit.

When NOT to Use an Agent

The autonomy spectrum is not a progression from worse to better. There are clear cases where less autonomy is the right design choice.

Simple lookups. If the user asks "What is the listing price for 42 Oak Street?", a single API call and a formatted response are sufficient. Running an agent loop to plan, reason, and reflect on a one-step task wastes computation and introduces unnecessary failure surface.

High-stakes actions with weak tools. If the only zoning data source is unreliable or out of date, giving a system the autonomy to adjust recommendations based on that data is dangerous. Weak tools plus high autonomy equals confident errors.

Tasks where a fixed pipeline works well. Generating a standard property comparison report from three MLS listings is a well-understood, repeatable task. A fixed pipeline will do it faster, cheaper, and more predictably than an agent that re-plans the approach each time.

Regulatory or compliance contexts. Some domains require that every step in a decision process be predetermined and auditable. An agent that dynamically chooses its own path may be architecturally incompatible with those requirements, regardless of how good its decisions are.

The general principle: use the lowest level of autonomy that solves the task reliably. Autonomy is a tool, not a goal.

Common Failure Modes

When systems do operate at higher autonomy levels, several failure patterns recur. Recognizing them helps in both design and debugging.

Wrong tool call. The system selects an inappropriate tool for the sub-task. It queries the school district API with a zoning question, or it sends a malformed request to the mortgage calculator. This is not a reasoning failure in the abstract — the system may have a sensible plan — but a mismatch between the plan and the available tools. Tool descriptions and input schemas need to be precise enough for the model to select correctly.

Looping. The system repeats the same action without making progress. It queries the MLS database, gets an unexpected response format, retries with the same query, gets the same response, and retries again. Without explicit loop detection or a maximum iteration count, this can continue indefinitely, consuming resources and producing no useful output.

Premature synthesis. The system generates a final answer before completing all sub-tasks. It checks school zones for two of three properties, then writes a comparison as if it had all three. The output looks complete but is missing data. This often happens when the model's training patterns favor generating fluent conclusions over tracking task completeness.

Missing escalation. The system encounters a situation that exceeds its competence or authority but does not flag it. It finds the pending rezone on property B but treats it as a minor footnote instead of a material finding that should change the recommendation structure. The system needed to escalate — to surface the finding prominently and ask the human to weigh in — but its design did not include that trigger.

These failure modes are not hypothetical edge cases. Industry experience with production agent systems consistently reports that tool selection errors, infinite loops, and premature termination are among the most common failure categories. The ReAct framework proposed by Yao et al. (2022) — which interleaves reasoning steps with action steps — was designed in part to address premature synthesis by forcing the model to explicitly reason about observations before generating the next action. But the framework does not eliminate these failures. It structures the system to make them more observable and debuggable.

What This Post Is Not

This post does not explain how agent loops work mechanically — how tool calls are formatted, how observations are fed back into the model, or how stop conditions are evaluated. Those implementation details matter, but they belong to a different level of explanation.

It does not cover advanced retrieval strategies, embedding models, or vector databases. Those were introduced in Post 0-3 and will be developed further in the main series.

It does not claim that agents "learn from experience" in the sense of updating their own parameters during use. When a bounded agent adjusts its plan after discovering the zoning flag, it is responding to new information within its current context. It is not retraining itself. The distinction matters: adapting within a session is a system behavior; learning across sessions is a training behavior. Conflating the two leads to inflated expectations about what deployed systems can do.

And it does not argue that more autonomy is inherently better. The entire point of the spectrum framework is that the right level of autonomy is a design decision, not a destination.

Where This Leads

You have seen how a real estate assistant can evolve from simple Q&A to a bounded agent that plans, adapts, and knows when to stop. The autonomy spectrum, tool use, memory layers, and approval boundaries are the conceptual vocabulary for reasoning about these systems.

But how do all these concepts look when they converge inside a real product? The next post examines a complete system where retrieval, tool use, memory, and approval boundaries work together — not as isolated features, but as an integrated architecture.

Key Terms Introduced in This Post

Term — Definition

Assistant (AI) — An interaction pattern where the system helps a user perform a task, typically responding to explicit instructions and waiting for direction between steps

Workflow — A predefined sequence of steps with control logic fixed in code; the model operates within bounded steps but does not choose the sequence

Agent (AI) — A system that can plan its own sequence of actions, select tools, and adapt to new information within defined boundaries

Autonomy spectrum — A five-level framework (fixed pipeline, routed workflow, assistive, bounded agent, long-running autonomous) for describing how much control a system has over its own execution

API — Application programming interface — a structured way for one program to communicate with another by requesting data, sending instructions, or triggering actions

Tool use — The capability of a language model to determine when to invoke an external service and how to format the request

Conversation memory — Information preserved across turns in a dialogue, such as user preferences or stated constraints

Task state — Structured information tracking a workflow in progress — what has been done, what remains, and what intermediate results are available

Durable knowledge — Information in external systems (databases, records) that persists independently of any conversation or task session

Approval boundary — The architectural decision point that determines whether a system can describe, recommend, or act — and where human approval is required

Describe / Recommend / Act — A three-level hierarchy of system outputs, ordered by increasing real-world consequence and decreasing human control

Sources and Further Reading

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." *arXiv preprint arXiv:2210.03629*. Introduced the framework for interleaving reasoning traces with tool actions in language model agents. arxiv.org/abs/2210.03629

Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." *NeurIPS 2023; arXiv:2302.04761*. Demonstrated that language models can learn to invoke external tools when their own outputs would be unreliable. arxiv.org/abs/2302.04761

National Institute of Standards and Technology. (2023). *AI Risk Management Framework (AI RMF 1.0)*. NIST AI 100-1. Provides governance principles for AI systems, including proportional oversight as autonomy increases. nist.gov/itl/ai-risk-management-framework

Zaharia, M., Khattab, O., Chen, L., Davis, J. Q., Miller, H., Potts, C., Zou, J., Carbin, M., Frankle, J., Rao, N., & Ghodsi, A. (2024). "The Shift from Models to Compound AI Systems." *Berkeley AI Research Blog*. Argues that state-of-the-art AI results increasingly come from multi-component systems rather than monolithic models. bair.berkeley.edu/blog/2024/02/18/compound-ai-systems

Kapoor, S., Stroebl, B., Siegel, Z. S., Nadgir, N., & Narayanan, A. (2024). "AI Agents That Matter." *Princeton University; arXiv:2407.01502*. Analyzes agent evaluation methodology, documenting how benchmarking shortcomings -- including cost-accuracy trade-offs, reproducibility problems, and overfitting to benchmarks -- obscure real-world agent performance. arxiv.org/abs/2407.01502

AI Assistants, AI Agents, and Everything In Between

LLM Foundations

Why LLMs Need Help — Hallucinations, Grounding, and the Case for Systems

AI-Powered Customer Support — From Chatbot to Intelligent System