May 31, 2026

My Understanding of Agent Architecture

How a Minimalist Engine and a Smart System Prompt Work Together to Power Modern AI Agents

Core Design Philosophy: Minimalist Engine, Intelligence in the Prompt

At its core, an Agent's engine only performs four tasks: serializing context into messages that the LLM can understand, calling the LLM to get a response, parsing tool calls within that response and executing the corresponding tools, and writing the tool results back into the context as tool_result. All the parts that "make it smart"—when to search, when to plan, when to confirm with the user, and how to organize the final output—are decided entirely by the system prompt and the LLM itself.
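
The four tasks fit in a very small loop. The following is a minimal sketch, not any product's actual engine: the LLM is a synchronous scripted stand-in (a real call would be async), and every name (Message, LLMTurn, runEngine, demoTools) is hypothetical. Stopping when there is no tool call is a simplification of the behavior described later.

```typescript
// Hypothetical types; a real LLM call would be async and go over the network.
type Message = { role: "user" | "assistant" | "tool"; content: string };
type ToolCall = { name: string; args: Record<string, unknown> };
type LLMTurn = { content: string; toolCall?: ToolCall };
type ToolOutput = { content: string; finalResult?: boolean };
type CallLLM = (messages: Message[]) => LLMTurn;
type Tool = (args: Record<string, unknown>) => ToolOutput;

function runEngine(
  callLLM: CallLLM,
  tools: Record<string, Tool>,
  context: Message[],
  maxIterations = 10,
): Message[] {
  for (let i = 0; i < maxIterations; i++) {
    // Tasks 1 and 2: hand the serialized context to the LLM and get a turn back.
    const turn = callLLM(context.slice());
    context.push({ role: "assistant", content: JSON.stringify(turn) });
    // Simplification: stop when there is no tool call.
    if (turn.toolCall === undefined) break;
    // Task 3: look up and execute the requested tool; the engine never picks one itself.
    const tool = tools[turn.toolCall.name];
    const output: ToolOutput = tool !== undefined
      ? tool(turn.toolCall.args)
      : { content: `unknown tool: ${turn.toolCall.name}` };
    // Task 4: write the result back so the next LLM call can see it.
    context.push({ role: "tool", content: output.content });
    if (output.finalResult === true) break;
  }
  return context;
}

// Usage with a scripted stand-in for the LLM.
let scriptedStep = 0;
const scriptedLLM: CallLLM = () =>
  scriptedStep++ === 0
    ? { content: "", toolCall: { name: "echo", args: { text: "hi" } } }
    : { content: "", toolCall: { name: "message", args: { type: "result", text: "done" } } };

const demoTools: Record<string, Tool> = {
  echo: (args) => ({ content: String(args["text"]) }),
  message: (args) => ({ content: String(args["text"]), finalResult: args["type"] === "result" }),
};

const finalContext = runEngine(scriptedLLM, demoTools, [{ role: "user", content: "say hi" }]);
console.log(finalContext.map((m) => m.role).join(" > "));
// prints: user > assistant > tool > assistant > tool
```

Nothing in runEngine decides which tool to call or when to stop; both decisions arrive from the LLM turn or from the tool's own return value.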

The engine does not make choices for the LLM. This is a common characteristic of products like Devin and Claude Code, and it is the biggest difference between them and the "complex chain" approach of early LangChain. LangChain attempted to orchestrate logic branches and conditional judgments at the framework level, resulting in an increasingly heavy system. In contrast, the new generation of Agents trusts the LLM's reasoning capabilities and leaves the engine responsible only for running the loop mechanically.

Architectural Layering

The entire system can be viewed in three layers.

The first layer is the Engine Layer. It is minimalist, mechanical, and predictable. Its sole responsibility is to maintain a task loop: calling the LLM, parsing tool calls, executing tools, and managing context. Things the engine absolutely does not do include: choosing tools for the LLM, rewriting query parameters for the LLM, "intelligently" deciding whether to retry, or sneaking extra instructions into the prompt (all injections are fully visible to the LLM). All engine retries and exception recoveries follow mechanical rules: exponential backoff for timeouts, backoff for 429 errors, and marking failure directly for structural errors. There are no heuristic judgments involved.
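
"Mechanical rules" here means the retry decision is a pure function of the error class and the attempt count. A minimal sketch, assuming hypothetical constants (1 s base delay, 10 s cap, five attempts) and a hypothetical decideRetry helper; the source does not specify these values:

```typescript
type ErrorKind = "timeout" | "rate_limit" | "server_error" | "structural";
type RetryDecision = { retry: true; delayMs: number } | { retry: false };

// Assumed values for illustration; tune per deployment.
const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 10000;
const MAX_ATTEMPTS = 5;

// `attempt` is the number of failed attempts so far, starting at 0.
function decideRetry(kind: ErrorKind, attempt: number): RetryDecision {
  // Structural errors (e.g. an unparseable response) are never retried.
  if (kind === "structural") return { retry: false };
  if (attempt >= MAX_ATTEMPTS) return { retry: false };
  // Exponential backoff: 1s, 2s, 4s, 8s, then capped.
  const delayMs = Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
  return { retry: true, delayMs };
}
```

There is no inspection of the prompt, the tool, or the task in this function, which is what keeps the engine predictable.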

The second layer is the Prompt Layer (System Prompt). This is where the intelligence is concentrated. The system prompt consists of several XML tag segments, each being a declarative "rule plus example" without any if-else logic. The LLM reads these rules in each dialogue turn and makes autonomous decisions. These segments include language settings (output language), output format specifications (Markdown preferred, no emojis), Agent loop descriptions (to internalize the task flow), tool usage specifications (including calling only one tool at a time), error handling strategies, current environment descriptions, pre-installed CLI tool lists, browser usage specifications, secret placeholder mechanisms, prompt injection security rules, user preference configurations, skill catalogs, and tool protocol descriptions. The key point is that each segment provides rules and examples, and the LLM decides how to act after reading them.

The third layer is the Tool Layer. It is the capability extension point for the engine to interact with the external world. Each tool has two faces: a ToolDefinition, containing the name, description, and parameter JSON Schema (this part is for the LLM); and a Handler Function, which receives parameters and context to return results (this part is called by the engine). Tools are generally divided into four categories:

Communication tools (e.g., message, plan): Stay within the engine's scope, writing to the context or updating state.
Information tools (e.g., search, browser, vision): Responsible for fetching external information.
Modification tools (e.g., file, shell, match): Change the state of the working environment.
Generation tools (e.g., slides, generate): Responsible for producing large content artifacts.

The Task Loop: The Heart of the Agent

The TaskLoop is the core of the entire system. When a user message arrives, the loop starts; it exits when the task is completed or interrupted. All steps—interacting with the LLM, calling tools, and processing results—operate within this loop.

The state machine has only six states:

INITIALIZING: The initialization phase, creating task records, loading historical dialogues, and building the system prompt.
THINKING: The phase waiting for the LLM's response, supporting streaming output.
EXECUTING_TOOL: The phase executing a specific tool handler.
FINALIZING: The phase generating a final summary and suggesting next steps.
DONE and FAILED: Final states.
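
The six states can be encoded as a union type with an explicit transition table. The state names come from the list above; the set of allowed transitions is an assumption inferred from the loop description, not something the source spells out.

```typescript
type TaskState =
  | "INITIALIZING"
  | "THINKING"
  | "EXECUTING_TOOL"
  | "FINALIZING"
  | "DONE"
  | "FAILED";

// Assumed transition table; terminal states have no outgoing edges.
const TRANSITIONS: Record<TaskState, readonly TaskState[]> = {
  INITIALIZING: ["THINKING", "FAILED"],
  THINKING: ["EXECUTING_TOOL", "FINALIZING", "FAILED"],
  EXECUTING_TOOL: ["THINKING", "FINALIZING", "FAILED"],
  FINALIZING: ["DONE", "FAILED"],
  DONE: [],
  FAILED: [],
};

// Returns the new state, or throws if the move is not in the table.
function transition(from: TaskState, to: TaskState): TaskState {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```

Routing every state change through one function like this makes an illegal jump (for example, leaving DONE) fail loudly instead of corrupting the task record.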

Each iteration follows a fixed set of steps:

Pre-flight Checks: Check if the user clicked stop (masterAbort), if there was an interjection, and re-filter the visible tool list based on the current context.
callLLM: Retrieve the full messages array from SessionContext, send it to the LLM after redaction and length validation, and receive the LLMTurnResult. Automatic retry mechanisms cover timeouts, 429 rate limits, and 5xx server errors, all using backoff. Structural errors (e.g., completely unparseable API response format) are marked as failures without retry.
Handle No Tool Call: By design, the LLM should always produce a tool call. If it doesn't, a SystemReminder correction message is injected to prompt a tool call response. If it still fails after multiple attempts, it is marked as a failure.
Single-step Enforcement: Even if the LLM returns multiple tool calls in one turn, the engine executes only the first one and defers the rest, re-injecting them into the context for the next turn. There are three reasons: parallel steps can blow up the context window, tools often have implicit order dependencies, and error recovery becomes coarser and harder to diagnose. The cost is an extra LLM round trip per tool call, which means more tokens and more latency, but for Agent workloads the trade-off is worth it because they are rarely sensitive to the cost of a single call.
Persist Assistant Message: Write the LLM's response into SessionContext, marking the contextType as ToolCall or AiMessage, and saving the thinking field if present.
preExecutionChecks: Includes tool loop detection (intercepting if the same tool is called with the same parameters more than five times), working directory boundary checks (ensuring file operations don't exceed specified directories), dangerous operation approval (requiring user confirmation for tasks like deleting files), and plugin hook functions. The check results determine whether to execute or skip.
executeTool: Find the corresponding handler based on the tool name and execute it with the provided parameters. Promise.race is used to wrap the handler and an interrupt signal, allowing immediate termination if the user interrupts.
postExecution: Write the tool's returned content into SessionContext as a ToolResult, process fields like attachments, waitForReply, and finalResult, and push progress information to the frontend.
manageContext: Re-inject deferred tool calls into the context for the next turn, check if the total token count triggers the compression threshold, and summarize large reads for file exploration. Then, return to step one to continue the loop until the LLM calls the message tool to declare task completion or a failure condition is met.
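
The executeTool step's use of Promise.race can be sketched as below. The helper name raceWithAbort is hypothetical; AbortController and AbortSignal are the standard web/Node APIs.

```typescript
// Settles with the handler's result, or rejects as soon as the signal aborts.
async function raceWithAbort<T>(work: Promise<T>, signal: AbortSignal): Promise<T> {
  if (signal.aborted) throw new Error("aborted");
  let onAbort: () => void = () => {};
  const aborted = new Promise<never>((_resolve, reject) => {
    onAbort = () => reject(new Error("aborted"));
    signal.addEventListener("abort", onAbort, { once: true });
  });
  try {
    // Whichever settles first wins the race.
    return await Promise.race([work, aborted]);
  } finally {
    // Always detach the listener so finished calls do not leak handlers.
    signal.removeEventListener("abort", onAbort);
  }
}
```

Promise.race only stops the engine from waiting; it does not cancel the underlying work. A long-running handler should also receive the signal and stop its own I/O when it fires.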

Exit conditions include:

The LLM calls the message tool with type result, actively declaring completion (enters DONE).
A fatal error occurs, such as a structural error or a handler throwing an unrecoverable exception (enters FAILED).
The user calls stop, terminating the task (enters FAILED with a stopped flag).
The maximum iteration limit is reached to prevent infinite loops (enters FAILED).
An interjection with new user input is not an exit: the current turn is interrupted and a new turn starts with the new message.

Three Key Engineering Techniques

Interjection Mechanism: What if the user sends a new message while the LLM is still speaking or a tool is still running? The Gateway calls the interject method of the taskLoop upon receiving a new message, setting an interrupt flag and aborting the current LLM or tool call. The current turn exits immediately, and the flag is consumed in the next turn's pre-flight check, injecting the new message as user input. Without this, a running tool (e.g., a 30-second web crawl) would waste time. At the code level, executeTool uses Promise.race with a toolAbort signal for instant termination.
Resume Checkpoint: Unfinished tasks can continue after a server restart. Every time the system waits for a user reply (ask), it writes metadata.ask.answered = false to the database's assistant entry; for dangerous operation approvals, it writes metadata.approvalPending = true. After a restart, the system scans all tasks; those in waiting_for_reply or waiting_for_approval states are reloaded, wait-parsers are rebuilt, and wait UIs are broadcast to the frontend. This prevents tasks from getting stuck or losing state.
Tool Loop Detection: LLMs often fall into infinite loops calling the same tool with the same parameters (e.g., repeatedly reading a non-existent file). The solution is to keep a sliding window of the last N tool calls, each recorded as a tool name plus a hash of its parameters. If the same combination appears more than a threshold number of times within the window (five in this design), it's flagged as a loop. The engine then injects a SystemReminder saying "You are in a loop, please try a different approach," letting the LLM decide how to adjust. Again, the engine doesn't choose for it, reflecting the "intelligence in the prompt" philosophy.
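
Loop detection needs only a bounded list of call keys. A minimal sketch, assuming a window of 20 calls and a hypothetical LoopDetector class; a canonical JSON string stands in for the parameter hash so that key order in the arguments does not matter.

```typescript
// Serializes a value with object keys sorted, so {a,b} and {b,a} produce the same key.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`);
    return `{${parts.join(",")}}`;
  }
  const primitive = JSON.stringify(value);
  return primitive === undefined ? "null" : primitive;
}

class LoopDetector {
  private readonly recent: string[] = [];
  private readonly windowSize: number;
  private readonly threshold: number;

  constructor(windowSize = 20, threshold = 5) {
    this.windowSize = windowSize;
    this.threshold = threshold;
  }

  // Records a call and returns true when the same (tool, args) pair now
  // appears more than `threshold` times inside the window.
  record(toolName: string, args: unknown): boolean {
    const key = `${toolName}:${stableStringify(args)}`;
    this.recent.push(key);
    if (this.recent.length > this.windowSize) this.recent.shift();
    return this.recent.filter((k) => k === key).length > this.threshold;
  }
}
```

When record returns true, the engine's only action is to append the reminder entry; what to do about the loop stays with the LLM.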

Two-Layer Context Model

Context management is one of the hardest parts to get right. The architecture is split into two layers.

SessionContext lasts for the entire session. It maintains the llmContext (an array of LLM dialogue history), a WebSocket push emitter, and a reference to the current TaskContext. Key methods include append for unified three-way writing, update for updating existing entries, getLLMContext for producing input for the LLM, and loadFromDB for reconstruction from the database. Multiple tasks might run in one session, but the dialogue history seen by the LLM is continuous. SessionContext holds this history across tasks, while TaskContext only stores the runtime state of the current task.

TaskContext lasts for a single task. It holds taskId, userId, modelId, the current list of available tools and tool filtering context, final result flags, failure flags, stop flags, token usage statistics, the loop detector, and callback functions for waiting for replies or approval requests.

The append method follows the Unified Three-Way Principle. This is a critical design. A single call to SessionContext.append must simultaneously:

Persist to the task_context table in the database.
Push to the frontend via WebSocket.
Append to the in-memory llmContext array for the next LLM call.

A message either belongs to the context or it doesn't. If it does, the same content must be seen in all three places. This principle eliminates a common class of bugs where what the LLM sees differs from what's in the database or what the user sees.
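
The three-way write can be sketched with the database and the WebSocket replaced by in-memory stand-ins. EntryStore, Emitter, and SessionContextSketch are hypothetical names; only append and llmContext come from the description above.

```typescript
type ContextEntry = { id: number; role: "user" | "assistant" | "tool"; content: string };

// Stand-ins for the task_context table and the WebSocket emitter.
interface EntryStore { insert(entry: ContextEntry): void }
interface Emitter { emit(entry: ContextEntry): void }

class SessionContextSketch {
  readonly llmContext: ContextEntry[] = [];
  private nextId = 1;
  private readonly store: EntryStore;
  private readonly emitter: Emitter;

  constructor(store: EntryStore, emitter: Emitter) {
    this.store = store;
    this.emitter = emitter;
  }

  // The only write path: one call, three destinations, identical content.
  append(role: ContextEntry["role"], content: string): ContextEntry {
    const entry: ContextEntry = { id: this.nextId++, role, content };
    this.store.insert(entry);    // 1. persist first, so a failed write stops here
    this.emitter.emit(entry);    // 2. push to the frontend
    this.llmContext.push(entry); // 3. make it visible to the next LLM call
    return entry;
  }
}

// Usage with arrays standing in for the database and the socket.
const dbRows: ContextEntry[] = [];
const pushedEntries: ContextEntry[] = [];
const session = new SessionContextSketch(
  { insert: (e) => { dbRows.push(e); } },
  { emit: (e) => { pushedEntries.push(e); } },
);
session.append("user", "hello");
session.append("assistant", "hi");
```

Persisting before pushing is a design choice in this sketch: if the database write throws, nothing reaches the frontend or the LLM, so the three views cannot diverge.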

Each context entry has a ContextType identifier:

UserMessage: Original user message.
AiMessage: Text reply from the AI to the user (via the message tool).
ToolCall: Record of the AI calling a tool (with function_calls info).
ToolResult: Result of tool execution.
SystemReminder: Reminder note injected by the engine.
StatusNotice: Status text injected by the engine (e.g., "retrying").
NoToolCallRetry: Discarded turns due to validation failure.
WorkflowStep: Current step info in a workflow.

The frontend uses ContextType to choose rendering styles. The LLM doesn't distinguish between the types; it only sees role and content. The classification lets the frontend tell real messages apart from engine-injected notices.

The getLLMContext method runs before each LLM call. It:

Retrieves the system prompt (rebuilt each time by task-context-builder to reflect configuration changes).
Retrieves all non-system entries from the in-memory llmContext.
Applies Compaction logic (compressing intermediate history if total tokens exceed the threshold).
Applies Context Window Guard (dropping the oldest Turns if the result is still too long).
Returns the final LLMMessage array.

llmContext itself never stores the system prompt.

Context Compaction

Agent tasks often run for dozens of turns, and tool results can be huge. Reading a 5KB file can consume about 1500 tokens, quickly exhausting a 100K context window. Without compression, the system either errors out or pushes critical information (like the original user requirement) out of the context.

The basic unit of compression is a Turn, not a single message. A Turn is an atomic unit consisting of an assistant's tool_call, all subsequent tool_results, and any immediately following system_reminder. You either keep the whole Turn or compress it into a single-line summary. This is because Anthropic and Gemini APIs will reject requests with a tool_use but no tool_result, or vice versa. Early attempts at message-based compression caused tasks to get stuck with 400 errors.

The compaction strategy is as follows:

Group messages into Turns.
Calculate a tail token budget (e.g., 15% of the maximum tokens).
Iterate from the newest Turn back to the oldest, adding them to the tail until the budget is filled.
Compress the remaining intermediate Turns into a compacted_history block. Each compressed Turn only keeps a one-line XML summary of the tool name and key parameters.

The final structure is: System Prompt + Compacted History Summary Block + Most Recent Full Turns.
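
The tail-budget walk can be sketched as follows. This assumes Turns are already grouped and carry a token count and a one-line summary; the Turn shape, the compact function, and the choice of the user role for the summary block are all hypothetical. The caller is expected to invoke it only once the total exceeds the compression threshold.

```typescript
type Msg = { role: string; content: string };
type Turn = { messages: Msg[]; tokens: number; summary: string };

function compact(turns: Turn[], maxTokens: number, tailRatio = 0.15): Msg[] {
  const budget = maxTokens * tailRatio;
  let used = 0;
  let tailStart = turns.length;
  // Walk newest to oldest, keeping whole Turns while they fit in the budget.
  for (let i = turns.length - 1; i >= 0; i--) {
    const turn = turns[i];
    if (turn === undefined || used + turn.tokens > budget) break;
    used += turn.tokens;
    tailStart = i;
  }
  const older = turns.slice(0, tailStart);
  const tail = turns.slice(tailStart);
  const out: Msg[] = [];
  if (older.length > 0) {
    // Everything before the tail collapses to one summary line per Turn.
    const lines = older.map((t) => t.summary).join("\n");
    out.push({ role: "user", content: `<compacted_history>\n${lines}\n</compacted_history>` });
  }
  for (const t of tail) out.push(...t.messages);
  return out;
}

// Usage: four Turns of 60 tokens each against a 150-token tail budget.
const makeTurn = (n: number): Turn => ({
  messages: [
    { role: "assistant", content: `call ${n}` },
    { role: "tool", content: `result ${n}` },
  ],
  tokens: 60,
  summary: `<turn n="${n}"/>`,
});
const compacted = compact([makeTurn(1), makeTurn(2), makeTurn(3), makeTurn(4)], 1000);
```

In the usage example the two newest Turns survive intact and the two oldest become summary lines, so a tool call is never separated from its result.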

Why aggressive compression instead of rolling summaries? Because it's more reliable to let the LLM re-read an original file when needed than to rely on an engine-generated summary. LLMs can introduce bias or even hallucinate when summarizing. Tool call signatures provide complete information to reproduce the path, which is essentially the only memory an Agent needs. The cost is occasional redundant file reads (mitigated by loop detection), but the benefit is a context that never explodes and is never misled by incorrect summaries.

Context Window Guard is the safety fuse. If the context is still too long after compaction, tokens are counted from the tail forward. If the total exceeds 90% of the context window, the oldest Turns are forcibly removed until it fits. This shouldn't normally trigger; if it does, it usually means a single tool_result exceeded the entire budget, which requires investigation.

System Reminders are injected via an event-driven approach. Two principles apply:

Event-driven, not resident: Reminders like "You are at step 3 of the workflow" are dynamically generated based on current state before each LLM call. Static rules (e.g., "no emojis") belong in the system prompt.
Injected as real entries: With contextType as SystemReminder and role as user, written to SessionContext to ensure consistency across the database, frontend, and LLM.

Typical scenarios include workflow step prompts, prompts when the LLM misses a tool call, loop warnings, and language drift corrections (e.g., "Switch back to Chinese").

The Turn Grouper is a hidden key component. It slices the flat LLMMessage array into a Turn array for Compaction, Context Guard, and Resume modules. Rules: user messages are independent Turns; an assistant message with tool_call plus its tool_results and system_reminder form a Turn; a plain text assistant reply is an independent Turn. Its unit tests are among the most important in the project.
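
The grouping rules translate into a single pass over the flat array. A minimal sketch, assuming a hypothetical FlatMessage shape in which tool calls and engine reminders are marked by boolean flags; a real LLMMessage would carry richer fields.

```typescript
type FlatMessage = {
  role: "user" | "assistant" | "tool";
  content: string;
  hasToolCall?: boolean;      // assistant message that carries a tool call
  isSystemReminder?: boolean; // engine-injected reminder (sent with role "user")
};

function groupTurns(messages: FlatMessage[]): FlatMessage[][] {
  const turns: FlatMessage[][] = [];
  // The tool-call Turn that is still accepting results and reminders, if any.
  let current: FlatMessage[] | null = null;
  for (const m of messages) {
    const attaches = m.role === "tool" || m.isSystemReminder === true;
    if (current !== null && attaches) {
      current.push(m);
      continue;
    }
    // Anything else starts a new Turn; only a tool-call Turn stays open.
    const turn = [m];
    turns.push(turn);
    current = m.role === "assistant" && m.hasToolCall === true ? turn : null;
  }
  return turns;
}
```

An orphaned tool result with no open Turn becomes a Turn of its own in this sketch; a stricter implementation might treat that as an error, and it is the kind of edge case the unit tests should pin down.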

Tool System

Tools are the Agent's interface to the external world. The LLM uses them to read files, run commands, search for info, and send messages. Three roles are involved:

ToolDefinition: Seen by the LLM (name, description, parameter JSON Schema, execution target).
ToolHandler: Seen by the engine (an async function receiving parameters and context). It should behave like a pure function from the engine's point of view, which makes it easy to unit test: it does not write to the database or push to WebSockets directly, and instead declares those side effects via its return value.
ToolPolicy: Defines admin permission requirements and danger levels.

Tool registration uses a side-effect import pattern. Each tool file registers itself to a global registry upon import. tools/index.ts imports all handler modules, and a single line in task-loop.ts (import "./tools") triggers the entire registration chain.

ToolHandlerResult uses a declarative side-effect model. The Handler doesn't perform side effects directly but returns structured instructions. Required field: content (tool result for the LLM). Optional fields: metadata, finalResult (marks task completion), shouldBreak (forces loop exit), summary (custom compaction summary), and attachments.
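
The result shape and a handler that uses it can be sketched as below. The field names follow the description above, but their exact types, the one-field ToolContext, and the messageHandler example are assumptions for illustration.

```typescript
// Field names follow the description above; their exact types are assumptions.
type ToolHandlerResult = {
  content: string;                    // required: what the LLM sees as the tool result
  metadata?: Record<string, unknown>;
  finalResult?: boolean;              // marks task completion
  shouldBreak?: boolean;              // forces the loop to exit
  summary?: string;                   // custom one-line compaction summary
  attachments?: string[];
};
type ToolContext = { taskId: string }; // hypothetical, trimmed to one field
type ToolHandler = (
  args: Record<string, unknown>,
  ctx: ToolContext,
) => Promise<ToolHandlerResult>;

// A message-style handler: it touches no database and no socket,
// and only describes what should happen via its return value.
const messageHandler: ToolHandler = async (args, ctx) => {
  const type = String(args["type"] ?? "info");
  const text = String(args["text"] ?? "");
  return {
    content: text,
    finalResult: type === "result",
    summary: `message(type=${type})`,
    metadata: { taskId: ctx.taskId },
  };
};
```

Because the handler only returns data, a unit test can call it directly and assert on the result without any engine, database, or frontend in place.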

The tool execution flow: After the LLM returns toolCalls, arguments are parsed and type-coerced via coerceToolArgs (LLMs often return numbers or booleans as strings). Then, preExecutionChecks (loop detection, file system boundaries, dangerous operation approvals, plugin hooks) determine whether to execute or skip. If executed, the handler is called, and postExecution persists the tool_result, pushes progress cards to the frontend, and handles special fields.

coerceToolArgs is a life-saving design. It recursively corrects type mismatches based on the tool's parameter schema. Without it, a handler's "greater than" comparison can silently go wrong when both sides are strings: "10" > "9" is false under string comparison. This centralization keeps handler code clean and prevents redundant defensive conversions.
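
A recursive coercion pass over a small subset of JSON Schema might look like this. It is a sketch of the idea, not the original coerceToolArgs: the Schema type covers only type, properties, and items, and values that cannot be converted cleanly are left untouched for later validation to reject.

```typescript
type Schema = {
  type?: "string" | "number" | "integer" | "boolean" | "object" | "array";
  properties?: Record<string, Schema>;
  items?: Schema;
};

function coerceToolArgs(value: unknown, schema: Schema): unknown {
  switch (schema.type) {
    case "number":
    case "integer": {
      // "10" -> 10, but leave "abc" and "" alone.
      if (typeof value === "string" && value.trim() !== "") {
        const n = Number(value);
        if (Number.isFinite(n)) return n;
      }
      return value;
    }
    case "boolean":
      if (value === "true") return true;
      if (value === "false") return false;
      return value;
    case "array": {
      const items = schema.items;
      if (Array.isArray(value) && items !== undefined) {
        return value.map((v) => coerceToolArgs(v, items));
      }
      return value;
    }
    case "object": {
      if (value === null || typeof value !== "object" || Array.isArray(value)) return value;
      const out: Record<string, unknown> = { ...(value as Record<string, unknown>) };
      for (const [key, sub] of Object.entries(schema.properties ?? {})) {
        if (key in out) out[key] = coerceToolArgs(out[key], sub);
      }
      return out;
    }
    default:
      return value;
  }
}

// Usage: the LLM returned every scalar as a string.
const fileSchema: Schema = {
  type: "object",
  properties: {
    limit: { type: "integer" },
    recursive: { type: "boolean" },
    tags: { type: "array", items: { type: "number" } },
    q: { type: "string" },
  },
};
const coerced = coerceToolArgs(
  { limit: "10", recursive: "false", tags: ["1", "2"], q: "x" },
  fileSchema,
);
console.log(JSON.stringify(coerced));
// prints: {"limit":10,"recursive":false,"tags":[1,2],"q":"x"}
```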

The Tool List seen by the LLM is not static; it's recomputed each iteration. Filters include:

User Role: Regular users can't see admin tools.
Custom Blacklist: Users or the Agent can block certain tools.
Browser State: Hide browser_click if the browser isn't open.
Workflow State: Hide the plan tool in workflow mode.

Dynamic calculation saves tokens and reduces miscalls. A filterKey is used to rebuild the list only when necessary to avoid breaking prompt cache.
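
The four filters and the filterKey idea can be sketched as below. The flag names on ToolDef, the FilterContext shape, and the ToolListCache class are hypothetical; the source only states that a filterKey gates rebuilding.

```typescript
type ToolDef = {
  name: string;
  adminOnly?: boolean;
  requiresBrowser?: boolean;
  hiddenInWorkflow?: boolean;
};
type FilterContext = {
  isAdmin: boolean;
  blacklist: string[];
  browserOpen: boolean;
  workflowMode: boolean;
};

// A compact fingerprint of everything the filters depend on.
function filterKey(ctx: FilterContext): string {
  return JSON.stringify([ctx.isAdmin, [...ctx.blacklist].sort(), ctx.browserOpen, ctx.workflowMode]);
}

function visibleTools(all: ToolDef[], ctx: FilterContext): ToolDef[] {
  return all.filter(
    (t) =>
      (t.adminOnly !== true || ctx.isAdmin) &&
      !ctx.blacklist.includes(t.name) &&
      (t.requiresBrowser !== true || ctx.browserOpen) &&
      (t.hiddenInWorkflow !== true || !ctx.workflowMode),
  );
}

class ToolListCache {
  private key = "";
  private list: ToolDef[] = [];
  private readonly all: ToolDef[];

  constructor(all: ToolDef[]) {
    this.all = all;
  }

  // Rebuilds only when the fingerprint changes, so an unchanged context
  // yields the same list and the same serialized prompt prefix.
  get(ctx: FilterContext): ToolDef[] {
    const k = filterKey(ctx);
    if (k !== this.key) {
      this.key = k;
      this.list = visibleTools(this.all, ctx);
    }
    return this.list;
  }
}
```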

Each tool can declare a dangerLevel:

safe: Read/query operations, executed directly.
moderate: Modifying local state (e.g., writing files), executed directly but with detailed logs.
dangerous: Irreversible or global impact operations (e.g., rm -rf, git push --force), requiring user approval.

The design philosophy is few but versatile. Use one shell tool instead of separate npm_install, git_commit, and docker_run tools; use one file tool with five actions; use one search tool with seven types. This reduces the tool count from over thirty to about twelve, significantly shrinking the prompt and reducing the LLM's "choice paralysis." Special operations are handled via parameters.

Tools can register a compactTemplate for the Compaction phase to turn a full tool call into a one-line XML summary.

LLM Protocol Adaptation Layer

A major challenge is that different LLM providers (OpenAI, Anthropic, Gemini, local vLLM) have different protocols and tool-calling semantics. This encapsulation layer is crucial.

Three Tool Calling modes:

openai_native: Tools use API fields with strict name matching. Suitable for most models like DeepSeek, Kimi, and direct GPT connections. Mature protocol, but some proxies might have bugs.
anthropic_native: Uses Anthropic's API, supporting prompt cache and thinking fields. Only for direct Anthropic connections or compatible providers.
anthropic_xml: The most unique mode. Tool descriptions are injected into the system prompt, and responses are extracted via regex from the content. It doesn't rely on the API's tools field, making it a reliable fallback for problematic proxies that might drop fields when translating to OpenAI protocol. The cost is a need for robust regex parsing and slightly higher token overhead.

The XML Fallback mode implementation: The API request body doesn't include a tools field. Instead, tool descriptions are injected as a functions XML block in the system prompt, with rules telling the LLM to use a specific XML format for tool calls. The engine then parses function_calls from the assistant's content.
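
Parsing tool calls out of the assistant's text can be sketched with two regular expressions. The function_calls wrapper comes from the description above; the inner call and arg tag names are hypothetical stand-ins for whatever format the system prompt prescribes.

```typescript
type ParsedCall = { name: string; args: Record<string, string> };

// Extracts calls from the first <function_calls> block in the reply, if any.
function parseFunctionCalls(content: string): ParsedCall[] {
  const block = /<function_calls>([\s\S]*?)<\/function_calls>/.exec(content);
  if (block === null) return [];
  const body = block[1] ?? "";
  const calls: ParsedCall[] = [];
  const callRe = /<call name="([^"]+)">([\s\S]*?)<\/call>/g;
  let call: RegExpExecArray | null;
  while ((call = callRe.exec(body)) !== null) {
    const args: Record<string, string> = {};
    const argRe = /<arg name="([^"]+)">([\s\S]*?)<\/arg>/g;
    let arg: RegExpExecArray | null;
    while ((arg = argRe.exec(call[2] ?? "")) !== null) {
      args[arg[1] ?? ""] = (arg[2] ?? "").trim();
    }
    calls.push({ name: call[1] ?? "", args });
  }
  return calls;
}

// Usage: a reply that mixes prose with one tool call.
const reply = [
  "I will read the file first.",
  "<function_calls>",
  '<call name="file">',
  '<arg name="action">read</arg>',
  '<arg name="path">a.txt</arg>',
  "</call>",
  "</function_calls>",
].join("\n");
const parsedCalls = parseFunctionCalls(reply);
```

Every argument arrives as a string in this mode, which is one more reason to run schema-based type coercion before the handler sees the values.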

invoke.ts is the main entry point. It:

Resolves logical models to multiple providers ranked by priority (resolveProviderModels).
Implements a Circuit Breaker to temporarily skip failing providers.
Tries providers in failover order.
Handles provider quirks (e.g., different thinking field locations).

protocol-convert.ts handles conversion between OpenAI and Anthropic message formats. The most complex parts are the different positions of system messages, multimodal content formats, and the placement/format of tool_use and tool_result blocks.

response-parse.ts unifies all provider responses into a single LLMTurnResult structure, containing content, toolCalls, toolCallSource (native or xml), thinking, usage, and redacted messages for logs. The caller only interacts with this structure.

normalize.ts smooths out differences in provider responses (e.g., DeepSeek's thinking in reasoning_content vs. Anthropic's in a content block).

Secret Redaction: User-configured API keys or passwords are registered in a secrets table. Any literal matches in the context are replaced with environment variable references. Before calling the LLM, redaction ensures the model never sees the original secret. If the model generates a placeholder, it's restored before tool execution. This is a fundamental security principle to prevent secrets from entering LLM logs.
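
Redaction and restoration are two string substitutions around the LLM call. A minimal sketch, assuming a $ENV_NAME placeholder format and hypothetical redact and restore helpers; the source only says that literals are replaced with environment variable references.

```typescript
type SecretEntry = { envName: string; value: string };

// Replaces every literal secret with its placeholder before the LLM call.
function redact(text: string, secrets: SecretEntry[]): string {
  // Longest values first, so a secret containing another is masked as a whole.
  const ordered = [...secrets].sort((a, b) => b.value.length - a.value.length);
  let out = text;
  for (const s of ordered) {
    if (s.value !== "") out = out.split(s.value).join("$" + s.envName);
  }
  return out;
}

// Puts the real values back just before a tool executes.
function restore(text: string, secrets: SecretEntry[]): string {
  // Longest names first, so $API_KEY does not clobber $API_KEY_2.
  const ordered = [...secrets].sort((a, b) => b.envName.length - a.envName.length);
  let out = text;
  for (const s of ordered) out = out.split("$" + s.envName).join(s.value);
  return out;
}

// Usage
const registeredSecrets: SecretEntry[] = [{ envName: "OPENAI_API_KEY", value: "sk-abc123" }];
const rawCommand = "curl -H 'Authorization: Bearer sk-abc123'";
const maskedCommand = redact(rawCommand, registeredSecrets);
console.log(maskedCommand);
// prints: curl -H 'Authorization: Bearer $OPENAI_API_KEY'
```

Using split and join avoids regular expressions entirely, so secrets containing characters such as "." or "+" need no escaping.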

Prompt Cache Boundary: Most of the system prompt is stable (identity, capabilities, tool descriptions), while a small part is dynamic (date, user profile). Anthropic's prompt cache can cut the input cost of the cached prefix by up to about 90% and noticeably reduce latency, but it requires a stable cache boundary. This is achieved by inserting a zero-width invisible marker string. invoke.ts splits the prompt at this marker into a stable prefix (with cache_control) and a dynamic suffix.
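
Splitting at the marker can be sketched as follows. The marker value and the splitSystemPrompt helper are hypothetical; the block shape with cache_control of type "ephemeral" follows Anthropic's Messages API format for system text blocks.

```typescript
// Hypothetical marker: a run of zero-width characters unlikely to occur in normal text.
const CACHE_BOUNDARY = "\u200B\u200C\u200B";

type SystemBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

// Returns the system prompt as blocks: a cacheable stable prefix, then the dynamic rest.
function splitSystemPrompt(prompt: string): SystemBlock[] {
  const at = prompt.indexOf(CACHE_BOUNDARY);
  // No marker: send the prompt as a single uncached block.
  if (at === -1) return [{ type: "text", text: prompt }];
  const stable = prompt.slice(0, at);
  const dynamic = prompt.slice(at + CACHE_BOUNDARY.length);
  const blocks: SystemBlock[] = [
    { type: "text", text: stable, cache_control: { type: "ephemeral" } },
  ];
  if (dynamic !== "") blocks.push({ type: "text", text: dynamic });
  return blocks;
}
```

Everything that changes between calls (date, user profile) must sit after the marker; a single changed character before it invalidates the cached prefix.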

Circuit Breaker: Maintains the success rate of the last N calls for each provider. If failures exceed a threshold, the provider is marked unhealthy and skipped for a cooling period. A "half-open" state then tests the provider before full restoration. This prevents the poor experience of waiting for repeated timeouts when a primary provider is down.
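
The closed, open, and half-open cycle can be sketched as a small class. The CircuitBreaker name, its constructor parameters, and the injectable clock are assumptions made so the example can be tested without waiting for real time to pass.

```typescript
type BreakerState = "closed" | "open" | "half_open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private results: boolean[] = [];
  private openedAt = 0;
  private readonly windowSize: number;
  private readonly maxFailures: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(windowSize: number, maxFailures: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.windowSize = windowSize;
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // False while the provider is cooling down; after the cooldown one probe is allowed.
  canAttempt(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half_open";
    }
    return this.state !== "open";
  }

  record(success: boolean): void {
    if (this.state === "half_open") {
      // The probe decides: success restores the provider, failure reopens the breaker.
      if (success) this.reset();
      else this.trip();
      return;
    }
    this.results.push(success);
    if (this.results.length > this.windowSize) this.results.shift();
    const failures = this.results.filter((ok) => !ok).length;
    if (failures > this.maxFailures) this.trip();
  }

  private trip(): void {
    this.state = "open";
    this.openedAt = this.now();
    this.results = [];
  }

  private reset(): void {
    this.state = "closed";
    this.results = [];
  }
}
```

In a failover list, the caller would check canAttempt for each provider in priority order and skip those that return false.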

Search and Web Access Strategy

An Agent's research capability depends on its search tool design. Four core conclusions:

Don't use the model's native search; use a third-party search API (e.g., Tavily).
Search should only retrieve summaries and candidate URLs; use a browser tool to fetch full content as needed.
Categorize search tools by type.
Maintain full control over every web access.

Why not native search? It has four unacceptable issues: it's unobservable (you don't know the queries or pages), uncacheable (repeated searches can't be deduplicated), inflexible (you can't add site filters or time limits), and vendor-locked (each vendor's retrieval scope and result parsing are black boxes that differ between Claude and GPT). An independent tool provides complete control.

Search types include: info (general), news (appends "latest news"), research (appends "research paper"), api (appends "API documentation"), data (appends "dataset"), image, and tool. This forces the LLM to clarify its intent and improves recall quality.
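
The type-to-suffix mapping is a lookup table. The suffixes below are the ones listed above; leaving image and tool without a suffix is an assumption, since the source does not say how those two types are handled, and buildQuery is a hypothetical helper name.

```typescript
type SearchType = "info" | "news" | "research" | "api" | "data" | "image" | "tool";

const QUERY_SUFFIX: Record<SearchType, string> = {
  info: "",
  news: "latest news",
  research: "research paper",
  api: "API documentation",
  data: "dataset",
  image: "", // assumption: no suffix given in the source
  tool: "",  // assumption: no suffix given in the source
};

// Builds the string sent to the search API for a given intent.
function buildQuery(query: string, type: SearchType): string {
  const base = query.trim();
  const suffix = QUERY_SUFFIX[type];
  return suffix === "" ? base : `${base} ${suffix}`;
}

console.log(buildQuery("rust async runtimes", "news"));
// prints: rust async runtimes latest news
```

Typing the parameter as a closed union means the tool's JSON Schema can expose it as an enum, which is what forces the LLM to state its intent.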

Tavily is a recommended search implementation, providing structured results with LLM-friendly summaries. Alternatives include Bocha (optimized for Chinese sites like Zhihu and WeChat), Brave Search, Exa (for deep research), and Serper.

The search tool automatically appends a prompt guiding the LLM to use the browser tool to visit source URLs for full content. This two-step strategy (search for a list, browser for full text) is much more efficient than trying to search for full text in one go.

The browser toolset includes visit, view (accessibility tree), click, input, scroll, screenshot, and fetch_url (converting a URL to Markdown). fetch_url is the most common, ideally using a reader API (like Jina) to get clean Markdown, falling back to a headless browser (Playwright) if needed.

Four Architectural Redlines

The Engine must not alter the semantics of the LLM context: It can only append new entries or add metadata. Never modify already-written LLM-facing text.
Tool Handlers must not persist data directly: They only return structured ToolHandlerResult. This makes tools testable and allows for re-running tool calls.
Strict single-step per turn: Even if the LLM returns multiple tool calls, execute only the first and defer the rest. This prevents context explosion and simplifies error recovery.
All LLM-facing injections must be visible: System Reminders should be real entries in SessionContext, not temporary additions. This ensures that what you see in the logs is exactly what the LLM saw, enabling reliable debugging and replays.

Four "Don'ts" of Context Management

Don't bypass append to write directly to the database (sync issues).
Don't push temporary content to messages before calling the LLM without a database record (debugging nightmare).
Don't split history by message instead of by Turn (breaks tool_call/tool_result pairing).
Don't use LLM summaries as a substitute for re-reading original data (avoids bias and improves reliability).

Pitfalls to Avoid

Avoid over-engineering. For example, a heavy workstation sandbox mechanism designed for multi-tenant cloud deployment is unnecessary for a local Agent; direct execution on the host is sufficient. Provider portals and billing modules are also unnecessary initially. For the database, avoid complex multi-tenant schemes; a local Agent can start with SQLite and five core tables (tasks, task_context, tool_call_logs, sessions, llm_logs). Migration to PostgreSQL or MySQL is only needed for multi-user shared instances or when the database exceeds 50GB.

Path to Replication from Scratch

The process can be divided into six stages:

Stage 0: Environment and Scaffolding (0.5 days): Tech stack selection (TypeScript, Node 22+, pnpm, Vite, Vitest, Zod, Better-SQLite3, Drizzle ORM, Hono/Fastify).
Stage 1: Minimal Loop (1 day): A basic loop (user message -> tool call -> result) with in-memory context, supporting only OpenAI protocol and two tools (message, echo).
Stage 2: LLM Protocol Adaptation (1 day): Support for openai_native and anthropic_native modes.
Stage 3: Context Persistence and Compaction (1-2 days): Database storage and history compression.
Stage 4: Full Toolset and Policy (2-3 days): Production-grade tools and permission control.
Stage 5: Frontend, Streaming, and Resume (3-5 days): Streaming output, interjection support, and task recovery after power loss.

A demo using a command-line interface can be built in two weeks, with a full-featured version taking about three to four weeks.