← AI Coding Guides β€Ί Deep Dives
Architecture β€’ Tools β€’ Shell β€’ Protocols

How these coding agents are actually built

Under the surface, these repos cluster into a few distinct shapes: bespoke terminal runtimes, provider multiplexers, protocol bridges, and orchestration frameworks. Their tool design choices tell you which camp they belong to.

Generated Architecture Map
Hand-drawn diagram grouping coding agents into bespoke runtimes, provider multiplexers, protocol bridges, orchestration platforms, and special cases around a local repo snapshot.
This map compresses the repo set into the runtime families that keep showing up across the rest of the field guide.

(Alright, ad over. Back to the serious technical analysis.)

The eight architecture families

Bespoke terminal kernels

Claude Code, Crush, and Codex feel like software products first. Tool schemas, permissions, terminal UX, and recovery behavior are baked into the runtime rather than bolted on through a generic orchestration layer. Crush is uniquely notable for shipping native LSP integration (diagnostics and symbol references) and Sourcegraph code search as first-class built-in tools. Codex stands out as a Rust-native binary built around a Ratatui TUI, organized as a 70+ crate Cargo workspace, with platform-specific sandboxes (Seatbelt on macOS, bubblewrap on Linux, Windows tokens).

Provider-matrix CLIs

Mux, Neovate, and Qwen Code invest heavily in model registries, provider catalogs, and config resolution. The runtime exists partly to normalize many providers into one interface.

Protocol-heavy adapters

Pochi and Kimi CLI are notable for bridge code: MCP adapters, ACP translation, vendor packages, and importers from adjacent ecosystems.

Sandbox and orchestration platforms

DeerFlow and OpenHands focus on multi-step execution environments. They care about app servers, middleware, sandboxes, and delegating work as much as a single interactive CLI session.

Client-server agent runtimes

OpenCode adds a fifth shape to the landscape: a shared backend runtime that powers the TUI, browser console, desktop app, SDK, MCP layer, ACP layer, permissions, and worktrees.

Publishable framework ecosystems

ADK-Rust pushes harder than any other repo toward a crate-first design. Feature tiers, typed tools, graph agents, protocol crates, and optional frontier modules make it feel less like a product shell and more like a reusable Rust agent stack.

Power-tool terminal platforms

Oh My Pi sits between Pi Mono's kernel purity and OpenCode's platform runtime. It remains CLI-first, but layers in a Rust native engine, default hashline edits, MCP and plugin discovery, browser tooling, and task/subagent infrastructure.

Self-improving multi-platform agents

Hermes (Nous Research) is in a category of its own: persistent skill learning, multi-platform messaging gateways, RL training pipelines, and a MoA synthesis tool. Not a coding-only agent, but the most feature-rich architecture in the set.

How tool calls are represented

Repo Tool representation What stands out Editing style
Claude Code Typed internal tool modules with schemas, metadata, UI, and permission hooks Tools are product features, not just JSON functions. There are dedicated plan, task, worktree, slash-command, and MCP surfaces. Bespoke file tools and patch flows under a large central runtime
Crush Go implementations paired with self-describing tool docs Feels handcrafted. The tooling layer is readable, cohesive, and consistent with the terminal product. Custom file, shell, and LSP-oriented operations
Qwen Code Declarative tools translated into provider-facing function declarations The tool system is cleanly separated from model config, confirmation rules, and MCP discovery. Structured edit tools plus guarded shell execution
Neovate Code AI SDK-style tools with strong typing and bash guards Its bash tool is unusually opinionated about banned commands, substitutions, and risky patterns. CLI tools with explicit safety checks and MCP conversion
Pochi Built-in tool map first, MCP tools second Mixes direct tools, background jobs, diff application, and vendor-specific agent packages. Search/replace and diff-centric editing with worktree workflows
Kimi CLI Python tool modules and ACP conversion blocks Notable bridge layer that converts internal diffs and tool results into ACP-friendly structures. Terminal and edit tools shaped for protocol export
DeerFlow LangGraph/LangChain tool assembly from config, MCP, and subagent sources Tools are part of a harness. Middleware and role configuration matter as much as the tools themselves. Sandboxed file tools plus delegated subagent execution
OpenHands Legacy function-calling tools mapped into actions The local repo still shows classic CodeAct function tooling, but it is not the full story for the newer system. Sandbox actions such as bash execution and string-replace editing
Mux Broader workspace/runtime orchestration around provider-aware agents Less obviously tool-schema-centric in the docs I read; more focused on workspaces, execution environments, and provider routing. Workspace-driven execution with desktop/browser product framing
Hermes Python tool modules with COMMAND_REGISTRY autodiscovery pattern A single CommandDef list auto-derives CLI autocomplete, Telegram menu, Slack subcommand map, and gateway help simultaneously. MoA and Delegate tools are unique in this set. Standard read/write/shell tools plus skill_manager, MoA synthesis, sub-agent delegation
Pi Mono TypeBox JSON schemas with AJV validation, 7 core tools Multi-disjoint edits per call, file mutation queue, fuzzy matching (Unicode normalization), uniqueness validation. Extensions can register custom tools with full TypeBox schemas. Precise text replacement (not diffs) with reverse-order application, line ending preservation, BOM handling
Oh My Pi Mode-routed tool layer with native modules and prompt helpers One edit tool fans into replace, patch, hashline, vim, and apply_patch. Prompt helpers generate real anchor examples, and Rust-backed modules accelerate heavy tool paths. Hashline anchors by default, with other edit modes still available when the situation calls for them
OpenCode AI SDK tools plus dedicated runtime services Pairs a heavyweight apply_patch tool with a smaller exact-match edit tool, then wires both into a permission bus, format-on-write hooks, and runtime events. Patch-based edits for structural changes, exact string replacement for fast paths
ADK-Rust Rust traits, #[tool] macro, provider-native built-ins Tool definitions are compile-time types first. Anthropic and OpenAI built-ins are surfaced as native provider contracts, while broader capabilities live in separate crates. Provider-native editor wrappers with thin local executors, not a giant house patch protocol
Codex codex-tools crate with ToolSpec/ToolDefinition, JSON Schema via schemars, 20+ built-in tools Rust-native tool definitions derive JSON Schema automatically. The tool registry is tightly integrated with the Ratatui TUI and permission model. Structured tool calls with Rust-type safety, integrated with shell and file operations

The big divider is whether tools are treated as a neutral transport format or as a first-class product surface. Claude Code and Crush are on the product side of that line.

Shell and CLI execution

Repo Shell model Background support Guardrails
Claude Code Rich shell and process tooling inside the main runtime Yes Permission modes, explicit tool policies, and runtime-level orchestration
Neovate Code Bash tool with timeout, truncation, and risk checks Some long-running cases Banned commands, command-substitution checks, and high-risk detection
Qwen Code Shell tool with command parsing and read-only detection Yes Permission decisions and shell classification before execution
Pochi Command execution plus explicit background-job support Yes Separate tools for foreground commands and long-running jobs
Kimi CLI Fresh shell-oriented tool calls plus shell mode UX Yes, via task management Task-oriented UX and protocol-aware terminal capability handling
Crush Native Go shell service integrated with permissions and TUI Yes Product-level permissions, service boundaries, and custom runtime control
DeerFlow Sandboxed bash-like operations inside a harness Yes Middleware plus sandbox abstractions keep execution constrained
OpenHands Sandbox bash actions exposed to the agent Yes Isolation comes from the runtime sandbox more than the tool schema itself
Mux Workspace and runtime execution, often closer to a persistent environment Yes Provider routing is central; shell policy is less front-and-center in the docs than in Neovate or Claude
Hermes Shell tool plus 6 remote execution backends: local, Docker, SSH, Daytona, Modal, Singularity Yes Tirith binary verifies execution environment authenticity via SHA-256 + cosign provenance before runs
Pi Mono Pluggable BashOperations interface with streaming output, detached process trees Yes (via detached processes + killProcessTree) Extension-based (BashSpawnHook), commandPrefix option, no built-in bans β€” security is the user's responsibility (container or extension)
OpenCode Runtime-managed shell tool with shell-family detection and metadata capture Yes Permission bus, queued approvals, Windows cleanup handling, and transport-aware command execution
Codex zsh-fork backend (macOS), Unix escalation (others), execpolicy rule engine Yes Platform-specific sandboxes: Seatbelt (macOS), bubblewrap (Linux), Windows tokens; execpolicy rule engine for fine-grained command authorization

Shell security depth comparison

The agents vary enormously in how much engineering goes into preventing the bash tool from doing damage. Here is the full spectrum from most to least defensive:

Claude Code β€” Tree-sitter AST analysis and Zsh attack catalog

Claude Code imports tree-sitter shell grammar to parse command ASTs before execution. The bashSecurity.ts file contains a catalog of Zsh-specific attack vectors:

Attack pattern What it does How detected
zmodload Loads Zsh modules including network, file descriptor, and cryptographic modules Command name prefix match
emulate -c Evaluate code in a sub-shell with emulated environment β€” effectively eval Flag pattern match
sysopen / sysread / syswrite Low-level file descriptor operations from zsh/system module Command name match
=cmd (EQUALS expansion) Resolves to the full path of cmd, bypassing binary-name blocklists AST token shape
<() / >() Zsh process substitution, creates anonymous FIFOs AST subtree match
$() / backtick substitution Classic command substitution β€” but caught via tree-sitter, not regex AST node type
<# PowerShell-style comment, unexpected in a bash context β€” flags context confusion Token match

Neovate Code β€” Quote-aware pipeline parser

Neovate's bash tool uses a character-level state machine to handle quoting correctly before any security check:

// State machine tracks: inSingleQuote, inDoubleQuote, escaping
// splitPipelineSegments() respects quoting so 'echo "a|b"' is ONE segment
// hasCommandSubstitution() tracks same states to find $() and backticks

The hard-coded banned commands include some surprises beyond the obvious: aria2c, axel, curlie, http-prompt, httpie, links, lynx, w3m, xh (all web fetchers), plus shell alternatives bash, sh, fish, zsh, and the dangerous utility trio nc, telnet, eval.

Hermes β€” Binary provenance verification

Hermes takes a different approach: rather than blocking specific commands, it verifies the execution environment. The Tirith security binary is downloaded from GitHub and its SHA-256 hash is verified against a known-good value. On supported platforms, cosign provenance attestation is also checked. This is supply-chain security, not just command blocking.

Crush β€” Product-level permission gates

Crush integrates a permission system directly into its TUI. Before executing dangerous operations, the user sees a permission prompt. This is a UX-level defense rather than a parse-level one β€” appropriate for an interactive tool where the user is present.

Loop and stuck detection

What happens when an agent is spinning its wheels? Most agents don't have explicit detection. Two in this set do:

Crush β€” SHA-256 tool signature hashing

For each step, Crush computes a signature by SHA-256 hashing the concatenation of tool_name + "\x00" + tool_input + "\x00" + tool_output for every tool call in that step. It slides a window over the last 10 steps. If any signature appears more than 5 times in that window, the agent is halted as stuck.

This is robust: calling the same tool with different arguments gets a different hash. Calling it with the same arguments but getting different output (e.g., due to a flaky command) also gets a different hash. Only genuine repetition triggers the halt.

// internal/agent/loop_detection.go
windowSize = 10  // last N steps to check
maxRepeats  = 5  // halt if any signature appears this many times

Hermes β€” Per-session trajectory tracking

Hermes exports full trajectory data in <tool_call> XML tags wrapping JSON β€” a format that matches Nous Research's Hermes model fine-tuning data format. This trajectory data can be used post-hoc for RL training to reward or penalize specific action sequences.

The context compressor's iterative re-compression also detects when the same summary is being generated repeatedly (diminishing new information), allowing the RL training signal to identify "summary convergence" as a proxy for being stuck on a task.

πŸ’‘

Why most agents don't have this

Most agents rely on the model to notice it is repeating itself. Crush's explicit detection is a sign of production experience: models sometimes don't notice, and users definitely don't want to watch 50 identical tool calls scroll by.

Prompt caching and context compaction are separate layers

These repos keep surfacing the same architectural split: prompt caching tries to preserve a stable prefix so the provider can reuse it, while context compaction rewrites history so the next request fits in the model window. The best harnesses treat those as cooperating but distinct systems.

Problem Primary mechanism Strong examples
Keep repeat turns cheap Provider-side prompt caching Claude Code snapshot, Dirac, DeerFlow, ADK-Rust, OpenCode
Make the next request fit Compaction, folding, pruning, summarization Claude Code snapshot, Goose, Pi Mono, Neovate, DeerFlow, Reasonix
πŸ”—

New deep dives

See Prompt Caching That Actually Reaches the API for provider-side cache mechanics, and When the Window Fills Up for the compaction/folding side of the story.

Context compression strategies

When context windows fill up, the serious repos do more than "summarize everything." The dominant pattern is layered defense: trim or prune obvious waste first, preserve a recent tail, then insert a structured summary or folded history only if the request is still too large.

Agent Strategy Key detail
Claude Code Auto-compaction plus reactive retry Threshold-triggered compaction lives inside the runtime, and prompt-too-long failures can fall back to reactive truncation rather than just aborting.
Goose Message compaction plus tool-pair summarization Summaries become agent-visible continuation state while the original messages can remain user-visible.
Pi Mono Structured compaction with tracked file state Preserves goal/progress/decisions plus read-file and modified-file carry-forward across compactions.
Neovate Code Pruning before compaction Uses triggerRatio, protected tools, recent-turn preservation, and a separate pruning phase before full compaction.
Open Claude Code Micro-compaction before full summary Truncates stale tool results first, then replaces older history with a short summary if that still is not enough.
DeepSeek Reasonix History folding plus emergency truncate Real folding logic keeps a recent tail while preserving pinned skills and pinned constraints; byte limits matter alongside token limits.
DeerFlow Summarization middleware Middleware chooses cut points and rescues recent skill-file reads before summarization.
Weak evidence cases Do not overclaim Hermes, OpenHands, ADK-Rust, and Oh My Pi should be treated more cautiously than earlier first-pass summaries suggested.

Agent state machines and retry strategies

How agents handle the "what to do next" decision after each turn reveals their architectural maturity. Three agents have explicit named state transitions:

Claude Code β€” 10 terminal states, 8 continue reasons

The query engine (src/query/transitions.ts) is a named state machine with explicit exit reasons. Terminal exits: 'completed', 'blocking_limit', 'image_error', 'model_error', 'aborted_streaming', 'aborted_tools', 'prompt_too_long', 'stop_hook_prevented', 'hook_stopped', 'max_turns'.

Continue reasons (agent loops back): 'tool_use', 'reactive_compact_retry', 'max_output_tokens_recovery', 'max_output_tokens_escalate', 'collapse_drain_retry', 'stop_hook_blocking', 'token_budget_continuation', 'queued_command'. Each is a distinct named transition, not a generic "keep going" flag. Token budget fires at COMPLETION_THRESHOLD = 0.9; if per-check delta falls below DIMINISHING_THRESHOLD = 500 tokens three times, the agent is considered complete.

OpenHands β€” Temperature bumping on dead LLM responses

OpenHands' retry mixin (openhands/llm/retry_mixin.py) uses the tenacity library with a documented intentional quirk: on LLMNoResponseError when temperature is 0, it automatically sets temperature = 1.0 on the next attempt.

The reasoning: a fully deterministic model (temp=0) that returns nothing is stuck in a degenerate fixed point and will return nothing again. Adding randomness breaks the loop. This is one of the more thoughtful LLM retry patterns in the set β€” it adapts the request rather than just retrying identically.

DeerFlow β€” 200-line buckets to prevent false loop detection

DeerFlow's LoopDetectionMiddleware hashes tool name + input + output to detect repetition. But for read_file, line numbers are bucketed into 200-line groups before hashing β€” because paginated file reads of the same file look identical to the naive algorithm, but are legitimate progress.

On warn (3 repeats): inject HumanMessage("you are repeating yourself β€” wrap up"). On hard limit (5 repeats): strip tool_calls entirely from the response, forcing a plain-text answer and ending the loop definitively.

Qwen Code β€” Truncation recovery with Levenshtein validation

When the model's response is truncated mid-tool-call, Qwen Code (coreToolScheduler.ts) injects TRUNCATION_PARAM_GUIDANCE ("your previous response was truncated due to max_tokens…") to ask the model to retry, and returns TRUNCATION_EDIT_REJECTION ("tool call has been rejected to prevent writing truncated content") for any edit tool where the output is incomplete.

The scheduler imports both diff and fast-levenshtein to verify that proposed file edits are not corrupted: if a diff-patch looks syntactically valid but the Levenshtein distance between the "before" and "after" is implausible, the edit is rejected. modifiable-tool.ts additionally lets tool calls be edited in-flight by the user before execution.

Hooks and lifecycle events

Tool hooks β€” code that runs before and after tool calls β€” let external systems observe or modify agent behavior without patching the core. Three agents in this set have them as first-class concepts:

Kimi CLI β€” Three-event hooks system

The hooks/events.py module defines three events per tool call:

  • pre_tool_use(tool_name, input) β€” runs before the tool; can modify or block the call
  • post_tool_use(tool_name, input, result) β€” runs after success; can observe output
  • post_tool_use_failure(tool_name, input, error) β€” runs on failure; allows custom error handling

This hooks API enables telemetry, authorization checks, result caching, and test mocking without touching tool implementation code.

Hermes β€” Security scan hooks on skill saves

Hermes does not have a general hooks API, but it has a specific security-scan hook on the skill learning path: before any SKILL.md file is written to ~/.hermes/skills/, the content is scanned for prompt injection patterns and invisible Unicode. Similarly, MEMORY.md and USER.md are scanned on every load.

This is a security-specific hooks pattern rather than a general lifecycle API.

Pi Mono β€” 20+ lifecycle events via extensions

Pi's extension system exposes 20+ events via the ExtensionAPI: agent_start, agent_end, tool_call, tool_result, beforeToolCall, afterToolCall, message_start/end, turn_start/end, session_start/compact/tree, model_select, before_provider_request, and more.

Extensions can block, modify, or augment any tool execution. Event handlers return promises that are awaited in order, enabling synchronous interception. Custom tools, providers, renderers, widgets, and status lines are all registered through the same event-driven API.

MCP, ACP, and bridge strategy

MCP leaders

Claude Code, Qwen Code, Neovate, Pochi, and DeerFlow all show serious MCP handling. The difference is emphasis:

  • Claude treats MCP as part of a larger integrated runtime.
  • Qwen and Neovate handle discovery, connection state, and health more explicitly.
  • Pochi blends MCP with vendor and agent ecosystem imports.
  • DeerFlow folds MCP into a composable extensions system.

ACP specialist

Kimi CLI is the clearest protocol-bridge project in this set, but OpenCode is the most runtime-shaped ACP implementation. Kimi maps internal edits and results into transportable blocks; OpenCode uses ACP for session lifecycle, permission requests, usage updates, and forked work.

Bidirectional MCP

Codex supports MCP in both directions: as a client via the codex-mcp crate (connecting to external MCP servers) and as a server via codex-mcp-server, exposed through the codex mcp-server CLI command. This lets Codex both consume external MCP tools and expose its own tools to other MCP clients.

⚠️

OpenHands is a qualified case

The local repo contains architecture, runtime, sandbox, and legacy tool/action patterns, but the project itself says the main V1 agentic core moved elsewhere. Any MCP or tool comparison for OpenHands should be read as a snapshot of the local repo, not the whole current product story.

Error handling and recovery patterns

Claude Code

Recovery is systemic: permissions, retries, tool-specific controls, command systems, and a dedicated runtime all participate. It feels designed around failure as a normal operating condition.

Neovate Code

Strong at pre-emptive failure avoidance. Its shell code tries hard to stop dangerous or malformed commands before they ever run, and its MCP manager tracks retries and connection state explicitly.

Qwen Code

Particularly good at configuration and state management. Model resolution has source precedence, runtime snapshots, and rollback-ish handling that makes the system easier to reason about.

DeerFlow

Uses middleware to absorb failure modes: clarification interrupts, dangling tool-call handling, summarization, and subagent limits. That is framework-style robustness rather than CLI-style robustness.

Crush

Calm, explicit Go-style error boundaries. Provider metadata caching and service organization make failure paths easier to trace than in sprawling dynamic runtimes.

Kimi CLI

More honest than flashy: the ACP docs document current limitations, which is often a sign of a healthy protocol mindset rather than a polished marketing wrapper.

Pi Mono

Error handling is multi-layered: tool-level errors return isError: true to the LLM for self-retry, abort signals trigger graceful cleanup (temp files closed, partial results returned), compaction failures auto-retry, and extension errors are caught and emitted via emitError() without crashing the agent. The file mutation queue prevents concurrent write races entirely.

OpenCode

The runtime combines permission queues, transport-specific MCP failures, diff-aware patch validation, session events, and format-on-write hooks. Errors are treated as state flowing through the bus rather than one-off exceptions buried inside CLI code.

Codex

Strict clippy configuration denies unwrap_used and expect_used across the workspace, forcing explicit error handling at compile time. Runtime recovery includes a 300s idle timeout and up to 5 stream retries before giving up.

What architecture tells you about product intent

The repos that feel the strongest are the ones whose architecture matches their promise. Claude Code promises an elite terminal coding partner, and the repo looks like a full custom runtime. Crush promises a serious CLI product, and the repo looks handcrafted for that. Qwen promises a configurable multi-provider CLI, and its model-resolution machinery backs that up. Hermes promises a self-improving multi-platform agent, and its skills system, RL infrastructure, and gateway adapters all back that up. Pi Mono promises a minimalist extensible kernel, and its 874-file refreshed codebase with differential TUI, tree sessions, and Pi Packages delivers exactly that. OpenCode promises a client-server coding runtime, and the repo matches that promise with a shared backend spanning TUI, web, desktop, MCP, ACP, and worktrees.

The weaker-feeling designs are not necessarily bad; they are often just trying to do a broader or less opinionated job. Pochi is ambitious and eclectic. DeerFlow is powerful but framework-heavy. OpenHands is split across repo boundaries. Kimi prioritizes bridge fidelity over terminal theatrics. Those are different goals, and the code reflects them.