At 2am on a Tuesday, one of my early agent systems was stuck in a loop -- tool call, failed parse, retry, failed parse, retry -- and the retries were not bounded. It had been running for forty minutes. Cost me around $14 in API tokens to learn that "fail loudly at the boundary" is not just a slogan. It is the specific design decision that determines whether you wake up to a $14 lesson or a $1,400 one.
I have been building agentic systems in various shapes since 2023 -- internal tooling, a billing reconciliation pipeline, pipeline analysis for Pipeshift (I am the founder, so take my Pipeshift references with appropriate skepticism), and several consulting engagements I cannot name specifically. The patterns below are what survived multiple production failures and a few rewrites. Some of it looks unusual until you have debugged a runaway agent. Then it looks obvious.
What I mean by "agentic system"
Before the patterns: I am not talking about chatbots with retrieval bolted on. I mean systems where an LLM chooses tools at runtime, where the execution path is not fully deterministic, and where the agent's actions have real-world side effects -- API calls, database writes, external notifications. That is the design space where architecture decisions actually matter.
If your system is "user sends message, retrieve context, generate answer," you do not need most of this. That is a retrieval-augmented generation pipeline and the design constraints are different. The patterns here are for systems where the agent is making decisions, not just lookups.
The orchestration layer: thin Python, no framework
My orchestration layer is approximately 200-300 lines of plain Python. No LangChain chains. No CrewAI. No AutoGen. I use LangGraph for graphs that have explicit state and conditional branching (the billing reconciliation pipeline I described in the LangGraph post uses it), but the outer orchestration -- the thing that decides whether to run the agent, what config to pass, where to route results, and what to do on failure -- is just Python.
The argument against frameworks at the orchestration layer is the same argument against any abstraction that hides control flow: when something breaks, you need to know exactly what is happening. Orchestration frameworks add indirection between "the agent did X" and "my code did X." That indirection costs you during debugging, which is when you can least afford extra complexity.
What my orchestration layer actually does:
- Loads agent config (tools, model, system prompt, token budget) from a typed dataclass, not a dict
- Runs a pre-flight check: are all tool endpoints reachable, is the API key valid, are required environment variables present
- Executes the agent loop with an explicit iteration cap
- Catches exceptions at the boundary and routes them to an error handler or escalation path
- Writes a structured execution record to persistent storage on every run, success or failure
from dataclasses import dataclass


@dataclass
class AgentConfig:
    agent_id: str
    model: str
    system_prompt: str
    tools: list[str]
    max_iterations: int = 10
    token_budget: int = 50_000
    escalation_threshold: int = 3  # consecutive failures before escalating
    max_retries: int = 2           # boundary-level retries for retryable tool errors (illustrative default)
    retry_count: int = 0           # incremented by the boundary on each retry


@dataclass
class ExecutionRecord:
    run_id: str
    agent_id: str
    started_at: str
    finished_at: str | None
    status: str  # "success" | "failure" | "escalated" | "budget_exceeded"
    iterations: int
    tokens_used: int
    error: str | None
    escalated: bool = False
The ExecutionRecord is not optional. Every run writes one, regardless of outcome. If your orchestration layer does not produce a record for every run, you will have gaps in your debugging history at exactly the moments you need it most -- the failed runs.
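A minimal sketch of that guarantee, assuming a simplified record and an in-memory list standing in for persistent storage. The names `RunRecord`, `preflight`, and `run_with_record` are hypothetical, and the pre-flight check here only covers environment variables. The point is the `finally` block: the record is written on every path out of the run.

```python
import os
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class RunRecord:
    # Simplified stand-in for the ExecutionRecord above (hypothetical).
    run_id: str
    agent_id: str
    started_at: str
    finished_at: Optional[str] = None
    status: str = "failure"  # pessimistic default until the run succeeds
    error: Optional[str] = None


def preflight(required_env: list[str]) -> list[str]:
    """Return the names of required environment variables that are missing."""
    return [name for name in required_env if not os.environ.get(name)]


def run_with_record(
    agent_id: str,
    agent_fn: Callable[[], None],
    required_env: list[str],
    sink: list,
) -> RunRecord:
    """Run agent_fn and append exactly one record to sink, whatever happens."""
    record = RunRecord(
        run_id=str(uuid.uuid4()),
        agent_id=agent_id,
        started_at=datetime.now(timezone.utc).isoformat(),
    )
    try:
        missing = preflight(required_env)
        if missing:
            raise RuntimeError(f"pre-flight failed, missing env vars: {missing}")
        agent_fn()
        record.status = "success"
    except Exception as exc:
        record.error = f"{type(exc).__name__}: {exc}"
    finally:
        # Runs on success and on failure, so no run goes unrecorded.
        record.finished_at = datetime.now(timezone.utc).isoformat()
        sink.append(record)
    return record
```

This sketch returns the record instead of re-raising, which keeps it easy to test. A real boundary would route the failure to an error handler or escalation path after the record is written.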
Tool schema design: the contract is the interface
Tool schemas are the most underspecified part of most agentic systems I have consulted on. People write a function, wrap it in a decorator, and assume the LLM will figure out the right calling convention. Sometimes it does. When it does not, the failure is usually a malformed call that silently returns a bad result rather than an error.
My rules for tool schemas:
Rule 1: Every parameter has a description that includes what happens when it is wrong. Not just "the user ID" but "the UUID of the user -- if this is missing or invalid, the tool returns a 404 error object rather than raising an exception."
Rule 2: The return type is a typed structure, not a string. If the tool returns a string, the LLM has to parse natural language to extract information. If it returns a typed dict or a Pydantic model, the downstream processing is deterministic.
Rule 3: Error returns are explicit in the schema, not exceptions. The tool returns a success object or an error object. The calling convention is consistent either way.
from pydantic import BaseModel
from typing import Literal


class ToolSuccess(BaseModel):
    status: Literal["success"]
    data: dict


class ToolError(BaseModel):
    status: Literal["error"]
    code: str        # machine-readable, e.g. "NOT_FOUND", "RATE_LIMITED"
    message: str     # human-readable
    retryable: bool  # whether the agent should retry this call


ToolResult = ToolSuccess | ToolError
The retryable field is the one that saves the most debugging time. When a rate limit returns retryable=True and a schema validation error returns retryable=False, the calling code can decide whether to retry by reading a boolean, without that logic living in a prompt. It belongs in the tool contract, not in English instructions to the model.
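To make the contract concrete, here is a minimal sketch of a tool that converts upstream failures into result values and a deterministic retry check that reads the flag. It uses plain dataclasses as stand-ins for the Pydantic models so it runs with the standard library alone. `Success`, `Failure`, `RateLimited`, the `fetch` callable, and `should_retry` are all hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Success:
    data: dict
    status: str = "success"


@dataclass
class Failure:
    code: str
    message: str
    retryable: bool
    status: str = "error"


Result = Union[Success, Failure]


class RateLimited(Exception):
    """Hypothetical exception raised by an upstream client."""


def get_user_profile(user_id: str, fetch: Callable[[str], dict]) -> Result:
    """Wrap a fetch callable so failures come back as values, not exceptions."""
    if not user_id:
        return Failure("INVALID_ARGUMENT", "user_id is empty", retryable=False)
    try:
        return Success(data=fetch(user_id))
    except RateLimited as exc:
        return Failure("RATE_LIMITED", str(exc), retryable=True)
    except KeyError:
        return Failure("NOT_FOUND", f"no user {user_id}", retryable=False)


def should_retry(result: Result, attempt: int, max_attempts: int) -> bool:
    """Deterministic retry decision: read the flag, do not ask the model."""
    return isinstance(result, Failure) and result.retryable and attempt < max_attempts
```

With this shape, a rate-limited call is retried up to `max_attempts` and a missing user is never retried, and neither decision depends on prompt wording.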
Rule 4: Tool names are verbs, not nouns. get_user_profile not user_profile. search_documents not document_search. This is minor but it matters for the LLM's function selection reasoning. Noun-named tools produce more ambiguous selection behavior in my experience -- I cannot cite a paper on this, it is an observation from debugging tool selection logs.
Rule 5: Destructive tools have a dry_run parameter. Always. If the tool can delete, overwrite, or send a message to someone, dry_run=True should return what would happen without doing it. This is how I implement the human escalation path for high-stakes actions.
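A small sketch that combines Rule 1 and Rule 5, under stated assumptions: the tool name `delete_drafts`, the dict-backed `store`, and the JSON-Schema-style layout of `DELETE_DRAFTS_SCHEMA` are illustrative, and defaulting `dry_run` to true is a choice made for this example rather than something the rules above require.

```python
# Parameter descriptions say what happens on bad input (Rule 1).
DELETE_DRAFTS_SCHEMA = {
    "name": "delete_drafts",
    "description": "Delete draft documents owned by a user.",
    "parameters": {
        "type": "object",
        "properties": {
            "owner_id": {
                "type": "string",
                "description": (
                    "ID of the owner. If unknown, the tool returns an error "
                    "result with code NOT_FOUND and deletes nothing."
                ),
            },
            "dry_run": {
                "type": "boolean",
                "description": (
                    "If true, report which drafts would be deleted without "
                    "deleting them. Defaults to true in this sketch."
                ),
            },
        },
        "required": ["owner_id"],
    },
}


def delete_drafts(store: dict, owner_id: str, dry_run: bool = True) -> dict:
    """Destructive tool with a dry-run mode (Rule 5); errors are values (Rule 3)."""
    if owner_id not in store:
        return {
            "status": "error",
            "code": "NOT_FOUND",
            "message": f"no owner {owner_id}",
            "retryable": False,
        }
    targets = list(store[owner_id])
    if not dry_run:
        store[owner_id] = []  # the only line with a side effect
    return {"status": "success", "data": {"deleted": targets, "dry_run": dry_run}}
```

The dry-run result has the same shape as the real one, so a human reviewing an escalation sees exactly what the live call would report.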
State management: what persists, what doesn't
The mistake I made on my first agent system: persisting everything. Full conversation history, all intermediate tool call results, every model response. This produced state stores that grew unboundedly and made replay and debugging nearly impossible because you could not reconstruct what the agent actually needed at any given point.
My current model: state has three tiers.
Tier 1: Ephemeral (in-memory, lives for one run). The current iteration's context window -- system prompt, conversation history for this run, tool call results from this turn. This is what the model sees. It does not persist anywhere. When the run ends, it is gone.
Tier 2: Run-scoped (persists for the lifetime of a single task). The execution record, any intermediate artifacts the agent produces that downstream steps need, the agent's working memory for a multi-step task. This persists to a scratch table or object storage with a TTL -- I typically use 7 days for debugging purposes, then purge. LangGraph checkpoint state lives here when I am using LangGraph.
Tier 3: Domain-persistent (permanent). The results of the agent's completed actions: the final output, any records created or updated, the audit trail of what the agent did and when. This lives in your actual application database. This is the only state tier that the application trusts for anything downstream.
The most important invariant: tier 1 state is never reconstructed from tier 3. The domain database is not a replay source for agent context. Reading domain data once, explicitly, at the start of a run is fine. If you find yourself querying your application database mid-run or on retry to rebuild an agent's context window, you are mixing the tiers in a way that creates data consistency problems that are very hard to debug.
class AgentStateManager:
    def __init__(self, run_id: str, db, scratch):
        self.run_id = run_id
        self._db = db            # tier 3: domain-persistent
        self._scratch = scratch  # tier 2: run-scoped

    def write_artifact(self, key: str, value: dict) -> None:
        """Tier 2 only. Artifacts for this run, not for downstream application logic."""
        self._scratch.set(f"{self.run_id}:{key}", value, ttl_days=7)

    def commit_result(self, result: AgentResult) -> None:
        """Tier 3. Only called on successful task completion."""
        self._db.insert_agent_result(self.run_id, result)

    def read_domain_context(self, context_key: str) -> dict:
        """
        Explicit method for fetching domain data.
        Calling this in a loop or inside a retry path is a bug.
        """
        return self._db.get_context(context_key)
The separation looks verbose. It has saved me from three separate bugs where agent retry logic was re-fetching domain state mid-run and using a stale snapshot that had changed since the run started.
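One possible way to turn the "calling this in a retry path is a bug" docstring into an enforced rule is to snapshot domain context once per run and reject a second fetch. This is a sketch, not the state manager above: `DomainSnapshot` and its `fetch` callable are hypothetical.

```python
class DomainSnapshot:
    """Fetches domain context once per run and refuses a second fetch."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical callable: key -> dict
        self._cache = {}

    def load(self, key: str) -> dict:
        """Call at run start. A repeat load of the same key is treated as a bug."""
        if key in self._cache:
            raise RuntimeError(f"domain context {key!r} already loaded for this run")
        # Shallow copy so later top-level changes to the source dict do not leak in.
        self._cache[key] = dict(self._fetch(key))
        return self._cache[key]

    def get(self, key: str) -> dict:
        """Safe in loops and retry paths: reads the snapshot, not the database."""
        return self._cache[key]
```

Retry code then calls `get`, which cannot observe a change made to the domain database after the run started.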
Error handling: fail loudly at the boundary
This is the decision that most changes how agentic systems behave under failure. The naive implementation catches exceptions inside the agent loop and lets the model decide what to do with them. The problem: the model's error recovery is only as good as your system prompt's instructions, and those instructions were written when the system was working correctly.
My rule: the agent loop does not handle errors. Errors surface to the boundary.
The boundary is the orchestration layer -- the code outside the agent loop that called it. That code knows what the overall task is, what the failure tolerance is, whether a human is reachable, and what to do if the task cannot complete. The agent does not know any of that. The model should not be deciding whether to retry an API call or escalate to a human. That is control flow, and control flow belongs in deterministic code.
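import dataclasses


class ToolCallFailed(Exception):
    """
    Hypothetical wrapper raised by the agent loop when a tool returns a ToolError.
    ToolError is a Pydantic model, not an exception, so it cannot be caught directly.
    """

    def __init__(self, error: ToolError):
        super().__init__(f"{error.code}: {error.message}")
        self.error = error


def run_agent(config: AgentConfig, task: str, state_mgr: AgentStateManager) -> AgentResult:
    """
    The boundary. Errors surface here, not inside the loop.
    Assumes AgentConfig carries retry_count (starting at 0) and max_retries fields.
    """
    try:
        result = _agent_loop(config, task, state_mgr)
        state_mgr.commit_result(result)
        return result
    except ToolCallFailed as e:
        if e.error.retryable and config.retry_count < config.max_retries:
            return run_agent(
                dataclasses.replace(config, retry_count=config.retry_count + 1),
                task,
                state_mgr,
            )
        raise AgentBoundaryError(f"Tool failure after {config.retry_count} retries: {e}") from e
    except TokenBudgetExceeded:
        # Records the budget, which is a lower bound on actual usage at this point.
        state_mgr.write_artifact("budget_exceeded", {"task": task, "tokens_used": config.token_budget})
        raise AgentBoundaryError("Token budget exceeded -- task needs decomposition or budget increase")
    except IterationCapExceeded:
        raise AgentBoundaryError(f"Agent exceeded {config.max_iterations} iterations without resolving task")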
What lives inside the agent loop: tool calls, context management, response parsing. What does not: retry logic, error routing, cost accounting. Those live at the boundary.
The consequence: when an agent fails, the stack trace terminates at a predictable location. You know exactly which boundary call failed, with what error, on which run ID. Compare that to an agent that catches its own errors and retries internally -- the failure surface is anywhere inside the loop, and the stack trace tells you less than the model's latest attempt to recover.
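For the other side of the boundary, here is a minimal sketch of a loop that enforces the iteration cap and token budget by raising, and does nothing else about them. The `step` callable is a hypothetical stand-in for one model turn plus its tool calls, and the two exception classes are defined locally so the example runs on its own.

```python
class IterationCapExceeded(Exception):
    pass


class TokenBudgetExceeded(Exception):
    pass


def agent_loop(step, max_iterations: int, token_budget: int):
    """
    step() is a hypothetical callable returning (done, result, tokens_spent).
    The loop counts and raises. It does not catch, retry, or decide what happens next.
    """
    tokens_used = 0
    for iteration in range(1, max_iterations + 1):
        done, result, spent = step()
        tokens_used += spent
        if tokens_used > token_budget:
            raise TokenBudgetExceeded(f"{tokens_used} tokens used, budget {token_budget}")
        if done:
            return result, iteration, tokens_used
    raise IterationCapExceeded(f"no result after {max_iterations} iterations")
```

Because the `for` loop is bounded by `max_iterations`, a confused model can burn at most that many turns before the boundary sees an exception.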
The human escalation path
Every agentic system I build has an explicit escalation path to a human. Not a fallback model. A human.
This sounds obvious. In practice, most systems I have reviewed do not have it. They have retry logic, fallback prompts, and graceful degradation to a canned response. Those are fine mechanisms, but they are not an escalation path. An escalation path means: a human receives a structured notification that the agent could not complete a task, with enough context to complete it manually or make a decision about how to proceed.
My implementation:
@dataclass
class EscalationEvent:
    run_id: str
    agent_id: str
    task_summary: str
    failure_reason: str
    last_tool_call: dict | None
    iterations_completed: int
    tokens_used: int
    suggested_action: str  # what the agent would have done next


def escalate(event: EscalationEvent, channel: EscalationChannel) -> None:
    """
    Writes to the escalation queue. Does not raise. Does not retry.
    The orchestration layer calls this, not the agent loop.
    """
    channel.send(event)
    # log is assumed to be a structlog-style logger that accepts keyword fields.
    log.warning("escalated", run_id=event.run_id, reason=event.failure_reason)
The suggested_action field is the one that makes human escalations actionable rather than just alerting. When the agent escalates, it tells the human what it was about to do. The human can approve that action, modify it, or decide the whole task needs different handling. Without that field, escalation events are noise. With it, they are a handoff.
The escalation threshold in my AgentConfig is 3 consecutive failures. After 3 consecutive failures on the same tool, the agent does not get a fourth try -- the orchestration layer escalates. That number is tunable: I have used 2 for high-stakes operations (anything with write access to production systems) and 5 for lower-stakes background jobs.
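The threshold check itself is a few lines of deterministic code. A minimal sketch, assuming failures are counted per tool name and a success resets the streak. `FailureTracker` is a hypothetical helper, not part of the AgentConfig above.

```python
from collections import defaultdict


class FailureTracker:
    """Counts consecutive failures per tool and signals when to escalate."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._streaks = defaultdict(int)

    def record(self, tool_name: str, ok: bool) -> bool:
        """Record one call outcome. Returns True when the caller should escalate."""
        if ok:
            self._streaks[tool_name] = 0  # any success resets the streak
            return False
        self._streaks[tool_name] += 1
        return self._streaks[tool_name] >= self.threshold
```

The orchestration layer would call `record` after each tool result and build an EscalationEvent when it returns True.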
What I'd do differently vs my first agent system
My first agentic system was built in 2023, before LangGraph existed in its current form, using a ReAct loop implemented over the raw OpenAI function calling API. It worked for the demo. It did not work for anything close to production.
What I got wrong:
No iteration cap. The loop ran until the model decided to stop. Under normal conditions this was fine. Under any input that confused the model -- ambiguous task specification, a tool returning unexpected data, a prompt that set up a near-impossible sub-goal -- the loop would run to the API timeout. The fix is trivially simple (an integer counter) but I did not think about it until I was debugging the $14 Tuesday incident I mentioned at the top.
Tool errors raised Python exceptions inside the loop. When a tool failed, the exception propagated up through the agent loop and terminated the run without a structured record of what failed and why. I was reading raw tracebacks to understand agent failures. That is no way to operate a system.
State was just the conversation history list. No tiers, no typed schema, no separation between ephemeral context and persisted results. The agent's context window was the only record of what it had done. On retry, I had to reconstruct context manually.
No escalation path. When the agent failed, it failed silently into a log file. The user got a generic error message. I found out about failures by checking the logs, which means I was finding out after the fact.
System prompt was doing too much. The system prompt contained the full tool documentation, retry instructions, error handling guidance, task decomposition strategy, and output format requirements. It was over 2,000 tokens. Prompts that long are difficult to maintain, and mixing behavioral instructions with operational instructions (what to do on failure) means the operational logic lives in natural language rather than in code, where it belongs.
I am not embarrassed about that first system. Everyone's first agent system looks roughly like that in retrospect. The point is that each of those failures has a specific technical fix, and those fixes are what I have described in this post.
On framework choice
I get asked regularly whether to use LangGraph, CrewAI, AutoGen, or something else. My answer is usually the same: pick the simplest thing that handles your graph complexity, and put the operational logic -- iteration caps, error routing, state management, escalation -- outside the framework.
Frameworks handle graph execution and state flow well. They are not designed to be your error handling strategy or your escalation system. The teams I see struggling most with agentic systems in production are the ones who used the framework as the entire architecture rather than as the graph execution layer within a larger architecture.
For simple linear chains with no branching, I do not use any framework -- plain Python. For graphs with conditional branching and explicit state that needs to checkpoint, I use LangGraph. For multi-agent coordination with a planner and specialized sub-agents, I have used a thin orchestration layer that instantiates LangGraph subgraphs and routes between them -- I have not found an existing framework that handles this well enough to not roll my own.
I might be wrong about the thin-orchestration approach for very complex multi-agent topologies. I have not built anything with more than 5-6 specialized agents running concurrently. People building systems with 20+ agents may find that a richer orchestration framework pays off in ways my workload does not expose.
The patterns described here inform how I approach agentic system design at Pipeshift, which uses a multi-agent pipeline for CI/CD migration analysis. I am the founder -- any Pipeshift reference is not neutral. The error handling and state management patterns specifically come from debugging the billing reconciliation pipeline described in the LangGraph production patterns post.