Every team I review thinks they built a different system. The product names differ, the domains differ, the models differ. The failure modes are nearly identical.
I've now been inside enough production agent deployments -- through consulting and through building the multi-agent pipeline under Pipeshift (I'm the founder, so take that framing with appropriate skepticism) -- that I've stopped treating these failures as bad luck. They are predictable. They have structure. And the ecosystem is not solving most of them in any satisfying way.
Here are the five that keep showing up.
1. Tool schema drift -- the failure that takes three weeks to find you
The most insidious failure mode I've seen is tool schema drift: the tool's actual behavior diverges from the schema the agent was given at build time, and nothing tells you for weeks.
The specific pattern: an internal API gets a minor update -- a required field becomes optional, a response field gets renamed, a new enum value is added. The integration team updates the API. Nobody updates the tool schema fed to the agent. The agent keeps calling the tool. Most calls still succeed. A fraction produce subtly wrong outputs. The agent treats them as successes, because the response is structurally valid.
I saw this in a client deployment running an agentic workflow for customer account enrichment (anonymized, retail sector). The tool was calling an internal profile API. A deploy in week two added a new value to the response's status field, one the agent had never seen -- "pending_verification". The agent's downstream logic treated any non-null status as "active." For six weeks, a small but compounding percentage of enrichments ran on stale profile data without flagging an error. The issue surfaced when a downstream report showed unexpected category ratios -- not from the agent logs, not from any alert.
The mitigation I run now: schema validation on tool responses, not just tool inputs. Every tool call's output gets validated against the registered response schema before the agent processes it. Schema violations go to a dedicated alert channel, not to the agent's error handler. The agent error handler is for workflow errors; schema drift is an infrastructure signal and needs different routing. I also run a weekly schema diff job that compares live API responses against the registered tool schemas. It takes about two hours to set up and has caught three silent drifts in the past year.
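A minimal sketch of response-side validation, standard library only. The schema format, the field names, and the `alert` callback are all hypothetical stand-ins; a real deployment would more likely validate against JSON Schema and page an on-call channel. Raising `SchemaDriftError` to halt the run is one possible design choice, not the only one.

```python
from typing import Any, Callable

# Hypothetical registered response schema: field -> (type, allowed values or None).
PROFILE_RESPONSE_SCHEMA = {
    "account_id": (str, None),
    "status": (str, {"active", "inactive"}),
}


class SchemaDriftError(Exception):
    """Raised to the orchestrator, never handed to the agent's error handler."""


def find_violations(response: dict, schema: dict) -> list:
    """Return sorted violation messages; an empty list means the response conforms."""
    violations = []
    for name, (expected_type, allowed) in schema.items():
        if name not in response:
            violations.append(f"{name}: missing field")
        elif not isinstance(response[name], expected_type):
            violations.append(f"{name}: expected {expected_type.__name__}")
        elif allowed is not None and response[name] not in allowed:
            violations.append(f"{name}: unregistered value {response[name]!r}")
    # Fields the registered schema does not know about count as drift too.
    for name in set(response) - set(schema):
        violations.append(f"{name}: unregistered field")
    return sorted(violations)


def validated_tool_call(tool: Callable[..., dict], args: dict, schema: dict,
                        alert: Callable[[list], Any]) -> dict:
    """Call the tool, validate its output, and alert on drift before the agent sees it."""
    response = tool(**args)
    violations = find_violations(response, schema)
    if violations:
        alert(violations)  # dedicated infrastructure channel (assumed callback)
        raise SchemaDriftError("; ".join(violations))
    return response
```

With this schema, a response carrying `"status": "pending_verification"` is reported as an unregistered value instead of being read as "active". The same `find_violations` function could back a periodic diff job by running it over sampled live responses.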
What the ecosystem hasn't solved: no major agent framework makes response schema validation a first-class primitive. You can add it yourself -- and you should -- but it should be on by default, not something you bolt on after your first expensive incident.
2. Context window accumulation -- coherence cliff after turn 8
The second failure is less subtle but equally underestimated: context window accumulation killing coherence in multi-turn agent loops.
In a short session, the agent has the problem statement, a few tool results, and a clear working state. It is good at this. Run the same agent for 12 turns -- which is not unusual in a research or planning agent that calls multiple tools, synthesizes results, and generates follow-up queries -- and you are dealing with a different animal. The context is now full of intermediate reasoning, redundant tool outputs, error messages from earlier retries, and branching thoughts the model never cleaned up. The model is trying to reason over a document that looks like a work-in-progress scratch pad.
The coherence cliff is not gradual. It tends to be abrupt. Turns one through eight are fine. At turn nine, the agent starts repeating steps it already completed. At turn eleven, it contradicts a conclusion it reached at turn four. At turn thirteen, it invents a tool result it never actually retrieved.
I've seen this specific cliff in two consulting contexts -- once in an agentic due diligence summarizer (legal tech, 15-turn average session length), once in a CI/CD migration agent I built for Pipeshift's pipeline pattern matching. In both cases, token counts at the failure point were comfortably within the context window. The issue isn't running out of context -- it's context quality degradation. The model can attend to 128k tokens. Whether it attends to the right ones is a different question.
The mitigation that actually works: rolling summarization with explicit state extraction. After every N turns (I use N=4 as a default, client-tunable), a separate summarizer call distills the working state -- what has been decided, what has been retrieved, what is still open -- and replaces the raw turn history with that summary. The agent's effective context is always a clean summary plus the current turn, not an accumulating transcript.
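One way to sketch the rolling summarization loop, assuming the summarizer is an injected function (in practice a separate LLM call). `WorkingState`, `RollingContext`, and the rendering format are hypothetical names invented for illustration. Between summarization points this version carries the summary plus the raw turns since the last summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class WorkingState:
    decided: List[str] = field(default_factory=list)     # what has been decided
    retrieved: List[str] = field(default_factory=list)   # what has been retrieved
    still_open: List[str] = field(default_factory=list)  # what is still open


@dataclass
class RollingContext:
    # summarize(previous_state, raw_turns) -> new state; stands in for an LLM call.
    summarize: Callable[[WorkingState, List[str]], WorkingState]
    every_n_turns: int = 4
    state: WorkingState = field(default_factory=WorkingState)
    raw_turns: List[str] = field(default_factory=list)

    def add_turn(self, turn: str) -> None:
        """Record a turn; every N turns, replace raw history with a fresh summary."""
        self.raw_turns.append(turn)
        if len(self.raw_turns) >= self.every_n_turns:
            self.state = self.summarize(self.state, self.raw_turns)
            self.raw_turns = []

    def render(self) -> str:
        """Build the context sent to the agent: summary first, then recent raw turns."""
        lines = [
            "DECIDED: " + "; ".join(self.state.decided),
            "RETRIEVED: " + "; ".join(self.state.retrieved),
            "STILL OPEN: " + "; ".join(self.state.still_open),
        ]
        return "\n".join(lines + self.raw_turns)
```

Keeping the state as three explicit lists, rather than a free-text summary, makes it easier to check that a decision from turn four is still present at turn twelve.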
The catch: summarization adds latency and tokens. The tradeoff is worth it past about turn six. Before that, raw history is cheaper and cleaner.
What the ecosystem hasn't solved: LangGraph, AutoGen, and CrewAI all give you context management primitives, but they make you implement the rolling summarization logic yourself. There is no standard pattern, no benchmark for "when does your agent start losing coherence," and no tooling that automatically alerts you when context quality degrades. You find out the same way the legal tech client did -- a user notices the agent's conclusions no longer track its own prior steps.
3. Correct answer, wrong method -- side-effect blindness
This one is the hardest to explain to stakeholders who think "it got the right answer" means "it worked correctly."
The failure: the agent reaches the intended output through an unintended sequence of actions. The result is correct. The path to the result created side effects nobody accounted for.
A concrete example from a client (SaaS platform, anonymized): an agentic workflow was provisioning trial sandbox environments. The intended flow was: validate request, create environment via API, return credentials. The agent started routing some requests through a secondary "clone from template" API call that was available in its tool set but not in the intended path. The clone operation was faster, so the model had learned -- through whatever in-context optimization was happening -- that it produced better throughput. The credentials it returned were correct. What nobody noticed for three weeks was that the clone operation was incrementing a billing counter on the source template account. The correct-answer-wrong-method path was generating phantom charges.
This is a tool design problem as much as an agent problem. The agent had access to a tool it should not have been able to discover for this workflow. But it is also a fundamental property of agentic systems that are given latitude to optimize their own execution path -- they will find paths you did not intend, and the paths will sometimes have consequences you did not model.
The mitigation: tool scoping at the workflow level, not just at the agent level. Each workflow gets an explicit allowlist of tools it can invoke. Calls to tools outside that allowlist fail loudly, not silently. I also log the full tool call sequence -- not just the output -- for every workflow run, and run a diff of expected vs. actual call sequence as part of post-run validation. When the actual sequence diverges from the expected sequence, the run is flagged for human review even if the output is correct.
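A small sketch of workflow-level scoping and post-run sequence checking. The class and function names are hypothetical, and the tool names simply mirror the sandbox example above. Denied attempts are logged before the error is raised, so they show up in the sequence diff as well.

```python
from itertools import zip_longest
from typing import Callable, Dict, Iterable, List, Optional, Tuple


class ToolNotAllowedError(Exception):
    """A workflow tried to call a tool outside its allowlist."""


class ScopedToolbox:
    def __init__(self, tools: Dict[str, Callable], allowlist: Iterable[str]):
        self._tools = tools
        self._allowlist = frozenset(allowlist)
        self.call_log: List[str] = []  # full call sequence for this run

    def call(self, name: str, **kwargs):
        self.call_log.append(name)  # log the attempt even when it is denied
        if name not in self._allowlist:
            raise ToolNotAllowedError(f"{name} is not allowed in this workflow")
        return self._tools[name](**kwargs)


def sequence_mismatches(
    expected: List[str], actual: List[str]
) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """Return (position, expected, actual) for every step where the sequences differ."""
    return [
        (i, e, a)
        for i, (e, a) in enumerate(zip_longest(expected, actual))
        if e != a
    ]
```

A non-empty result from `sequence_mismatches` would be the trigger for human review, regardless of whether the final output looks correct.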
The uncomfortable framing: you should not trust a correct answer more just because an agent produced it. Verify the path, not just the destination.
4. Retry loops that escalate cost without escalating quality
Everyone who has built a production agent knows this one. The model fails at a step. The orchestrator retries. The model fails again. The orchestrator retries again. After five retries, one of three things has happened: the model succeeded on the same input with no changes (lucky), the model is still failing for the same structural reason it was failing at retry one (not lucky), or the retries have burned enough tokens and wall-clock time that the downstream SLA is already blown (expensive and unlucky).
The failure pattern I see: retry logic copied from HTTP clients, applied to LLM calls without modification. Exponential backoff on a transient 429 is correct. Exponential backoff on a model that is genuinely confused about the task does not help -- you are paying to be confused more slowly.
The cost dimension is real. One client (data platform, anonymized) had a reporting agent that could hit 40 retries in a failure cascade before the orchestrator gave up. At GPT-4o pricing and their average token load per run, each retry added approximately $0.18 to the operation cost. A 40-retry failure on a single reporting job cost $7.20 in tokens alone, produced no output, and left the user with a timeout error. This was happening on roughly 3% of runs. That's not a rare edge case at any meaningful volume.
The mitigation: tiered retry strategy. Retry one and two: same prompt, handle transient failures. Retry three: simplify the prompt, reduce context length, check if the failure is context-quality related. Retry four: escalate to a more capable model or route to human review. Retry five: fail with a structured error that includes the failure reason, not a generic timeout. Never retry more than five times without changing something material.
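The tiers above could be sketched roughly as follows. Everything here is an assumption made for illustration: `call(prompt, model)` is a hypothetical function that returns output or raises, `simplify` is a caller-supplied prompt reducer, and retry four is shown as model escalation only (routing to human review is left out). Backoff sleeps for transient errors are also omitted to keep the sketch short.

```python
from typing import Callable, List, Tuple


class StructuredFailure(Exception):
    """Final failure that carries the reason and the full attempt history."""

    def __init__(self, reason: str, attempts: List[Tuple[str, str]]):
        super().__init__(f"gave up after {len(attempts)} attempts: {reason}")
        self.reason = reason
        self.attempts = attempts


def run_with_tiered_retry(
    call: Callable[[str, str], str],
    prompt: str,
    simplify: Callable[[str], str],
    primary_model: str,
    stronger_model: str,
) -> str:
    simplified = simplify(prompt)
    plan = [
        (prompt, primary_model),       # initial attempt
        (prompt, primary_model),       # retry 1: same prompt, covers transient failures
        (prompt, primary_model),       # retry 2: same prompt
        (simplified, primary_model),   # retry 3: simplified prompt, less context
        (simplified, stronger_model),  # retry 4: escalate to a more capable model
    ]
    attempts: List[Tuple[str, str]] = []
    reason = "no attempts made"
    for attempt_prompt, model in plan:
        attempts.append((attempt_prompt, model))
        try:
            return call(attempt_prompt, model)
        except Exception as exc:  # broad on purpose in this sketch
            reason = f"{type(exc).__name__}: {exc}"
    # "Retry 5": no further call, fail with a structured error instead of a timeout.
    raise StructuredFailure(reason, attempts)
```

Because the plan is a plain list, the hard cap on attempts is visible in one place, and a 40-retry cascade is not possible without someone editing it.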
I also track the fraction of retries that produced a different outcome, as opposed to reproducing the same failure. If that fraction is below 0.3 -- meaning more than 70% of retries hit the same wall -- the retry logic is wasting money and the problem is structural. That signal should break the retry loop and route to a different resolution path.
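A sketch of that signal, read here as the fraction of retries whose outcome differs from the attempt before them. The outcome labels, the `min_retries` guard, and the choice to evaluate it within a single retry chain (rather than aggregated across runs) are assumptions made for illustration.

```python
from typing import List, Optional


def changed_outcome_fraction(outcomes: List[str]) -> Optional[float]:
    """Fraction of retries whose outcome differs from the previous attempt.

    `outcomes` holds one label per attempt, e.g. an error signature or "ok".
    Returns None when there have been no retries yet.
    """
    pairs = list(zip(outcomes, outcomes[1:]))  # (previous attempt, retry)
    if not pairs:
        return None
    changed = sum(1 for previous, current in pairs if current != previous)
    return changed / len(pairs)


def should_break_retry_loop(
    outcomes: List[str], threshold: float = 0.3, min_retries: int = 3
) -> bool:
    """True when enough retries have run and most of them hit the same wall."""
    retries = len(outcomes) - 1
    if retries < min_retries:
        return False  # too few samples to call the problem structural
    fraction = changed_outcome_fraction(outcomes)
    return fraction is not None and fraction < threshold
```

For example, five attempts that all end in the same error signature give a fraction of 0.0 and break the loop, while a chain that alternates between distinct errors does not.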
What the ecosystem hasn't solved: retry strategy for LLM calls is still mostly hand-rolled. There is no standard library with tiered LLM retry semantics, no built-in cost tracking per retry chain, and no consensus on when to escalate to a different model rather than retry the same one.
5. The confident wrong -- the model never says "I don't know"
The last failure mode is the most discussed and the least fixed.
Modern LLMs are trained to be helpful. Being helpful involves producing a response. "I don't know" is rarely rated as a helpful response in RLHF feedback. The result is a model that fills uncertainty with confident fabrication rather than expressing uncertainty accurately. This is well-documented -- Anthropic, DeepMind, and academic groups have all written about calibration failures. I'm not breaking new ground by naming it.
What I'm noting is how it manifests specifically in agentic contexts, where it is more destructive than in a single-turn chatbot.
In a chatbot, a confident hallucination gets corrected by the user in the next turn. In an agent loop, the confident hallucination becomes a fact in the context window. Subsequent reasoning steps are built on it. Tool calls are made based on it. By the time the hallucination surfaces -- if it surfaces -- the agent has taken three downstream actions predicated on something that was never true.
The specific failure I see most often: the model fabricates the result of a tool call it did not make. The orchestrator asks the model to retrieve X, the model decides (based on prior context) that it already knows X, and generates a plausible-looking tool result without calling the tool. The orchestrator receives a well-formed response. It has no mechanism to verify that the tool was actually called.
I've seen this in two production deployments. In one, the agent was supposed to query a financial data API for current exchange rates. Instead, it generated exchange rates from training data -- which were several months stale -- formatted as if they were API responses. The downstream calculations were wrong in ways that looked right. Nobody caught it for two weeks.
The mitigation I use: mandatory tool call verification. The orchestration layer, not the model, is responsible for confirming that tool calls happened. If the model's output includes a tool result but the orchestrator has no corresponding tool call record, the result is treated as fabricated and the agent is forced to actually execute the call. This requires an orchestration layer that is not just passing through model outputs -- it needs to maintain its own tool call ledger.
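A minimal sketch of an orchestrator-owned ledger, assuming the model's claimed tool result arrives as a dict with `tool`, `args`, and an optional `call_id`. That claim shape, and the names `ToolCallLedger` and `resolve_claim`, are hypothetical. The key property is that the trusted value always comes from the ledger or from a fresh execution, never from the model's restatement.

```python
import uuid
from typing import Any, Callable, Dict, Optional, Tuple


class ToolCallLedger:
    """Record of tool calls the orchestrator itself executed."""

    def __init__(self, tools: Dict[str, Callable[..., Any]]):
        self._tools = tools
        self._records: Dict[str, Tuple[str, Any]] = {}

    def execute(self, name: str, **kwargs) -> Tuple[str, Any]:
        """Run the tool and record the call under a fresh id."""
        result = self._tools[name](**kwargs)
        call_id = uuid.uuid4().hex
        self._records[call_id] = (name, result)
        return call_id, result

    def lookup(self, call_id: Optional[str], name: str) -> Optional[Tuple[str, Any]]:
        """Return the record only if the id exists and belongs to that tool."""
        record = self._records.get(call_id)
        if record is None or record[0] != name:
            return None
        return record


def resolve_claim(ledger: ToolCallLedger, claim: dict) -> Tuple[Any, bool]:
    """Return (trusted_result, was_fabricated) for a tool result the model reports."""
    record = ledger.lookup(claim.get("call_id"), claim["tool"])
    if record is not None:
        return record[1], False  # use the ledger's copy, not the model's text
    # No matching record: treat the claim as fabricated and force a real call.
    _, result = ledger.execute(claim["tool"], **claim.get("args", {}))
    return result, True
```

In the exchange-rate case, a claimed rate with no ledger entry would be discarded and replaced by the value from an actual API call, and the `was_fabricated` flag could feed an alert.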
The deeper problem: the model doesn't have a reliable mechanism to distinguish "I retrieved this from a tool" from "I constructed this from training." That distinction needs to come from the orchestration layer. The model cannot be trusted to self-report it.
What the ecosystem hasn't solved: most frameworks trust the model's self-report of what it did. A model that says "I called the search tool and found X" is treated as if it called the search tool and found X. The verification layer has to be external and architecturally enforced, not an optional add-on. I have not seen a major framework that makes this the default.
The pattern I keep coming back to
These five failures share a common thread: they are failures of the orchestration layer and the tool design, not failures of the model in isolation.
That matters because most agentic AI debugging effort goes into prompts and model selection. I've watched teams spend two weeks rewriting system prompts trying to fix a confident-wrong problem that was actually a missing tool call verification layer. The prompts were not the problem. The architecture was.
The model is probabilistic. It will produce surprising outputs on a long enough tail. Designing production systems as if the model is deterministic -- as if better prompting is sufficient to prevent all bad outcomes -- is the root cause behind most of what I've described here.
I might be wrong about how fixable some of this is at the framework level. AutoGen v0.4 and the latest LangGraph releases are adding more structured execution semantics. But until tool response schema validation, tiered retry with cost tracking, and mandatory tool call ledgering are defaults rather than configurations, these failure modes will keep compounding in the same predictable ways.
The teams that get through it reliably are not the ones with the best prompts. They are the ones who built their orchestration layer like infrastructure engineers, not like application developers.
The mitigation patterns here are adapted from consulting work with clients across retail, legal tech, and data platforms. All client details are anonymized. The tool call ledgering and schema validation patterns also appear in Pipeshift's internal pipeline architecture -- I am the founder, so read that reference accordingly.