Why agents fail and how to debug them
If you are building agents seriously, you should at least know how to diagnose them.
People love to talk about agent failures as if they are some mysterious emergent property. In production, most failures are far more ordinary. Agents break because prompts drift from tooling, models behave differently than expected, tools hide failures, state goes stale, or the runtime keeps looping without making progress.
1. Ask the agent where it got stuck
One of the simplest ways to debug an agent is to ask it what difficulties it faced. In many cases, it will point directly at the thing that caused retries, delays, or wasted tool calls.
We saw this in Khelio. Our agent had to generate a custom DSL for artifacts. The harness generated IDs for every node in the DSL using the format type-integer. The model kept generating or referring to them as type_integer.
That tiny mismatch, hyphen versus underscore, was enough to waste multiple tool-calling loops. The result was higher token cost, more latency, and unnecessary round trips.
The fix was simple. We reinforced this exact edge case in the system prompt, and the problem largely disappeared.
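Alongside the prompt fix, this kind of drift can also be caught in the harness itself. The sketch below is a hypothetical guard (the `normalize_node_id` name and the `type-integer` pattern are assumptions based on the example above) that repairs the underscore slip before it costs a tool-calling loop:

```python
import re

# Assumed ID format from the harness: "type-integer", e.g. "text-3".
NODE_ID = re.compile(r"^[a-z]+-\d+$")

def normalize_node_id(raw: str) -> str:
    """Repair the common underscore slip before it wastes a tool-call loop."""
    candidate = raw.replace("_", "-")
    if not NODE_ID.match(candidate):
        # Unknown shape: fail loudly so the error reaches the model.
        raise ValueError(f"unrecognized node id: {raw!r}")
    return candidate

print(normalize_node_id("text_3"))   # -> text-3
print(normalize_node_id("image-7"))  # -> image-7
```

Normalizing in the harness does not replace the prompt fix; it just makes the system tolerant when the model slips anyway.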
Sometimes the issue is not reasoning quality. Sometimes the issue is syntax bias.
2. Prompt, tooling, and harness drift apart
A common source of agent failure is drift between the prompt, the available tools, and the harness behavior.
This gets worse when different teams own different parts of the stack. One team updates tools, another changes runtime behavior, and someone else tweaks prompts. Over time, the contract between them starts to break.
The agent then operates on outdated assumptions. It thinks tools behave one way while the runtime behaves another way. That mismatch creates avoidable failures.
If you care about reliability, you need a tight contract between:
- system prompt
- tool definitions
- runtime behavior
- error semantics
- eval coverage
If those evolve independently, the agent degrades even when the model improves.
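One cheap way to keep that contract honest is a CI check that compares the exposed tool set against the system prompt. This is a minimal sketch with hypothetical names (`contract_violations`, the example tools) and a deliberately naive substring check; a real version would parse tool descriptions properly:

```python
def contract_violations(system_prompt: str, tools: dict[str, str]) -> list[str]:
    """Flag tools the runtime exposes that the prompt never describes."""
    issues = []
    for name in tools:
        if name not in system_prompt:
            issues.append(
                f"exposed tool {name!r} is never described in the prompt"
            )
    return issues

prompt = "You can call create_artifact and update_artifact."
tools = {"create_artifact": "...", "delete_artifact": "..."}
print(contract_violations(prompt, tools))
# -> ["exposed tool 'delete_artifact' is never described in the prompt"]
```

Run a check like this whenever any team changes tools, prompts, or runtime behavior, so drift fails a build instead of degrading the agent silently.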
3. Model swaps are not free
A major mistake teams make is assuming the same system prompt will work across models.
That assumption gets weaker every month.
Different models respond differently to instructions, retries, tool schemas, and error messages. In my experience, OpenAI models tend to follow prompts more closely than the Claude family. The point is not that one is universally better. The point is that they are not interchangeable.
If you swap models without adapting prompts and harness behavior, you are introducing silent regressions.
The harness should absorb this complexity. A practical strategy is to maintain model-specific system prompts and routing logic backed by evals. That alone can improve reliability substantially.
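A minimal sketch of that routing, assuming a per-family prompt registry (the `MODEL_PROMPTS` table and prompt texts here are placeholders; real systems would load versioned prompt files, each backed by its own eval suite):

```python
# Hypothetical per-family prompt registry.
MODEL_PROMPTS = {
    "gpt": "You are a precise assistant. Follow tool schemas exactly.",
    "claude": "You are a careful assistant. Restate the plan before editing.",
}

def system_prompt_for(model_name: str) -> str:
    """Pick the prompt tuned for this model family."""
    for family, prompt in MODEL_PROMPTS.items():
        if family in model_name.lower():
            return prompt
    # Fail loudly rather than silently reusing a prompt tuned for another model.
    raise KeyError(f"no system prompt registered for {model_name!r}")
```

The important design choice is the final `raise`: an unrecognized model should break loudly at routing time, not ship with a mismatched prompt.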
4. Bad tool errors poison the whole loop
Tools need to be self-descriptive when they fail.
Silently failing tools are one of the biggest reasons agents go off the rails. If a tool fails without returning a clear, structured error, the model may assume the step succeeded. Later, when it reads the world state again, it finds a mismatch between what it believes happened and what actually happened.
That is when agents start replaying steps, duplicating actions, or getting trapped in recovery loops.
This gets worse when coding agents are used to write harness code. They often generate defensive wrappers, broad exception handling, fallback paths, or partial retries that accidentally swallow tool errors. The code looks robust on the surface, but what it really does is eat the failure and return an incomplete or misleading success path.
That kills the feedback channel.
Now the model does not see the real failure. It sees a distorted version of reality produced by the harness. Once that happens, every downstream step becomes less reliable.
A silently handled tool failure is corruption of the agent’s feedback channel.
This is why tool and harness code should be tested aggressively. Not just for happy paths, but for failure visibility. A failed tool call should fail loudly, in a structured way, with enough context for the model to recover intelligently.
5. Agents get stuck in loops without reaching a terminal state
Sometimes the agent just gets confused.
It starts calling the same tools again and again without making real progress toward a goal. That wastes tokens, adds latency, and creates a bad user experience.
This is where deterministic runtime policy matters.
Do not leave loop management entirely to the model. The harness should periodically inspect execution state and detect patterns like:
- the same tool being called repeatedly with near-identical inputs
- no meaningful state change after multiple steps
- repeated failure on the same precondition
- oscillation between a small set of actions
Once you detect this, the runtime can intervene. You can inject more context, enforce a different policy, or have another model inspect the turn and steer the main agent.
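The first pattern on that list is the easiest to detect deterministically. A minimal sketch, assuming the harness sees every tool call (the `LoopGuard` name, window size, and threshold are illustrative choices):

```python
from collections import deque

class LoopGuard:
    """Deterministic runtime policy: flag the agent when the same tool
    is called repeatedly with identical inputs within a recent window."""

    def __init__(self, window: int = 6, threshold: int = 3):
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def record(self, tool: str, args: dict) -> bool:
        """Record a tool call; return True if the runtime should intervene."""
        signature = (tool, tuple(sorted(args.items())))
        self.recent.append(signature)
        return self.recent.count(signature) >= self.threshold

guard = LoopGuard()
for _ in range(3):
    stuck = guard.record("search", {"query": "pricing page"})
print(stuck)  # -> True: third identical call trips the guard
```

Near-identical inputs would need fuzzier matching (for example, normalized or embedded arguments), but even this exact-match version catches a surprising share of real loops.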
6. Some models are reinforced for specific tool-use patterns
Tool use is not neutral across models.
Some models are implicitly or explicitly reinforced toward certain editing and interaction styles. Claude Code, for example, tends to work well with exact string-replacement style edits, while Codex-style systems are often more naturally aligned with patch-based workflows.
If your harness expects one style but the model is biased toward another, performance drops. The model may have enough reasoning ability, but it is fighting the interface you gave it.
You can catch this by asking the agent where it struggled, inspecting failed attempts, and looking for mismatch between your expected tool protocol and the model’s learned defaults.
7. Conversation history drifts from external state
Another serious failure mode shows up when the agent’s internal memory no longer matches the actual state of the world.
Suppose the agent updates some state. Then the user goes into the UI, changes content manually, uploads new images, or edits the artifact directly. Later, they come back and ask the agent to continue. In many systems, the agent has no awareness that anything changed outside its own short-term memory.
We saw this in Khelio. The agent would create an artifact, then the user would modify the artifact manually, such as changing content or uploading their own images. When they returned and asked the agent to make more changes, the agent assumed its internal state still matched external reality and often overwrote the user’s edits.
This becomes much more common when your application has multiple write paths. The more ways state can change outside the agent loop, the more likely the agent is to operate on stale assumptions.
Coding agents handle part of this problem well. They track when a file was last read by the harness and when it was last updated on the filesystem. If last_read_timestamp < last_updated_timestamp, the runtime forces the model to re-read the file before editing it.
The same pattern should be applied broadly. If an agent is about to write to state, it should verify that the state it read is still current.
A simple API that exposes recent state changes or edit history can also help the agent regain context before taking action.
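The file-timestamp rule generalizes to any versioned store. A sketch of a write guard under assumed names (`safe_write`, `StaleStateError`, and a simple version-counter scheme):

```python
class StaleStateError(RuntimeError):
    pass

def safe_write(store: dict, key: str, new_value, last_read_version: int) -> None:
    """Refuse to overwrite state that changed after the agent last read it,
    mirroring the last_read < last_updated rule coding agents use for files."""
    current = store[key]
    if current["version"] != last_read_version:
        # Force a re-read instead of clobbering the user's manual edits.
        raise StaleStateError(
            f"{key} changed since last read "
            f"(v{last_read_version} -> v{current['version']}); re-read first"
        )
    store[key] = {"version": current["version"] + 1, "value": new_value}

store = {"artifact": {"version": 1, "value": "agent draft"}}
# The user edits the artifact in the UI, bumping it to version 2.
store["artifact"] = {"version": 2, "value": "user edit"}

try:
    safe_write(store, "artifact", "agent edit", last_read_version=1)
except StaleStateError as e:
    print(e)  # write rejected: the agent must re-read before editing
```

This is just optimistic concurrency control applied to the agent loop: the version the agent read travels with the write, and a mismatch turns a silent overwrite into an explicit, recoverable error.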
The principle is straightforward: conversation history is not ground truth. External state is.
What agent debugging actually means
Agent debugging is not just prompt tweaking.
It is debugging the full control loop:
- prompt quality
- tool design
- harness behavior
- error visibility
- state freshness
- model-tool fit
- eval coverage
When an agent fails, the useful question is: where did the contract break between reasoning, action, feedback, and state?
That is where most failures come from.
Practical takeaway
If you want agents to work reliably in production:
- ask the agent where it struggled
- treat prompt, tools, and harness as a strict contract
- adapt prompts to the model instead of assuming models are interchangeable
- make failures explicit and structured
- detect and interrupt unrecoverable loops
- align tool interfaces with model tendencies
- verify freshness before writing to external state
- invest in evals so regressions are visible
Have a similar problem to solve?
Use the contact page if you want help with agent architecture, evaluation, or implementation.