The Hidden Reason Most AI Projects Fail: Incorrect Caching
Why prompt caching is critical to your enterprise AI adoption
I got a call from a friend recently. We had not spoken for quite some time. He had reached out after one of my LinkedIn posts asking people to share their experience with AI agents and agentic development. We spoke for a while and caught up on how our careers have been progressing. He works as a technology leader at a company, and his mandate is to spend 50% of his time leading AI initiatives and implementations, with the other 50% on the bread-and-butter platform. I was happy to hear that the company is keen to invest in AI tooling and agentic processes.
He mentioned some challenges they were facing, and I got curious, since the company is in finance, a highly regulated industry. One of the projects he was leading (not a coding agent) was turning out to be very complex. On further enquiry, it emerged that his harness had become very complex. Now, that is natural, since a lot of real complexity comes from business logic and rules. But that was not the case here. The complexity was introduced by adding a rules-based engine alongside an agentic loop. The combination itself made sense; harnesses typically include both deterministic workflows and agentic loops. The obvious next question: if the rule engine was getting complex, it clearly needed intelligence, so had they tried a pure agentic loop? Yes, they had, but the cost of running the agent was exploding and the agent was slow to answer queries, hence the rules-based engine. Next question: which model? They were using Claude models. My spidey senses started tingling, because I had been there and struggled with exactly this in early 2025. They had not implemented prompt caching. My immediate suggestion was to implement prompt caching and retry the pure agentic loop. I am still waiting for him to share whether it had any impact.
But let that sink in. This is a common pitfall with Anthropic. Unlike most model providers, such as OpenAI and Google, which provide automatic or implicit prompt caching, Anthropic requires you to opt in to prompt caching explicitly.
You see, the input tokens of an AI agent keep increasing with every tool call it makes. Prompt caching reduces the input token price by 90% for tokens the model has already seen.
For example, assume each loop adds an average of 500 new tokens to the prompt history. By the 100th loop, the model is no longer processing just the latest 500 tokens; it is processing the entire accumulated context from all prior loops.
That means the 100th request carries roughly:
500 × 99 = 49,500 tokens
At first glance, this may not look dangerous. If input tokens cost $3 per 1M, 49,500 tokens seems cheap in isolation. That is exactly where teams get fooled.
The cost is not linear in the way they assume. You are not paying only for the latest 500 tokens each turn; you are repeatedly paying for the growing full context on every loop. So the real cost is the cumulative total:
500 × (1 + 2 + 3 + ... + 99)
which is:
500 × 4,950 = 2,475,000 tokens
That is 2.475 million input tokens consumed across just 100 loops, before accounting for output tokens, tool calls, retries, or extra system context.
At $3 per 1M tokens, that is already about:
2.475 × $3 ≈ $7.43
for a single long-running session.
Now multiply that across users, agents, retries, and production traffic, and your “cheap” AI workflow starts turning into an infrastructure bill you did not plan for.
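The arithmetic above can be sketched as a quick back-of-envelope script. All numbers here are assumptions for illustration: 500 new tokens per loop, $3 per 1M uncached input tokens, cache reads at 10% of that rate, and cache writes at a 1.25x premium (the 5-minute-TTL figures discussed below); your model's pricing will differ.

```python
# Back-of-envelope cost model for an agent loop, with and without
# prompt caching. Hypothetical numbers, not any provider's billing logic.
TOKENS_PER_LOOP = 500
INPUT_PRICE = 3.00 / 1_000_000          # $ per uncached input token
CACHE_READ_PRICE = INPUT_PRICE * 0.10   # cache hits at 10% of input price
CACHE_WRITE_PRICE = INPUT_PRICE * 1.25  # cache writes at 1.25x input price

def uncached_cost(loops: int) -> float:
    """Request n re-sends the full 500 * (n - 1) tokens of prior context."""
    return sum(TOKENS_PER_LOOP * (n - 1) * INPUT_PRICE
               for n in range(1, loops + 1))

def cached_cost(loops: int) -> float:
    """Request n reads the cached prefix and pays full (write) price
    only for the newest 500 tokens."""
    cost = 0.0
    for n in range(2, loops + 1):
        cached_prefix = TOKENS_PER_LOOP * (n - 2)    # already seen and cached
        cost += cached_prefix * CACHE_READ_PRICE
        cost += TOKENS_PER_LOOP * CACHE_WRITE_PRICE  # new tokens this turn
    return cost

print(f"100 loops, no caching:   ${uncached_cost(100):.3f}")
print(f"100 loops, with caching: ${cached_cost(100):.3f}")
```

Under these assumptions, the cached run comes out roughly 8x cheaper over 100 loops; the exact ratio depends on your model's read/write pricing and cache TTL.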
Below is a table comparing caching support and pricing impact across popular LLMs.
| Model / platform | Automatic | Implicit | Explicit | Caching discount |
|---|---|---|---|---|
| OpenAI recent models (gpt-4o and newer; prompt caching on 1024+ token prompts) | Yes | No separate "implicit" label | Optional routing hint via prompt_cache_key, but standard caching does not require explicit cache creation | Up to 90% lower input token cost on cache hits; docs also cite up to 80% lower latency |
| Anthropic Claude | No | No | Yes, via cache_control | Cache reads cost 10% of standard input price; cache writes cost 1.25x input for 5-minute TTL or 2x for 1-hour TTL |
| Gemini API (Google AI Studio / direct Gemini API) | Yes, for implicit caching on most Gemini models | Yes | Yes | Implicit: no savings guarantee. Explicit: priced separately with guaranteed savings; pricing page shows low context-caching rates plus storage charges, but the docs do not frame it as a single universal percentage across all models. |
| Gemini on Vertex AI | Yes, implicit caching is on by default for supported Gemini models | Yes | Yes | Implicit: 90% discount on cached tokens. Explicit: 90% discount on Gemini 2.5+ models and 75% discount on Gemini 2.0 models; explicit caching also has storage costs. |
| MiniMax M2.7 / M2.5 / M2.1 series | Yes | Their docs call this passive prompt caching rather than “implicit” | Yes, Anthropic-compatible explicit caching also supported | Automatic/passive caching: discounted cache-hit pricing, example shows 75% savings. Explicit caching: model-specific prices; for example, M2.5 input is $0.3/M and cache-read is $0.03/M, which is a 90% discount; M2.7 cache-read is $0.06/M, an 80% discount. |
| Kimi / Moonshot API | Yes | Not documented under a separate implicit/explicit taxonomy in the surfaced docs | I did not find an official explicit cache-creation feature in the docs I checked | Official pricing docs say it supports automatic context caching and that cached tokens are billed at a cache-hit input rate. The exact discount is model-specific, so not one fixed percentage across all Kimi models. |
| Qwen on Alibaba Cloud Model Studio | Yes | Yes, implicit cache is always enabled on supported models | Yes | Implicit cache: cache hits billed at 20% of standard input price. Explicit cache: creation at 125% of standard input price, hits at 10%. Session cache: enabled by header, same billing as explicit. |
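To make the OpenAI row concrete: caching there is automatic for prompts over 1024 tokens, and prompt_cache_key is only an optional routing hint to improve hit rates for a shared prefix. The sketch below builds a request body only; the model name and prompt text are placeholders, and actually sending it would require the openai SDK and an API key (e.g. client.chat.completions.create(**request)).

```python
# Sketch: OpenAI-style automatic caching with an optional routing hint.
# No explicit cache-creation step is needed; the stable system prompt
# at the front of the message list is what gets cached.
STABLE_PREFIX = "You are a finance assistant..."  # imagine several KB of stable rules

request = {
    "model": "gpt-4o",  # any caching-enabled model
    "messages": [
        # Keep the stable, reusable content first so the cached prefix
        # matches across requests.
        {"role": "system", "content": STABLE_PREFIX},
        {"role": "user", "content": "Summarize the client's risk exposure."},
    ],
    # Optional hint: group requests sharing the same prefix so they are
    # more likely to hit the same cache.
    "prompt_cache_key": "finance-assistant-v1",
}

assert request["prompt_cache_key"] == "finance-assistant-v1"
```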
Unfortunately, Anthropic's API still requires you to explicitly opt in to prompt caching, although they have made it simpler now by adding a global request-level flag, in addition to their convoluted way of specifying cache markers on individual messages.
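Here is a minimal sketch of the per-message opt-in with Anthropic's Messages API, using the cache_control marker on the system prompt. The model name, prompt text, and token limit are placeholders; this only constructs the request body, and sending it would require the anthropic SDK and an API key (client.messages.create(**request)).

```python
# Sketch: explicit Anthropic prompt caching via cache_control.
# Everything up to and including the marked block becomes a cache
# prefix; subsequent reads of that prefix are billed at ~10% of the
# standard input price (writes carry the 1.25x / 2x TTL premium).
LONG_SYSTEM_PROMPT = "You are a finance assistant..."  # imagine several KB of rules

request = {
    "model": "claude-sonnet-4-20250514",  # placeholder; use your model
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # The explicit opt-in: without this marker, Claude reprocesses
            # the full prompt at standard input price on every loop.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Summarize the client's risk exposure."}
    ],
}

assert request["system"][0]["cache_control"] == {"type": "ephemeral"}
```

In an agentic loop, you would also place a cache marker after the tool definitions and move a marker forward as the conversation grows, so each turn reads the previous turns from cache instead of re-buying them.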
Google is still opaque and difficult with caching. They do provide automatic caching, but they do not specify the TTL, and their explicit caching is complex to even understand.
This seemingly innocent mistake keeps most enterprise AI projects from moving beyond an MVP. A shameless plug: I help teams by reviewing their agent implementations to avoid mistakes like this. Reach out to me if you need my help.
Have a similar problem to solve?
Use the contact page if you want help with agent architecture, evaluation, or implementation.
Get in touch