Introduction
Tools used: Koog | Sonnet 4.5
If you’ve ever wired up an agent that makes several tool calls in a row, or run long-running chat sessions, you’ve probably hit the same wall many of us have: token usage blows up fast. And not in a fun “wow, look at that throughput” kind of way - more like “why is my invoice suddenly a small novel?”
Here’s the core issue.
In a typical agent loop, every time the agent decides to call yet-another-tool, or every time you add a new message to the chat history, the entire history has to go up with it. So if your initial message is reasonably small at only ~1,000 tokens, and each tool call or subsequent message adds another ~1,000 tokens, the input tokens for each call look like this:
- 1,000
- 1,000 + (1 × 1,000)
- 1,000 + (2 × 1,000)
- …
- 1,000 + (5 × 1,000)
After just five iterations you’ve already sent around 21,000 cumulative input tokens, and that’s before the agent starts making mistakes and retrying, which is exactly what happened in earlier tool-loop experiments.
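If you want to sanity-check that arithmetic, here’s a tiny plain-Kotlin sketch of the same growth (the numbers are just the ones from the example above, nothing Koog-specific):

```kotlin
fun main() {
    val baseTokens = 1_000      // initial prompt
    val perStepTokens = 1_000   // each tool call / message appended to the history
    val steps = 5

    var cumulative = 0
    for (i in 0..steps) {
        val inputThisCall = baseTokens + i * perStepTokens
        cumulative += inputThisCall
        println("call ${i + 1}: input = $inputThisCall, cumulative = $cumulative")
    }
    // cumulative ends at 21,000 - the total grows quadratically with the number of steps
}
```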
The bad news: your bill and your latency are extremely sensitive to this compounding growth.
The good news: remote prompt caching effectively nukes this entire problem.
How prompt caching fixes it for Kerno
Before we dive in: we use Koog, which offers local prompt caching out of the box (more on the local vs. remote distinction in a moment).
The remote, provider-side flavour is what really changes the math. Instead of re-processing the entire growing conversation every time the model needs context, the provider keeps the big static prefix of your prompt cached, so those tokens don’t have to be paid for at full price on every call. That means:
- Your agent stops re-ingesting huge prompts repeatedly
- Multi-step tool loops stop exploding in cost
- Recovery loops (where the model tries again after an error) stop compounding tokens
- Latency drops because there’s simply less text on the wire
For workflows involving numerous tool calls (think retrieval + parsing + transformation + validation), prompt caching switches the scaling curve from “quadratic panic” to “linear and boring,” which is exactly what you want.
If your agent system ever melted down from context bloat or runaway retry loops, prompt caching is probably the missing piece you needed weeks ago.
Local vs. Remote Prompt Caching: What They Solve (and Why You Need Both)
So far we’ve talked about why prompt caching is essential for controlling token bloat in multi-step workflows. But it turns out there are two very different kinds of caching — and each solves a different part of the problem.
Local Prompt Caching (Koog-style)
Local caching is exactly what it sounds like:
you store the entire request/response pair on disk, and if you make the exact same request again, the system never hits the LLM — it just reads from disk.
Think of it like memoization:
input -> output (saved locally)
same input again -> return saved output instantly
This is great for:
- deterministic workflows
- testing
- repeated calls during development
- scenarios where identical prompts occur frequently
But it only works when your request is 100% identical.
And real chats aren’t identical.
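To make the memoization picture concrete, here’s a minimal sketch of a disk-backed local prompt cache. This is illustrative only - it is not Koog’s actual caching API - and the `callLlm` lambda and the `.prompt-cache` directory are placeholders:

```kotlin
import java.io.File
import java.security.MessageDigest

// Hypothetical local prompt cache: key = hash of the full request, value = saved response.
class LocalPromptCache(private val dir: File = File(".prompt-cache")) {
    init { dir.mkdirs() }

    private fun keyFor(request: String): String =
        MessageDigest.getInstance("SHA-256")
            .digest(request.toByteArray())
            .joinToString("") { "%02x".format(it) }

    fun getOrCall(request: String, callLlm: (String) -> String): String {
        val file = File(dir, keyFor(request))
        if (file.exists()) return file.readText()           // exact same request seen before: skip the LLM
        return callLlm(request).also { file.writeText(it) } // otherwise call the model and save the result
    }
}
```

One new user message is enough to change the hash, so the key no longer matches and the call goes to the model at full price - which is exactly the divergence problem the next example walks through.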
Example: Why Local Caching Isn’t Enough
Imagine your conversation history looks like:
User: asks something
Assistant: responds
User: asks another thing
Assistant: responds
User: asks third thing
Assistant: responds
If you replay this exact sequence later, local caching works perfectly.
But in a real conversation, the next message is new:
User: asks something new <-- cache doesn’t have a response for this
At this point, there’s no cached result — you must hit the LLM. Local caching doesn’t know what comes after the divergence point.
Remote Prompt Caching (Provider-side)
Remote caching lives on the LLM provider’s side - and this is where things get powerful.
Some providers do this automatically, but Anthropic uses an opt-in system: you explicitly mark the point in the chat history up to which the provider should cache (there’s a sketch of what that marker looks like right after the list below).
Every token before that cache marker becomes:
- dramatically cheaper (often ~10× cheaper)
- slightly faster
- no longer reprocessed every call
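Here’s roughly what that opt-in marker looks like if you call Anthropic’s Messages API directly from Kotlin and put a `cache_control` breakpoint on a large system prompt. Treat it as a sketch: the model alias and the exact payload shape should be checked against Anthropic’s current docs, and this is hand-rolled HTTP rather than how we call it in production.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    // Assumes ANTHROPIC_API_KEY is set; the model alias may differ for your account.
    val body = """
    {
      "model": "claude-sonnet-4-5",
      "max_tokens": 1024,
      "system": [
        {
          "type": "text",
          "text": "<big static system prompt and tool definitions go here>",
          "cache_control": {"type": "ephemeral"}
        }
      ],
      "messages": [
        {"role": "user", "content": "First question that relies on the cached context"}
      ]
    }
    """.trimIndent()

    val request = HttpRequest.newBuilder(URI.create("https://api.anthropic.com/v1/messages"))
        .header("x-api-key", System.getenv("ANTHROPIC_API_KEY"))
        .header("anthropic-version", "2023-06-01")
        .header("content-type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()

    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    // The usage block in the response reports cache_creation_input_tokens on the first call
    // and cache_read_input_tokens on later calls that reuse the same prefix.
    println(response.body())
}
```

Note that providers only cache prefixes above a minimum size (on the order of a thousand tokens for Sonnet-class models), so very small prompts won’t benefit.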
Example: How Remote Caching Saves Money
Let’s annotate a conversation:
User: asks
Assistant: responds
User: asks
Assistant: responds
User: asks
Assistant: responds
^ Ask provider to cache up to here
Everything above this line is now cached.
Then you continue:
User: asks new question
Assistant: responds
As long as the sequence up to the cache point is identical, all those earlier tokens are now heavily discounted.
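To put rough numbers on “heavily discounted,” here’s the earlier 1,000-tokens-per-step example priced both ways. The prices are assumptions for a Sonnet-class model at the time of writing (about $3 per million input tokens, with cache reads at roughly a tenth of that); check the current pricing page before relying on them:

```kotlin
fun main() {
    // Assumed prices in USD per million tokens; verify against current Anthropic pricing.
    val inputPerMTok = 3.00
    val cacheReadPerMTok = 0.30   // roughly 10x cheaper than regular input tokens

    val baseTokens = 1_000
    val perStepTokens = 1_000
    val steps = 5

    var uncached = 0.0
    var cached = 0.0
    for (i in 0..steps) {
        val history = baseTokens + i * perStepTokens   // tokens sent on this call
        uncached += history * inputPerMTok / 1_000_000
        // With a cache breakpoint, only the newly added tokens are billed at the full rate;
        // everything before the breakpoint is billed at the cache-read rate.
        val newTokens = if (i == 0) baseTokens else perStepTokens
        cached += newTokens * inputPerMTok / 1_000_000 +
                (history - newTokens) * cacheReadPerMTok / 1_000_000
    }
    println("input cost without caching: %.4f USD".format(uncached)) // ~0.0630
    println("input cost with caching:    %.4f USD".format(cached))   // ~0.0225
}
```

This ignores output tokens and the one-time cache-write surcharge (cache writes cost slightly more than regular input tokens), but the shape is the point: the gap keeps widening the longer the loop runs.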
Where Remote Caching Helps
Full conversation history:
[ cached (cheap) ] [ new expensive tokens ]
^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^
This solves the exact weakness of local caching.
Local caching:
✔ skips repeated entire requests
✘ but only when they are identical
Remote caching:
✔ keeps long histories cheap even when conversations extend
✔ compresses the expensive context window
✔ prevents quadratic token growth during tool loops
✔ works even as the conversation evolves
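One practical way to confirm that last point is to watch the usage counters the Messages API returns. A quick-and-dirty sketch (a real implementation would use a proper JSON library; the field names are the ones Anthropic documents):

```kotlin
// Pull the cache counters out of a Messages API response body to see whether caching is kicking in.
fun cacheStats(responseBody: String): Pair<Int, Int> {
    fun field(name: String) =
        Regex("\"$name\"\\s*:\\s*(\\d+)").find(responseBody)?.groupValues?.get(1)?.toInt() ?: 0
    return field("cache_creation_input_tokens") to field("cache_read_input_tokens")
}

fun main() {
    // Example usage block shaped like a real response (the numbers are made up):
    val body = """{"usage":{"input_tokens":420,"cache_creation_input_tokens":0,"cache_read_input_tokens":5800,"output_tokens":230}}"""
    val (written, read) = cacheStats(body)
    println("cache written: $written tokens, cache read: $read tokens")
    // A large, growing cache_read_input_tokens across calls means the prefix is being reused
    // even as the conversation keeps evolving.
}
```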
Why This Matters for Costs (And Yes, It Gets Wild)
Without prompt caching, either local or provider-side, the costs of tool-driven LLM workflows can explode.
“Without at least this, we’d genuinely be losing hundreds of dollars per month per active paying user. Tool calling ramps up fast.”
But with Anthropic’s Sonnet 4.5 + proper caching:
“The example I was demoing - three implemented scenarios - comes to around $0.03, about one cent per test implemented.”
That’s the difference between:
❌ a system that loses money per user
✔️ a scalable product where costs stay flat and predictable
And once traffic grows, you can start analyzing patterns, optimizing pipelines, and squeezing even more efficiency out of the system.
