Introduction
Tools used: Koog | Sonnet 4.5
If you’ve ever wired up an agent that makes several tool calls in a row, or run long-running chat sessions, you’ve probably hit the same wall many of us have: token usage blows up fast. And not in a fun “wow, look at that throughput” kind of way - more like “why is my invoice suddenly a small novel?”
Here’s the core issue.
In a typical agent loop, every time the agent decides to call yet-another-tool, or every time you add a new message to the chat history, the entire history has to go up with it. So if your initial message is reasonably small at only ~1,000 tokens, and each tool call or subsequent message adds another ~1,000 tokens, the input tokens for each call look like this:
- 1,000
- 1,000 + (1 × 1,000)
- 1,000 + (2 × 1,000)
- …
- 1,000 + (5 × 1,000)
After just five iterations you’ve already sent around 21,000 cumulative input tokens, and that’s before the agent starts making mistakes and retrying, which is exactly what happened in earlier tool-loop experiments.
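If you want to sanity-check that arithmetic, here’s a tiny plain-Kotlin sketch of the same growth (the numbers are just the ones from the example above, nothing Koog-specific):

```kotlin
fun main() {
    val baseTokens = 1_000      // initial prompt
    val perStepTokens = 1_000   // each tool call / message appended to the history
    val steps = 5

    var cumulative = 0
    for (i in 0..steps) {
        val inputThisCall = baseTokens + i * perStepTokens
        cumulative += inputThisCall
        println("call ${i + 1}: input = $inputThisCall, cumulative = $cumulative")
    }
    // cumulative ends at 21,000 - the total grows quadratically with the number of steps
}
```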
The bad news: your bill and your latency are extremely sensitive to this compounding growth.
The good news: remote prompt caching effectively nukes this entire problem.
How prompt caching fixes it for Kerno
Before we dive in: we use Koog, which offers local prompt caching out of the box (more on the local vs. remote distinction in a moment).
The remote, provider-side flavour is what really changes the math. Instead of re-processing the entire growing conversation every time the model needs context, the provider keeps the big static prefix of your prompt cached, so those tokens don’t have to be paid for at full price on every call. That means:
- Your agent stops re-ingesting huge prompts repeatedly
- Multi-step tool loops stop exploding in cost
- Recovery loops (where the model tries again after an error) stop compounding tokens
- Latency drops because there’s simply less text on the wire
For workflows involving numerous tool calls (think retrieval + parsing + transformation + validation), prompt caching switches the scaling curve from “quadratic panic” to “linear and boring,” which is exactly what you want.
If your agent system ever melted down from context bloat or runaway retry loops, prompt caching is probably the missing piece you needed weeks ago.
Local vs. Remote Prompt Caching: What They Solve (and Why You Need Both)
So far we’ve talked about why prompt caching is essential for controlling token bloat in multi-step workflows. But it turns out there are two very different kinds of caching — and each solves a different part of the problem.
Local Prompt Caching (Koog-style)
Local caching is exactly what it sounds like:
you store the entire request/response pair on disk, and if you make the exact same request again, the system never hits the LLM — it just reads from disk.
Think of it like memoization:
input -> output (saved locally)
same input again -> return saved output instantly
This is great for:
- deterministic workflows
- testing
- repeated calls during development
- scenarios where identical prompts occur frequently
But it only works when your request is 100% identical.
And real chats aren’t identical.
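To make the memoization picture concrete, here’s a minimal sketch of a disk-backed local prompt cache. This is illustrative only - it is not Koog’s actual caching API - and the `callLlm` lambda and the `.prompt-cache` directory are placeholders:

```kotlin
import java.io.File
import java.security.MessageDigest

// Hypothetical local prompt cache: key = hash of the full request, value = saved response.
class LocalPromptCache(private val dir: File = File(".prompt-cache")) {
    init { dir.mkdirs() }

    private fun keyFor(request: String): String =
        MessageDigest.getInstance("SHA-256")
            .digest(request.toByteArray())
            .joinToString("") { "%02x".format(it) }

    fun getOrCall(request: String, callLlm: (String) -> String): String {
        val file = File(dir, keyFor(request))
        if (file.exists()) return file.readText()           // exact same request seen before: skip the LLM
        return callLlm(request).also { file.writeText(it) } // otherwise call the model and save the result
    }
}
```

One new user message is enough to change the hash, so the key no longer matches and the call goes to the model at full price - which is exactly the divergence problem the next example walks through.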
Example: Why Local Caching Isn’t Enough
Imagine your conversation history looks like:
User: asks something
Assistant: responds
User: asks another thing
Assistant: responds
User: asks third thing
Assistant: responds
If you replay this exact sequence later, local caching works perfectly.
But in a real conversation, the next message is new:
User: asks something new <-- cache doesn’t have a response for this
At this point, there’s no cached result — you must hit the LLM. Local caching doesn’t know what comes after the divergence point.
Remote Prompt Caching (Provider-side)
Remote caching lives on the LLM provider’s side - and this is where things get powerful.
Some providers do this automatically, but Anthropic uses an opt-in system: you explicitly mark the point in the chat history up to which the provider should cache (there’s a sketch of what that marker looks like right after the list below).
Every token before that cache marker becomes:
- dramatically cheaper (often ~10× cheaper)
- slightly faster
- no longer reprocessed every call
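Here’s roughly what that opt-in marker looks like if you call Anthropic’s Messages API directly from Kotlin and put a `cache_control` breakpoint on a large system prompt. Treat it as a sketch: the model alias and the exact payload shape should be checked against Anthropic’s current docs, and this is hand-rolled HTTP rather than how we call it in production.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    // Assumes ANTHROPIC_API_KEY is set; the model alias may differ for your account.
    val body = """
    {
      "model": "claude-sonnet-4-5",
      "max_tokens": 1024,
      "system": [
        {
          "type": "text",
          "text": "<big static system prompt and tool definitions go here>",
          "cache_control": {"type": "ephemeral"}
        }
      ],
      "messages": [
        {"role": "user", "content": "First question that relies on the cached context"}
      ]
    }
    """.trimIndent()

    val request = HttpRequest.newBuilder(URI.create("https://api.anthropic.com/v1/messages"))
        .header("x-api-key", System.getenv("ANTHROPIC_API_KEY"))
        .header("anthropic-version", "2023-06-01")
        .header("content-type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()

    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    // The usage block in the response reports cache_creation_input_tokens on the first call
    // and cache_read_input_tokens on later calls that reuse the same prefix.
    println(response.body())
}
```

Note that providers only cache prefixes above a minimum size (on the order of a thousand tokens for Sonnet-class models), so very small prompts won’t benefit.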
Example: How Remote Caching Saves Money
Let’s annotate a conversation:
User: asks
Assistant: responds
User: asks
Assistant: responds
User: asks
Assistant: responds
^ Ask provider to cache up to here
Everything above this line is now cached.
Then you continue:
User: asks new question
Assistant: responds
As long as the sequence up to the cache point is identical, all those earlier tokens are now heavily discounted.
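To put rough numbers on “heavily discounted,” here’s the earlier 1,000-tokens-per-step example priced both ways. The prices are assumptions for a Sonnet-class model at the time of writing (about $3 per million input tokens, with cache reads at roughly a tenth of that); check the current pricing page before relying on them:

```kotlin
fun main() {
    // Assumed prices in USD per million tokens; verify against current Anthropic pricing.
    val inputPerMTok = 3.00
    val cacheReadPerMTok = 0.30   // roughly 10x cheaper than regular input tokens

    val baseTokens = 1_000
    val perStepTokens = 1_000
    val steps = 5

    var uncached = 0.0
    var cached = 0.0
    for (i in 0..steps) {
        val history = baseTokens + i * perStepTokens   // tokens sent on this call
        uncached += history * inputPerMTok / 1_000_000
        // With a cache breakpoint, only the newly added tokens are billed at the full rate;
        // everything before the breakpoint is billed at the cache-read rate.
        val newTokens = if (i == 0) baseTokens else perStepTokens
        cached += newTokens * inputPerMTok / 1_000_000 +
                (history - newTokens) * cacheReadPerMTok / 1_000_000
    }
    println("input cost without caching: %.4f USD".format(uncached)) // ~0.0630
    println("input cost with caching:    %.4f USD".format(cached))   // ~0.0225
}
```

This ignores output tokens and the one-time cache-write surcharge (cache writes cost slightly more than regular input tokens), but the shape is the point: the gap keeps widening the longer the loop runs.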
Where Remote Caching Helps
Full conversation history:
[ cached (cheap) ] [ new expensive tokens ]
^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^
This solves the exact weakness of local caching.
Local caching:
✔ skips repeated entire requests
✘ but only when they are identical
Remote caching:
✔ keeps long histories cheap even when conversations extend
✔ compresses the expensive context window
✔ prevents quadratic token growth during tool loops
✔ works even as the conversation evolves
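One practical way to confirm that last point is to watch the usage counters the Messages API returns. A quick-and-dirty sketch (a real implementation would use a proper JSON library; the field names are the ones Anthropic documents):

```kotlin
// Pull the cache counters out of a Messages API response body to see whether caching is kicking in.
fun cacheStats(responseBody: String): Pair<Int, Int> {
    fun field(name: String) =
        Regex("\"$name\"\\s*:\\s*(\\d+)").find(responseBody)?.groupValues?.get(1)?.toInt() ?: 0
    return field("cache_creation_input_tokens") to field("cache_read_input_tokens")
}

fun main() {
    // Example usage block shaped like a real response (the numbers are made up):
    val body = """{"usage":{"input_tokens":420,"cache_creation_input_tokens":0,"cache_read_input_tokens":5800,"output_tokens":230}}"""
    val (written, read) = cacheStats(body)
    println("cache written: $written tokens, cache read: $read tokens")
    // A large, growing cache_read_input_tokens across calls means the prefix is being reused
    // even as the conversation keeps evolving.
}
```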
Why This Matters for Costs (And Yes, It Gets Wild)
Without prompt caching, either local or provider-side, the costs of tool-driven LLM workflows can explode.
“Without at least this, we’d genuinely be losing hundreds of dollars per month per active paying user. Tool calling ramps up fast.”
But with Anthropic’s Sonnet 4.5 + proper caching:
“The example I was demoing - three implemented scenarios - comes to around $0.03, about one cent per test implemented.”
That’s the difference between:
❌ a system that loses money per user
✔️ a scalable product where costs stay flat and predictable
And once traffic grows, you can start analyzing patterns, optimizing pipelines, and squeezing even more efficiency out of the system.
