Headline metrics: token reduction · fewer tool calls · faster on small and medium codebases · token savings per developer per year
This document reports a structured benchmark comparing Claude Code alone against Claude Code augmented with Kerno Intelligence Tools (KIT). It was designed to answer one practical question: does adding Kerno produce a measurable difference in token consumption, latency, and output quality across representative developer tasks?
The benchmark covers three open-source TypeScript codebases and three open-source Python codebases, ranging from under 20 to 500+ endpoints. In total 43 prompts were analysed, spanning routine lookups that represent day-to-day use through to heavier analytical tasks that represent low-frequency, high-effort work.
According to SWE benchmarks, Claude is the most capable AI coding agent. Given a codebase it has not seen before, it can locate endpoints, trace dependencies, and reason about structure — but it does so by exploring: reading files, running grep patterns, executing shell commands. On small codebases that works reasonably well. On larger or more complex ones, the model spends a significant portion of each session building context it could have started with.
KIT provides that starting context. It pre-indexes a codebase into a structural graph: endpoints, call chains, symbol references, dependency relationships. When Claude needs to answer a question, it queries Kerno's graph rather than exploring the filesystem directly. The practical effect is fewer tool calls, lower token consumption, and in several cases measurably more complete answers.
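To make the idea of a structural graph concrete, here is a minimal sketch of an index that maps endpoints to handlers and records call edges. It is an illustration only: the `CodeIndex` class, its methods, and the handler names are hypothetical and are not KIT's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class CodeIndex:
    """Toy structural index (hypothetical; not KIT's real data model)."""

    endpoints: dict[str, str] = field(default_factory=dict)    # route -> handler symbol
    calls: dict[str, set[str]] = field(default_factory=dict)   # caller -> callees

    def add_endpoint(self, route: str, handler: str) -> None:
        self.endpoints[route] = handler

    def add_call(self, caller: str, callee: str) -> None:
        self.calls.setdefault(caller, set()).add(callee)

    def call_chain(self, route: str) -> list[str]:
        """Return every symbol reachable from the route's handler, depth-first."""
        seen: list[str] = []
        stack = [self.endpoints[route]]
        while stack:
            symbol = stack.pop()
            if symbol in seen:
                continue
            seen.append(symbol)
            # Sort for deterministic output; reverse so the smallest name pops first.
            stack.extend(sorted(self.calls.get(symbol, ()), reverse=True))
        return seen


# Illustrative data: the route appears in the benchmark, the symbols are invented.
index = CodeIndex()
index.add_endpoint("POST /users/login", "login_handler")
index.add_call("login_handler", "verify_password")
index.add_call("login_handler", "issue_token")
index.add_call("issue_token", "sign_jwt")

chain = index.call_chain("POST /users/login")
print(chain)
```

A lookup like `call_chain` answers "what does this endpoint touch?" from the prebuilt graph in one query, which is the kind of question baseline Claude has to answer by reading files and running grep.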
Claude augmented with KIT
Reduction in input tokens across all six codebases and both languages.
Fewer tool calls — baseline Claude used up to 13× more tool calls than augmented sessions.
In token savings per developer, annually — based on Claude's own analysis. KIT is free.
On blast-radius analysis, Kerno-augmented Claude surfaced an architectural risk the baseline response never reached — depth that token counts alone don't capture.
This was a manual benchmark, not an automated test suite. Both conditions — Claude Code with Kerno ("Kerno-augmented") and Claude Code without Kerno ("baseline") — were run by human operators against identical prompts and codebases. Human judgment was used to interpret some outputs, particularly the accuracy comparisons in Section 6.
Disclosure
The benchmark was not blinded: the operator knew which condition they were running. Readers should weigh this accordingly.
For the Kerno-augmented condition, two observability layers ran simultaneously. Langfuse provided LLM-level observability: full traces, per-request token counts (input, output, and cached), tool-call sequences, and span-level timing. Bifrost provided API-level data: latency per request, raw request/response logs, and aggregate session summaries. The two layers were cross-referenced to produce the per-prompt metrics reported here.
For the baseline condition, Claude Code console logs were the primary source for token counts and tool-call sequences. Bifrost timestamps aligned the two datasets so latency comparisons reflect the same prompt under comparable network conditions. Prompt caching was enabled in both conditions and held constant — it is not a variable between the two states.
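The cross-referencing step can be pictured as a join on a shared prompt identifier. The sketch below assumes each source has already been exported to plain records. The field names and numbers are placeholders for illustration, not the real Langfuse or Bifrost export formats and not benchmark results.

```python
# Placeholder records: field names and values are hypothetical.
trace_records = [
    {"prompt_id": "H1", "input_tokens": 1200, "tool_calls": 2},
    {"prompt_id": "L1", "input_tokens": 300, "tool_calls": 1},
    {"prompt_id": "H3", "input_tokens": 900, "tool_calls": 3},
]
gateway_records = [
    {"prompt_id": "H1", "latency_s": 23.0},
    {"prompt_id": "L1", "latency_s": 1.0},
]


def merge_by_prompt(traces, gateway):
    """Join trace-level and gateway-level rows, keeping prompts present in both."""
    latency = {row["prompt_id"]: row["latency_s"] for row in gateway}
    merged = {}
    for row in traces:
        pid = row["prompt_id"]
        if pid in latency:
            merged[pid] = {
                "input_tokens": row["input_tokens"],
                "tool_calls": row["tool_calls"],
                "latency_s": latency[pid],
            }
    return merged


per_prompt = merge_by_prompt(trace_records, gateway_records)
print(per_prompt)
```

Dropping prompts that appear in only one source (here `H3`) is one possible policy. Flagging them for manual review would be an equally reasonable choice.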
Up to eight prompts were tested per codebase, grouped into two categories. Python codebases received a slightly adapted set: Express-specific queries were replaced with FastAPI equivalents, and the two TypeScript-only prompts (user-story generation and OpenAPI spec generation) were excluded from Python runs.
* For H5 we are interested in how Claude first derives the API structure to feed OpenAPI tools.
The benchmark covers TypeScript and Python across six codebases. The two prompt sets differ slightly: Python omits the user-story and OpenAPI prompts and uses FastAPI instead of Express. Go, Java, and Ruby have not been measured.
Prompt caching was enabled in both conditions, so token-cost comparisons reflect caching on both sides: the relative difference carries meaning, not the absolute counts in isolation. Cache-write costs are excluded.
Both baseline and augmented runs used Claude Sonnet 4.6.
Neither condition had prior context: no READMEs, no claude.md, no injected memory, and no custom tools beyond Kerno's. This is an intentionally conservative choice that tests the default day-one experience. A developer with a tailored Claude configuration may see a different baseline.
Across all six codebases, two languages, and up to eight prompts per codebase, the token-consumption gap between Kerno-augmented and baseline Claude is consistent and large: between 88% and 99% fewer input tokens in every case. Tool-call reductions range from 61% to 87%.
The latency picture is more nuanced: Kerno is faster on small and medium codebases, but slower on large ones when generating comprehensive outputs. Both patterns are visible below and explained in the per-codebase commentary.
Token savings are consistent across both languages. Docmost (medium TypeScript) used 6,708 tokens with Kerno versus ~236,000 without, a reduction of about 97%. Keep (medium Python) shows 98.9%: 2,200 vs ~205,800. Apache Airflow (large Python) produces the largest absolute saving in the benchmark: 25,300 vs ~330,000.
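The relative and absolute savings can be recomputed from the three pairs of figures quoted above. This is a small worked calculation. The baseline values are the approximate ones given in the text, so the results are approximate too.

```python
def reduction_pct(augmented: float, baseline: float) -> float:
    """Percentage of baseline tokens avoided, rounded to one decimal place."""
    return round(100 * (1 - augmented / baseline), 1)


# (Kerno-augmented tokens, baseline tokens) as quoted in the text.
figures = {
    "Docmost": (6_708, 236_000),
    "Keep": (2_200, 205_800),
    "Apache Airflow": (25_300, 330_000),
}

for name, (augmented, baseline) in figures.items():
    saved = baseline - augmented
    print(f"{name}: {reduction_pct(augmented, baseline)}% fewer tokens, {saved:,} saved")
```

On these numbers Keep has the highest relative reduction (98.9%), while Airflow has the largest absolute saving (about 304,700 tokens).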
The large-codebase trade-off
On large codebases (Laudspeaker, Airflow), Kerno is slower on average. The pattern is consistent: Kerno invests more time generating comprehensive structured output rather than returning partial results quickly. Whether that trade-off is favourable depends on whether completeness or raw speed is the priority for the task at hand.
The cost picture is consistent: Kerno sessions cost between $0.007 and $0.076, while equivalent baseline sessions cost $0.10 to $0.99. The most expensive Kerno session (Airflow, $0.076) costs less than the cheapest baseline session (Example-Python, $0.10).
This section presents per-prompt results, first for TypeScript (all 8 prompts), then Python (6 prompts). TypeScript token values use Bifrost session totals where available; Python values are summed from individual prompt logs. Baseline values come from Claude Code console logs in both cases. Where a metric shows no clear advantage for either condition, it is reported as-is.
Tests the ability to locate a framework or library within a codebase, a query a developer runs before changing dependencies.
Where the framework exists (medium, 34.9K tokens for baseline), Kerno serves the answer directly in 1 second from its index. The 47s on the large codebase reflects deeper graph traversal on a more complex dependency tree — not token-reading overhead.
On the small codebase, baseline Claude read 32K tokens to resolve symbol references; Kerno answered in 19s with zero file reads. On the large codebase, Kerno saved tokens and was 20 seconds faster.
Kerno was slightly slower on small/medium here, but the accuracy difference is significant — see Section 6, where Kerno-augmented Claude identified a circular dependency via forwardRef and annotated dead test code, while baseline produced a flat file list without role context.
The workspace scan shows the clearest speed advantage: 7s–24s versus 44s–58s for baseline, a 2–7× improvement. Baseline invoked 22–31 tools to reconstruct what Kerno's index already contains.
A quality-versus-speed trade-off. On medium and large codebases Kerno was slower, but the output was substantially more comprehensive — a detailed, grouped endpoint inventory. When completeness matters — documentation, auditing, security review — the latency increase purchases a categorically better answer.
On the medium codebase, the single most dramatic result: a full blast-radius analysis in 23s with zero file reads, while baseline spent 105s and 49K tokens for a less detailed output. The large codebase shows a slight reversal (61s vs 50s) attributable to tracing 13 eager-loaded relations across 140 endpoints — a finding Kerno surfaced explicitly as a key architectural risk.
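A blast-radius query is, at its core, a traversal of reverse dependencies: start at the symbol being changed and collect everything that depends on it, directly or transitively. The sketch below shows that idea with a breadth-first search. The module names and the `blast_radius` helper are hypothetical. This is not how Kerno implements the analysis, only an illustration of why a prebuilt graph can answer the question without reading files.

```python
from collections import deque

# Hypothetical edges meaning "key depends on each value".
depends_on = {
    "AuthController": {"AuthService"},
    "AuthService": {"UserRepository", "TokenService"},
    "ProfileController": {"UserRepository"},
    "TokenService": set(),
    "UserRepository": set(),
}


def blast_radius(target: str, depends_on: dict[str, set[str]]) -> set[str]:
    """Return every node that directly or transitively depends on target."""
    # Invert the edges so each node maps to the nodes that depend on it.
    dependents: dict[str, set[str]] = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)

    affected: set[str] = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected:  # guards against cycles
                affected.add(parent)
                queue.append(parent)
    return affected


impacted = blast_radius("UserRepository", depends_on)
print(sorted(impacted))
```

The visited-set check is what keeps the traversal finite when the graph contains a circular dependency, such as the one reported in the accuracy comparison.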
Baseline produced a negative result on Docmost — correctly concluding no formal user stories exist, but missing that the endpoint structure itself encodes a complete user-story set. Kerno-augmented Claude used 2 API calls, found 90 endpoints, and produced 40+ user stories across 12 domains in 28s. The clearest accuracy gap in the benchmark.
A strong, consistent outperformance: on the small and medium codebases an 87–96% latency reduction, and even on the large codebase Kerno was faster (55s vs 67s) while cutting tokens 95%. On the medium codebase, baseline made 65 tool calls over 391 seconds. Claude's own assessment (Section 8) noted that Kerno's deep static analysis let it spot a spec error file-reading alone would have missed: a 403 response on POST /users/login that the spec documented only as 422.
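The spec error described above amounts to a set difference between the response codes a handler can actually return and the codes the spec documents. A minimal sketch follows. The route and status codes come from the text, while the dictionaries and the `undocumented_codes` helper are hypothetical stand-ins for real static-analysis output and a parsed OpenAPI file.

```python
# Codes listed in the spec for each route (illustrative stand-in for a parsed spec).
documented = {"POST /users/login": {422}}

# Codes found by analysing the handler code (illustrative stand-in for index output).
found_in_code = {"POST /users/login": {403, 422}}


def undocumented_codes(documented, found_in_code):
    """Map each route to the status codes present in code but missing from the spec."""
    gaps = {}
    for route, codes in found_in_code.items():
        missing = codes - documented.get(route, set())
        if missing:
            gaps[route] = sorted(missing)
    return gaps


gaps = undocumented_codes(documented, found_in_code)
print(gaps)
```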
The Python set covers six prompts: H1, H2, L1 (FastAPI location), L2 (route organisation), L3 (Auth Service files), and H3. Apache Airflow did not include L2. H4 and H5 were not run on Python.
Consistent across Python. Airflow shows the largest absolute saving: 10,000 vs 190,000 tokens, with latency improving from 120s to 52s.
The small Python codebase is an anomaly: with only 14 endpoints, direct file exploration is efficient enough that baseline wins. Airflow repeats the large-codebase pattern — Kerno takes longer (180s vs 45s) because it generates a substantially more complete inventory (15,300 output tokens).
Token savings are consistent (Kerno uses zero file-reading tokens in nearly all cases), but latency is more mixed than on TypeScript. On Airflow (large), baseline is notably faster on lightweight queries, likely because graph traversal across 500+ endpoints makes the index query itself non-trivial.
The blast-radius result on Keep is the strongest Python result in the benchmark: zero file-reading tokens, answered in 22s versus 44s, avoiding 4 tool calls that consumed 55,000 tokens.
A response that returns in 5 seconds with 500 tokens but answers the wrong question costs more than a 60-second response that answers correctly. This section compares what each condition actually produced, for three prompts where accuracy differences were most pronounced.
VERDICT
Baseline's negative result was technically accurate — there are no explicit user-story documents — but it missed the insight that an endpoint map is a user-story specification. The question a developer actually wants answered, "what does this product do, from a user's perspective?", was answered substantively only by Kerno-augmented Claude.
VERDICT
The baseline analysis was correct for what it found but incomplete: it traced the immediate call chain yet missed the auth layer, the serialization tree, and the architectural risk. Kerno-augmented Claude produced — in 61 seconds — an output that would take an experienced developer 15–30 minutes to construct manually.
VERDICT
Both responses identified the same five production files — the factual outcome was equivalent. The difference is depth and actionability. For a developer performing impact analysis before a change, the Kerno response is directly actionable; the baseline requires additional follow-up queries.
THE UNDERLYING MECHANISM
Kerno is not just saving tokens by reducing what Claude reads — it is changing which tokens Claude sees. Given a structured endpoint map instead of raw file contents, Claude reasons about architecture rather than text. Given a dependency graph rather than a flat file tree, it identifies risk rather than listing files. The token savings are the visible metric; the accuracy improvement is the practical value.
KIT replaces broad, expensive searches with precise, indexed lookups. Instead of reading entire files, agents jump directly to symbol definitions, find usages across a codebase, trace call chains, and list endpoints without touching unnecessary code.
KIT started as internal tooling — a lightweight code index we built to power Kerno's testing engine, not a standalone product. But as token costs climb and AI agents spend more time (and money) reading files they don't need, we felt it was worth sharing.
It's a stripped-down fork of SCIP with additional engineering to make it lightweight and fast. We have not designed it to replace deep indexing tools — others do that better. What it does is give your agent a smaller, faster path to what it actually needs: jump to a definition, find usages, trace a call chain, list endpoints without touching unnecessary code. Less noise, fewer tokens, faster results.
Add the Kerno MCP to your AI coding agent.
Using the benchmark's raw performance data, reduced token costs alone yield per-developer savings of $220–1,300 per year, rising to $500–4,000 once compound effects (smaller contexts producing better cache hits) are factored in. At 60 sessions per developer per week, that works out to $0.08–0.44 saved per session.
Scaled up, a 10-person team saves roughly $5,000–40,000 annually, while a 1,000-engineer organization saves a conservative $249,500–1,372,000 per year, plausibly reaching $4,000,000 at higher usage intensity once recovered developer time is included. The case is straightforward: token savings alone likely cover the cost of the tooling, before counting the time recovered by resolving queries in around 2 calls instead of 20 or more.
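The scaling arithmetic can be reproduced from the per-session figures. The sketch below assumes 52 working weeks per year, which the text does not state. Under that assumption the per-session range of $0.08–0.44 gives roughly $250–1,373 per developer and $249,600–1,372,800 for 1,000 engineers. That is close to the organization-level range quoted above and slightly above the $220–1,300 per-developer range, which may rest on rounding or on a different number of weeks.

```python
SESSIONS_PER_WEEK = 60   # from the text
WEEKS_PER_YEAR = 52      # assumption: not stated in the text

PER_SESSION_LOW = 0.08   # dollars saved per session, from the text
PER_SESSION_HIGH = 0.44


def annual_saving(per_session: float, developers: int = 1) -> int:
    """Annual token-cost saving in whole dollars for a team of the given size."""
    return round(per_session * SESSIONS_PER_WEEK * WEEKS_PER_YEAR * developers)


for team_size in (1, 10, 1000):
    low = annual_saving(PER_SESSION_LOW, team_size)
    high = annual_saving(PER_SESSION_HIGH, team_size)
    print(f"{team_size:>5} developers: ${low:,} to ${high:,} per year")
```

These figures cover direct token costs only. The higher ranges in the text add compound effects and recovered developer time, which this calculation does not model.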
KIT installs as an MCP server for Claude Code, Cursor, or Codex in under five minutes — and it's free.