Kerno Intelligence Tools (KIT) — Benchmark · May 2026

Comparing Claude's grep capabilities versus Claude augmented with Kerno intelligence tooling

TypeScript and Python Edition

88–99%

Token reduction

61–92%

Fewer tool calls

2–7×

Faster · small & medium

$4,000

Token savings per developer / yr

01 / EXECUTIVE SUMMARY

Does adding Kerno to Claude Code make a measurable difference?

This document reports a structured benchmark comparing Claude Code alone against Claude Code augmented with Kerno Intelligence Tools (KIT). It was designed to answer one practical question: does adding Kerno produce a measurable difference in token consumption, latency, and output quality across representative developer tasks?

The benchmark covers three open-source TypeScript codebases and three open-source Python codebases, ranging from 14 to 500+ endpoints. In total 43 prompts were analysed, spanning routine lookups that represent day-to-day use through to heavier analytical tasks that represent low-frequency, high-effort work.

Background

According to software-engineering (SWE) benchmarks, Claude is among the most capable AI coding agents. Given a codebase it has not seen before, it can locate endpoints, trace dependencies, and reason about structure, but it does so by exploring: reading files, running grep patterns, executing shell commands. On small codebases that works reasonably well. On larger or more complex ones, the model spends a significant portion of each session building context it could have started with.

KIT provides that starting context. It pre-indexes a codebase into a structural graph: endpoints, call chains, symbol references, dependency relationships. When Claude needs to answer a question, it queries Kerno's graph rather than exploring the filesystem directly. The practical effect is fewer tool calls, lower token consumption, and in several cases measurably more complete answers.
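To make the idea of a pre-built structural graph concrete, here is a minimal sketch in Python. It is not KIT's implementation: the `StructuralIndex` class, its method names, and the sample symbols are all hypothetical, chosen only to show how a call graph built once can answer "what is affected if this changes?" without re-reading files.

```python
from collections import defaultdict, deque


class StructuralIndex:
    """Toy call graph (hypothetical): an edge caller -> callee means 'caller uses callee'."""

    def __init__(self):
        self.callees = defaultdict(set)  # caller -> symbols it calls
        self.callers = defaultdict(set)  # callee -> symbols that call it

    def add_call(self, caller, callee):
        self.callees[caller].add(callee)
        self.callers[callee].add(caller)

    def blast_radius(self, symbol):
        """Return every symbol that depends on `symbol`, directly or transitively."""
        seen, queue = set(), deque([symbol])
        while queue:
            current = queue.popleft()
            for caller in self.callers.get(current, ()):
                if caller not in seen:
                    seen.add(caller)
                    queue.append(caller)
        return seen


# Hypothetical sample data, not taken from any benchmarked codebase.
index = StructuralIndex()
index.add_call("GET /users", "UserController.list")
index.add_call("UserController.list", "UserService.find_all")
index.add_call("POST /login", "AuthService.login")
index.add_call("AuthService.login", "UserService.find_all")

print(sorted(index.blast_radius("UserService.find_all")))
# ['AuthService.login', 'GET /users', 'POST /login', 'UserController.list']
```

The reverse-edge map is the design choice that matters here: the traversal walks "who calls me" edges that were stored at indexing time, so the lookup costs no file reads at query time.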

Key findings

Claude augmented with KIT

88–99%

Reduction in input tokens across all six codebases and both languages.

13×

Fewer tool calls — baseline Claude used up to 13× more tool calls than augmented sessions.

$4,000

In token savings per developer, annually — based on Claude's own analysis. KIT is free.

Richer answers

On blast-radius analysis, Kerno-augmented Claude surfaced an architectural risk the baseline response never reached — depth that token counts alone don't capture.

02 / METHODOLOGY

A manual, human-run benchmark.

Test design

This was a manual benchmark, not an automated test suite. Both conditions — Claude Code with Kerno ("Kerno-augmented") and Claude Code without Kerno ("baseline") — were run by human operators against identical prompts and codebases. Human judgment was used to interpret some outputs, particularly the accuracy comparisons in Section 6.

Disclosure
The benchmark was not blinded: the operator knew which condition they were running. Readers should weigh this accordingly.

Tooling & observability

For the Kerno-augmented condition, two observability layers ran simultaneously. Langfuse provided LLM-level observability: full traces, per-request token counts (input, output, and cached), tool-call sequences, and span-level timing. Bifrost provided API-level data: latency per request, raw request/response logs, and aggregate session summaries. The two layers were cross-referenced to produce the per-prompt metrics reported here.

For the baseline condition, Claude Code console logs were the primary source for token counts and tool-call sequences. Bifrost timestamps aligned the two datasets so latency comparisons reflect the same prompt under comparable network conditions. Prompt caching was enabled in both conditions and held constant — it is not a variable between the two states.

Prompt set

Up to eight prompts were tested per codebase, grouped into two categories. Python codebases received a slightly adapted set: Express-specific queries were replaced with FastAPI equivalents, and the two TypeScript-only prompts (user-story generation and OpenAPI spec generation) were excluded from Python runs.

Prompt                                          Category     ID
Where does [framework] live in this codebase?   Lightweight  L1
Which [framework] symbols are used, and where?  Lightweight  L2
Show every file that calls [service]            Lightweight  L3
Scan workspace                                  Heavy        H1
List all endpoints                              Heavy        H2
Show me the blast radius [endpoint]             Heavy        H3
Find all user stories for this project          Heavy        H4
Generate OpenAPI specifications*                Heavy        H5

* For H5 we are interested in how Claude first derives the API structure to feed OpenAPI tools.

Codebases

Lang  Codebase            Size    Endpoints  Description
TS    Example-TypeScript  Small   20         Example codebase created by Kerno
TS    Docmost             Medium  80         Collaborative documentation platform
TS    Laudspeaker         Large   200+       Customer messaging platform
PY    Example-Python      Small   14         Example codebase created by Kerno
PY    Keep                Medium  123        Note-taking and knowledge base
PY    Apache Airflow      Large   500+       Workflow orchestration platform

Consistency controls

  • The Kerno-augmented condition used Claude Sonnet 4.6 throughout.
  • The baseline condition used Claude Code's adaptive model setting — the default for most users, representing real-world baseline behaviour rather than a pinned single-model comparison.
  • Prompt caching was active in both conditions and held constant.
  • Claude began as a blank slate in both cases: no README context, no claude.md files, no custom memory or skills.
  • Token costs for the Kerno tool calls themselves were excluded: at under 1K tokens, they were judged to have no material impact on the totals.

03 / LIMITATIONS

The benchmark conditions used.

Language

Covers TypeScript and Python across six codebases. The two prompt sets differ slightly (Python omits user-story and OpenAPI prompts, and uses FastAPI instead of Express). Results for Go, Java, or Ruby have not been measured.

Prompt caching

Enabled in both conditions, so token-cost comparisons reflect caching on both sides — the relative difference carries meaning, not the absolute counts in isolation. Cache-write costs are excluded.

Model

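The Kerno-augmented runs used Claude Sonnet 4.6; the baseline runs used Claude Code's adaptive model setting. The comparison is therefore not a pinned single-model comparison.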

Claude context

No prior context in either condition: no READMEs, no claude.md, no injected memory, no custom tools beyond Kerno's. An intentionally conservative choice — it tests the default day-one experience. A developer with a tailored Claude configuration may see a different baseline.

04 / RESULTS: AGGREGATE

Consistent, large token savings and a nuanced latency picture.

Across all six codebases, two languages, and up to eight prompts per codebase, the token-consumption gap between Kerno-augmented and baseline Claude is consistent and large: between 88% and 99% fewer input tokens in every case. Tool-call reductions range from 61% to 92%.

The latency picture is more nuanced: Kerno is faster on small and medium codebases, but slower on large ones when generating comprehensive outputs. Both patterns are visible below and explained in the per-codebase commentary.

Input-token reduction vs baseline
Percentage fewer input tokens, representative session per codebase. Higher is better.

Language    Codebase        Size · Endpoints  Reduction
TypeScript  Example-TS      Small · 20        97.0%
TypeScript  Docmost         Medium · 80       97.2%
TypeScript  Laudspeaker     Large · 200+      89.1%
Python      Example-PY      Small · 14        98.5%
Python      Keep            Medium · 123      98.9%
Python      Apache Airflow  Large · 500+      92.3%

Token savings are consistent across both languages. Docmost (medium TypeScript) shows 97.2%: 6,708 tokens with Kerno versus ~236,000 without. Keep (medium Python) shows 98.9%: 2,200 vs ~205,800. Apache Airflow (large Python) produces the largest absolute saving in the benchmark: 25,300 vs ~330,000.
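The percentages follow directly from the absolute token counts. The short sketch below recomputes them; the function and variable names are illustrative, and the baseline figures are the approximate (~) values quoted in the report.

```python
def reduction_pct(with_kerno: int, baseline: int) -> float:
    """Percentage fewer input tokens relative to the baseline session."""
    return round((baseline - with_kerno) / baseline * 100, 1)


# (Kerno-augmented tokens, approximate baseline tokens) per representative session
sessions = {
    "Docmost": (6_708, 236_000),
    "Keep": (2_200, 205_800),
    "Apache Airflow": (25_300, 330_000),
}

for name, (kerno, base) in sessions.items():
    print(f"{name}: {reduction_pct(kerno, base)}% fewer, {base - kerno:,} tokens saved")
# Docmost: 97.2% fewer, 229,292 tokens saved
# Keep: 98.9% fewer, 203,600 tokens saved
# Apache Airflow: 92.3% fewer, 304,700 tokens saved
```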

Absolute input tokens: the same query, two conditions
Representative session. Lower is better.

Codebase  Size       Kerno-augmented  Baseline
Docmost   Medium TS  6,708            ~236,000
Keep      Medium PY  2,200            ~205,800
Airflow   Large PY   25,300           ~330,000

Average latency per prompt: small & medium codebases
Seconds. Lower is better. Large codebases are a deliberate trade-off (see below).

Codebase    Size       Kerno-augmented  Baseline
Example-TS  Small TS   24.2s            41.5s
Docmost     Medium TS  31.9s            94.5s
Example-PY  Small PY   15.3s            12.7s
Keep        Medium PY  20.8s            37.2s

The large-codebase trade-off
On large codebases (Laudspeaker, Airflow) Kerno is slower on average. This is consistent: on large codebases, Kerno invests more time generating comprehensive structured output rather than returning partial results quickly. Whether that trade-off is favourable depends on whether completeness or raw speed is the priority for the task at hand.

Tool calls per session
Every call avoided is a file read, grep, or shell execution that no longer consumes context.

Codebase  Size       Kerno-augmented  Baseline   Reduction
Docmost   Medium TS  24               190 calls  87% fewer
Keep      Medium PY  5                71 calls   92% fewer

The cost picture is consistent: Kerno sessions cost between $0.007 and $0.076, while equivalent baseline sessions cost $0.10 to $0.99. The most expensive Kerno session (Airflow, $0.076) costs less than the cheapest baseline session (Example-Python, $0.10).
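The session costs can be reproduced from the input-token ranges and the $3/MTok input price used in this report. The sketch below counts input tokens only; the function and constant names are illustrative.

```python
PRICE_PER_MTOK = 3.00  # USD per million input tokens, as quoted in the report


def session_cost(input_tokens: int) -> float:
    """Input-token cost of one session in USD (output tokens not included)."""
    return input_tokens / 1_000_000 * PRICE_PER_MTOK


print(round(session_cost(2_200), 3))    # 0.007  (smallest Kerno session)
print(round(session_cost(25_300), 3))   # 0.076  (largest Kerno session, Airflow)
print(round(session_cost(34_000), 2))   # 0.1    (smallest baseline session, ~34K tokens)
print(round(session_cost(330_000), 2))  # 0.99   (largest baseline session)
```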

Cost per session range (Claude Sonnet, $3/MTok input)
Kerno's entire range sits below the baseline's floor.

Kerno     $0.007 to $0.076
Baseline  $0.10 to $0.99

05 / RESULTS: PER PROMPT

Prompt by prompt, where the advantage shows up.

This section presents per-prompt results, first for TypeScript (all 8 prompts), then Python (6 prompts). TypeScript token values use Bifrost session totals where available; Python values are summed from individual prompt logs. Baseline values come from Claude Code console logs in both cases. Where a metric shows no clear advantage for either condition, it is reported as-is.

Lightweight queries · TypeScript

L1 — Where does [framework] live in this codebase?

Tests the ability to locate a framework or library within a codebase, a query a developer runs before changing dependencies.

Codebase Kerno tok Kerno lat Base tok Base lat Δ Tokens
Small 0 16s 0 9s N/A
Medium 0 1s 34,900 27s ~100%
Large 0 47s 0 22s N/A

Where the framework exists (medium, 34.9K tokens for baseline), Kerno serves the answer directly in 1 second from its index. The 47s on the large codebase reflects deeper graph traversal on a more complex dependency tree — not token-reading overhead.

L2 — Which [framework] symbols are used, and where?

Codebase Kerno tok Kerno lat Base tok Base lat Δ Tokens
Small 0 19s 32,000 20s ~100%
Medium 0 26s 1,000 13s ~100%
Large 0 23s 1,000 43s ~100%

On the small codebase, baseline Claude read 32K tokens to resolve symbol references; Kerno answered in 19s with zero file reads. On the large codebase, Kerno saved tokens and was 20 seconds faster.

L3 — Show every file that calls [service]

Codebase Kerno tok Kerno lat Base tok Base lat Δ Tokens
Small 0 25s 0 13s N/A
Medium 0 16s 1,000 12s ~100%
Large 0 26s 1,000 31s ~100%

Kerno was slightly slower on small/medium here, but the accuracy difference is significant — see Section 6, where Kerno-augmented Claude identified a circular dependency via forwardRef and annotated dead test code, while baseline produced a flat file list without role context.

Heavy queries · TypeScript

H1 — Scan workspace

Codebase Kerno tok Kerno t Base tok Base tools Base t Δ Tok
Small 1,133 21s ~39,000 31 58s -97.1%
Medium 512 7s ~39,000 22 52s -98.7%
Large 8,200 24s ~55,000 26 44s -85.1%

The workspace scan shows the clearest speed advantage: 7s–24s versus 44s–58s for baseline, a 2–7× improvement. Baseline invoked 22–31 tools to reconstruct what Kerno's index already contains.
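The speed-up factors can be recomputed from the H1 table. In this small sketch the dictionary name is illustrative and the latencies are copied from the rows above; the large codebase lands just under 2×.

```python
# (Kerno latency, baseline latency) in seconds, from the H1 TypeScript table
h1_latency_s = {"Small": (21, 58), "Medium": (7, 52), "Large": (24, 44)}

for size, (kerno, base) in h1_latency_s.items():
    print(f"{size}: {base / kerno:.1f}x faster")
# Small: 2.8x faster
# Medium: 7.4x faster
# Large: 1.8x faster
```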

H2 — List all endpoints

Codebase Kerno tok Kerno t Base tok Base tools Base t Δ Tok
Small 1,122 16s ~28,000 6 10s -96.0%
Medium 5,089 105s ~113,000 44 91s -95.5%
Large 8,120 171s ~89,000 26 55s -90.9%

A quality-versus-speed trade-off. On medium and large codebases Kerno was slower, but the output was substantially more comprehensive — a detailed, grouped endpoint inventory. When completeness matters — documentation, auditing, security review — the latency increase purchases a categorically better answer.

H3 — Blast radius for a specific endpoint

Codebase Kerno tok Kerno lat Base tok Base tools Base lat Δ
Small 0 22s 0 0 44s N/A
Medium 0 23s ~49,000 34 105s -100%
Large 0 61s ~27,000 11 50s -100%

On the medium codebase, the single most dramatic result: a full blast-radius analysis in 23s with zero file reads, while baseline spent 105s and 49K tokens for a less detailed output. The large codebase shows a slight reversal (61s vs 50s) attributable to tracing 13 eager-loaded relations across 140 endpoints — a finding Kerno surfaced explicitly as a key architectural risk.

H4 — Find all user stories for this project

Codebase Kerno tok Kerno lat Base tok Base tools Base lat Δ Tok
Small 3,120 60s ~31,000 14 48s -89.9%
Medium 3,000 28s ~31,600 25 65s -90.5%
Large ~776 50s ~62,000 44 104s -98.7%

Baseline produced a negative result on Docmost — correctly concluding no formal user stories exist, but missing that the endpoint structure itself encodes a complete user-story set. Kerno-augmented Claude used 2 API calls, found 90 endpoints, and produced 40+ user stories across 12 domains in 28s. The clearest accuracy gap in the benchmark.

H5 — Generate OpenAPI specifications

Codebase Kerno tok Kerno lat Base tok Base lat Δ
Small 0 5s ~10,000 130s -100% tok · 96% faster
Medium 0 49s ~126,000 391s -100% tok · 87% faster
Large 2,850 55s ~57,800 67s -95.1% tok · 18% faster

A strong, consistent outperformance: on the small and medium codebases an 87–96% latency reduction, and even on the large codebase Kerno was faster (55s vs 67s) while cutting tokens 95%. On the medium codebase, baseline made 65 tool calls over 391 seconds. Claude's own assessment (Section 8) noted that Kerno's deep static analysis let it spot a spec error file-reading alone would have missed: a 403 response on POST /users/login that the spec documented only as 422.

Python codebase results

The Python set covers six prompts: H1, H2, L1 (FastAPI location), L2 (route organisation), L3 (Auth Service files), and H3. Apache Airflow did not include L2. H4 and H5 were not run on Python.

H1 — Scan workspace (Python)

Codebase Kerno tok Kerno t Base tok Base tools Base t Saving
Example-PY Small 468 5s 31,400 17 32s -98.5%
Keep Medium 2,200 26s 40,000 17 41s -94.5%
Airflow Large 10,000 52s 190,000 35 120s -94.7%

Consistent across Python. Airflow shows the largest absolute saving: 10,000 vs 190,000 tokens, with latency improving from 120s to 52s.

H2: List all endpoints (Python)

Codebase Kerno tok Kerno t Base tok Base tools Base t Saving
Example-PY Small 1,759 21s 1,000 1 11s N/A (base faster)
Keep Medium 0 28s 107,000 47 91s -100%
Airflow Large 15,300 180s 120,000 24 45s -87.3%

The small Python codebase is an anomaly: with only 14 endpoints, direct file exploration is efficient enough that baseline wins. Airflow repeats the large-codebase pattern — Kerno takes longer (180s vs 45s) because it generates a substantially more complete inventory (15,300 output tokens).

L1, L2, L3 — Lightweight queries (Python)

Codebase / Prompt Kerno tok Kerno t Base tok Base t Outcome
Example-PY · L1 0 10s 0 2s Baseline 8s faster
Example-PY · L2 0 2s 0 5s Kerno 3s faster
Example-PY · L3 0 35s 1,000 10s 100% token saving
Keep · L1 0 16s 1,000 10s 100% token saving
Keep · L2 0 17s 1,000 21s 100% saving · Kerno 4s faster
Keep · L3 0 16s 1,800 16s 100% saving · equal latency
Airflow · L1 0 90s 10,000 32s 100% saving · baseline 58s faster
Airflow · L3 0 42s 5,000 15s 100% saving · baseline 27s faster

Token savings are consistent (Kerno uses zero file-reading tokens in nearly all cases), but latency is more mixed than TypeScript. On Airflow (large), baseline is notably faster on lightweight queries, likely because graph traversal across 500+ endpoints makes the index query itself non-trivial.

H3 — Blast radius (Python)

Codebase Kerno tok Kerno t Base tok Base tools Base t Saving
Example-PY Small 0 19s 1,000 1 16s -100% · base 3s faster
Keep Medium 0 22s 55,000 4 44s -100% · 50% faster
Airflow Large 0 58s 44,000 5 44s -100% · base 14s faster

The blast-radius result on Keep is the strongest Python result in the benchmark: zero file-reading tokens, answered in 22s versus 44s, avoiding 4 tool calls that consumed 55,000 tokens.

06 / RESULTS: ACCURACY

Beyond tokens: quality of output.

A response that returns in 5 seconds with 500 tokens but answers the wrong question costs more than a 60-second response that answers correctly. This section compares what each condition actually produced, for three prompts where accuracy differences were most pronounced.

Comparison 1 — Finding user stories | Docmost, 80 endpoints

claude — without kerno
find all user stories for this project
reading README.md …
scanning .github/ …
checking GitHub Issues …
grep "user story" src/** → 0 matches
reading docs/ …
✗ No explicit user stories found.
Conclusion: this project does not document formal user stories.

claude + kerno
find all user stories for this project
kerno_list_endpoints → 90 endpoints
deriving stories from endpoint shapes…
✓ 40+ user stories across 12 domains:
Authentication · Sessions · Spaces · Pages · Transclusion · Search · Notifications · …

VERDICT
Baseline's negative result was technically accurate — there are no explicit user-story documents — but it missed the insight that an endpoint map is a user-story specification. The question a developer actually wants answered, "what does this product do, from a user's perspective?", was answered substantively only by Kerno-augmented Claude.

Comparison 2 — Blast radius analysis | Laudspeaker, 174 endpoints

claude — without kerno
blast radius: GET /api/workspaces/channels
⚠ Kerno's index is still down
falling back to grep…
searched 3 patterns · read 1 file
→ basic call chain: controller → service
(auth layer / serialization not traced)

claude + kerno
blast radius: GET /api/workspaces/channels
✓ full request pipeline:
middleware → Passport guard → auth helper → serializer → controller → service
13 entities serialized · every file annotated
▸ ARCHITECTURAL RISK
"fat auth query" auth.helper.ts:122-131 loads 13 relations on every request → affects all 140 endpoints

VERDICT
The baseline analysis was correct for what it found but incomplete: it traced the immediate call chain yet missed the auth layer, the serialization tree, and the architectural risk. Kerno-augmented Claude produced — in 61 seconds — an output that would take an experienced developer 15–30 minutes to construct manually.

Comparison 3 — Files calling Auth Service | Laudspeaker, 174 endpoints

claude — without kerno
every file that calls Auth Service
searched 3 patterns
→ 5 production files:
accounts.service.ts
customers.service.ts …
(flat list · brief descriptions)

claude + kerno
every file that calls Auth Service
✓ same 5 production files, plus:
per-method call sites + line numbers
▸ circular dependency: accounts.service.ts injects AuthService via forwardRef
▸ dead code: auth.service.spec.ts:156-209 (commented-out tests, never removed)
→ offers call_hierarchy / find_references

VERDICT
Both responses identified the same five production files — the factual outcome was equivalent. The difference is depth and actionability. For a developer performing impact analysis before a change, the Kerno response is directly actionable; the baseline requires additional follow-up queries.
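A circular dependency of the kind flagged above is easy to surface once dependencies are held as a graph rather than a list of files. The sketch below is a generic depth-first cycle search, not Kerno's method; the `find_cycle` name and the sample injection graph are hypothetical, and the edge from `AuthService` back to `AccountsService` is an assumption made purely for illustration.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    colour = {node: WHITE for node in graph}
    stack = []

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for dep in graph.get(node, ()):
            state = colour.get(dep, WHITE)
            if state == GREY:  # dep is already on the current path: cycle found
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical injection graph: service -> services it injects.
injections = {
    "AccountsService": ["AuthService"],
    "AuthService": ["AccountsService"],  # assumed back-edge, for illustration only
    "CustomersService": ["AuthService"],
}
print(find_cycle(injections))
# ['AccountsService', 'AuthService', 'AccountsService']
```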

THE UNDERLYING MECHANISM

Kerno is not just saving tokens by reducing what Claude reads — it is changing which tokens Claude sees. Given a structured endpoint map instead of raw file contents, Claude reasons about architecture rather than text. Given a dependency graph rather than a flat file tree, it identifies risk rather than listing files. The token savings are the visible metric; the accuracy improvement is the practical value.

07 / ABOUT KIT

A code-intelligence layer, delivered over MCP.

KIT replaces broad, expensive searches with precise, indexed lookups. Instead of reading entire files, agents jump directly to symbol definitions, find usages across a codebase, trace call chains, and list endpoints without touching unnecessary code.

KIT started as internal tooling — a lightweight code index we built to power Kerno's testing engine, not a standalone product. But as token costs climb and AI agents spend more time (and money) reading files they don't need, we felt it was worth sharing.

It's a stripped-down fork of SCIP with additional engineering to make it lightweight and fast. We have not designed it to replace deep indexing tools — others do that better. What it does is give your agent a smaller, faster path to what it actually needs: jump to a definition, find usages, trace a call chain, list endpoints without touching unnecessary code. Less noise, fewer tokens, faster results.
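As a rough illustration of what "jump to a definition" and "find usages" mean at the index level, the sketch below builds a tiny symbol index with Python's standard `ast` module. This is a stand-in chosen for brevity and says nothing about how KIT or SCIP work internally; `index_source` and the sample source are hypothetical.

```python
import ast
from collections import defaultdict


def index_source(source: str):
    """Map each function name to its definition line and to the lines where it is called."""
    tree = ast.parse(source)
    definitions, usages = {}, defaultdict(list)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            definitions[node.name] = node.lineno
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            usages[node.func.id].append(node.lineno)
    return definitions, {name: sorted(lines) for name, lines in usages.items()}


sample = '''
def load_user(user_id):
    return {"id": user_id}

def handler(request):
    user = load_user(request)
    return load_user(user["id"])
'''

definitions, usages = index_source(sample)
print(definitions)  # {'load_user': 2, 'handler': 5}
print(usages)       # {'load_user': [6, 7]}
```

Once such an index exists, answering "where is `load_user` defined and who calls it?" is a dictionary lookup rather than a grep over every file.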

Support for Ruby, Java, and Go is arriving in Q3 2026.

Get started in less than 1 minute

STEP 1

Install the Kerno CLI

Download the Kerno agent to your machine.

STEP 2

Initialize your project

Install the Kerno agent inside your workspace.

STEP 3

Configure the Kerno MCP

Add the Kerno MCP to your AI coding agent.

08 / RETURN ON INVESTMENT (ROI) SUMMARY

Return on Investment (ROI) Summary.

Using the benchmark's raw performance data, reduced token costs alone yield per-developer savings of $220–1,300 per year, rising to $500–4,000 once compound effects (smaller contexts producing better cache hits) are factored in. At 60 sessions per developer per week, that works out to $0.08–0.44 saved per session.

Scaled up, a 10-person team saves roughly $5,000–40,000 annually, while a 1,000-engineer organization saves a conservative $249,500–1,372,000 per year, plausibly reaching $4,000,000 at higher usage intensity once recovered developer time is included. The case is straightforward: token savings alone likely cover the cost of the tooling, before counting the time recovered by resolving queries in around 2 calls instead of 20 or more.
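The scaling arithmetic can be sketched as below. The per-session savings and the 60 sessions per week come from the text above; the 52 working weeks per year is an assumption (the report does not state it), and the function name is illustrative. Under that assumption the result lands close to the 1,000-engineer token-savings range quoted above.

```python
def annual_savings(saving_per_session, sessions_per_week=60,
                   weeks_per_year=52, developers=1):
    """Yearly token-cost savings in USD; weeks_per_year=52 is an assumption."""
    return saving_per_session * sessions_per_week * weeks_per_year * developers


low = annual_savings(0.08, developers=1_000)
high = annual_savings(0.44, developers=1_000)
print(f"${low:,.0f} to ${high:,.0f}")  # $249,600 to $1,372,800
```

Changing `weeks_per_year` or `sessions_per_week` shows how sensitive the headline figure is to usage intensity.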

ROI Summary

Metric                                            Without Kerno      With Kerno         Improvement
Input tokens (all 6 codebases, avg per session)   ~34K to ~330K      2,200 to 25,300    88% to 99% reduction
Latency (small/medium codebases, avg per prompt)  ~13 to 94 seconds  ~15 to 32 seconds  Up to 66% faster
Latency (large codebases, avg per prompt)         ~51 to 52 seconds  ~57 to 84 seconds  Trade-off: richer output
Tool calls (total per session)                    20 to 190 calls    5 to 48 requests   61% to 92% fewer
Est. cost per session (Claude Sonnet, $3/MTok)    $0.10 to $0.99     $0.007 to $0.076   >90% cost saving

Give your agent a faster path to what it needs.

KIT installs as an MCP server for Claude Code, Cursor, or Codex in under five minutes — and it's free.

Setup in 60 seconds | Works with any AI coding agent