KIT (Kerno Intelligence Tools) Benchmark

01 / EXECUTIVE SUMMARY

Does adding Kerno to Claude Code make a measurable difference?

This document reports a structured benchmark comparing Claude Code alone against Claude Code augmented with Kerno Intelligence Tools (KIT). It was designed to answer one practical question: does adding Kerno produce a measurable difference in token consumption, latency, and output quality across representative developer tasks?

The benchmark covers three open-source TypeScript codebases and three open-source Python codebases, ranging from 14 to 500+ endpoints. In total, 43 prompts were analysed, spanning routine lookups that represent day-to-day use through to heavier analytical tasks that represent low-frequency, high-effort work.

Background

On public software-engineering benchmarks, Claude ranks among the most capable AI coding agents. Given a codebase it has not seen before, it can locate endpoints, trace dependencies, and reason about structure, but it does so by exploring: reading files, running grep patterns, executing shell commands. On small codebases that works reasonably well. On larger or more complex ones, the model spends a significant portion of each session building context it could have started with.

KIT provides that starting context. It pre-indexes a codebase into a structural graph: endpoints, call chains, symbol references, dependency relationships. When Claude needs to answer a question, it queries Kerno's graph rather than exploring the filesystem directly. The practical effect is fewer tool calls, lower token consumption, and in several cases measurably more complete answers.

Key findings

Claude augmented with KIT

88–99%

Reduction in input tokens across all six codebases and both languages.

13×

Fewer tool calls — baseline Claude used up to 13× more tool calls than augmented sessions.

$4,000

Upper-bound token savings per developer, annually, based on Claude's own analysis (see Section 8). KIT is free.

Richer answers

On blast-radius analysis, Kerno-augmented Claude surfaced an architectural risk the baseline response never reached — depth that token counts alone don't capture.

02 / METHODOLOGY

A manual, human-run benchmark.

Test design

This was a manual benchmark, not an automated test suite. Both conditions — Claude Code with Kerno ("Kerno-augmented") and Claude Code without Kerno ("baseline") — were run by human operators against identical prompts and codebases. Human judgment was used to interpret some outputs, particularly the accuracy comparisons in Section 6.

Disclosure
The benchmark was not blinded: the operator knew which condition they were running. Readers should weigh this accordingly.

Tooling & observability

For the Kerno-augmented condition, two observability layers ran simultaneously. Langfuse provided LLM-level observability: full traces, per-request token counts (input, output, and cached), tool-call sequences, and span-level timing. Bifrost provided API-level data: latency per request, raw request/response logs, and aggregate session summaries. The two layers were cross-referenced to produce the per-prompt metrics reported here.

For the baseline condition, Claude Code console logs were the primary source for token counts and tool-call sequences. Bifrost timestamps aligned the two datasets so latency comparisons reflect the same prompt under comparable network conditions. Prompt caching was enabled in both conditions and held constant — it is not a variable between the two states.
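As a rough illustration of the cross-referencing step, the sketch below joins per-prompt trace records (tokens, tool calls) with per-request latency records on a shared prompt ID. The record shapes, the field names (`prompt_id`, `input_tokens`, `tool_calls`, `latency_s`), and the sample values are assumptions made for illustration; they are not the actual Langfuse or Bifrost export formats.

```python
from dataclasses import dataclass


@dataclass
class PromptMetrics:
    prompt_id: str
    input_tokens: int
    tool_calls: int
    latency_s: float


def cross_reference(trace_rows, api_rows):
    """Join trace-level rows (tokens, tool calls) with API-level rows (latency).

    Both inputs are lists of dicts keyed by a hypothetical 'prompt_id'.
    Prompts missing from either source are skipped rather than guessed.
    """
    latency_by_prompt = {row["prompt_id"]: row["latency_s"] for row in api_rows}
    merged = []
    for row in trace_rows:
        pid = row["prompt_id"]
        if pid not in latency_by_prompt:
            continue
        merged.append(
            PromptMetrics(pid, row["input_tokens"], row["tool_calls"], latency_by_prompt[pid])
        )
    return merged


# Illustrative values only (not benchmark data).
traces = [
    {"prompt_id": "H1", "input_tokens": 500, "tool_calls": 2},
    {"prompt_id": "H2", "input_tokens": 5000, "tool_calls": 3},
]
api = [{"prompt_id": "H1", "latency_s": 7.0}]

metrics = cross_reference(traces, api)
```

Skipping unmatched prompts keeps the merged table honest: a row appears only when both sources observed the same prompt.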

Prompt set

Up to eight prompts were tested per codebase, grouped into two categories. Python codebases received a slightly adapted set: Express-specific queries were replaced with FastAPI equivalents, and the two TypeScript-only prompts (user-story generation and OpenAPI spec generation) were excluded from Python runs.

Prompt	Category	ID
Where does [framework] live in this codebase?	Lightweight	L1
Which [framework] symbols are used, and where?	Lightweight	L2
Show every file that calls [service]	Lightweight	L3
Scan workspace	Heavy	H1
List all endpoints	Heavy	H2
Show me the blast radius [endpoint]	Heavy	H3
Find all user stories for this project	Heavy	H4
Generate OpenAPI specifications^*	Heavy	H5

^* For H5 we are interested in how Claude first derives the API structure to feed OpenAPI tools.

Codebases

Codebase	Language	Size	Endpoints	Description
Example-TypeScript	TypeScript	Small	20	Example codebase created by Kerno
Docmost	TypeScript	Medium	80	Collaborative documentation platform
Laudspeaker	TypeScript	Large	200+	Customer messaging platform
Example-Python	Python	Small	14	Example codebase created by Kerno
Keep	Python	Medium	123	Note-taking and knowledge base
Apache Airflow	Python	Large	500+	Workflow orchestration platform

Consistency controls

The Kerno-augmented condition used Claude Sonnet 4.6 throughout.
The baseline condition used Claude Code's adaptive model setting — the default for most users, representing real-world baseline behaviour rather than a pinned single-model comparison.
Prompt caching was active in both conditions and held constant.

Claude began as a blank slate in both cases: no README context, no claude.md files, no custom memory or skills.
The token costs for calling Kerno tools were ignored, as they were <1K tokens and so were deemed not to have a material impact on token numbers.

03 / LIMITATIONS

The benchmark conditions used.

Language

Covers TypeScript and Python across six codebases. The two prompt sets differ slightly (Python omits user-story and OpenAPI prompts, and uses FastAPI instead of Express). Results for Go, Java, or Ruby have not been measured.

Prompt caching

Enabled in both conditions, so token-cost comparisons reflect caching on both sides — the relative difference carries meaning, not the absolute counts in isolation. Cache-write costs are excluded.

Model

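Kerno-augmented runs used Claude Sonnet 4.6. Baseline runs used Claude Code's adaptive model setting (see Consistency controls), so the comparison is not pinned to a single model on both sides.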

Claude context

No prior context in either condition: no READMEs, no claude.md, no injected memory, no custom tools beyond Kerno's. An intentionally conservative choice — it tests the default day-one experience. A developer with a tailored Claude configuration may see a different baseline.

04 / RESULTS: AGGREGATE

Consistent, large token savings and a nuanced latency picture.

Across all six codebases, two languages, and up to eight prompts per codebase, the token-consumption gap between Kerno-augmented and baseline Claude is consistent and large: between 88% and 99% fewer input tokens in every case. Tool-call reductions range from 61% to 92%.

The latency picture is more nuanced: Kerno is faster on most small and medium codebases (Example-PY, the smallest, is the exception), but slower on large ones when generating comprehensive outputs. Both patterns are visible below and explained in the per-codebase commentary.

Input-token reduction vs baseline

Percentage fewer input tokens, representative session per codebase. Higher is better.

Codebase	Language	Size · Endpoints	Reduction
Example-TS	TypeScript	Small · 20	97.0%
Docmost	TypeScript	Medium · 80	97.2%
Laudspeaker	TypeScript	Large · 200+	89.1%
Example-PY	Python	Small · 14	98.5%
Keep	Python	Medium · 123	98.9%
Apache Airflow	Python	Large · 500+	92.3%

Token savings are consistent across both languages. On Docmost (medium TypeScript), Kerno used 6,708 input tokens versus ~236,000 without, a 97.2% reduction. Keep (medium Python) shows the largest relative saving at 98.9%: 2,200 vs ~205,800. Apache Airflow (large Python) produces the largest absolute saving in the benchmark: 25,300 vs ~330,000.
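The percentages follow directly from the token counts quoted above. A short check using only those figures (the baseline counts are approximate in the source, so results are rounded to one decimal place):

```python
def reduction_pct(kerno_tokens: int, baseline_tokens: int) -> float:
    """Percentage fewer input tokens relative to baseline, rounded to 0.1."""
    return round((1 - kerno_tokens / baseline_tokens) * 100, 1)


# (Kerno-augmented, baseline) input tokens as quoted in this section.
sessions = {
    "Docmost": (6_708, 236_000),
    "Keep": (2_200, 205_800),
    "Apache Airflow": (25_300, 330_000),
}

for name, (kerno, baseline) in sessions.items():
    saved = baseline - kerno
    print(f"{name}: {reduction_pct(kerno, baseline)}% fewer, ~{saved:,} tokens saved")
```

Relative and absolute savings rank the codebases differently: Keep has the highest percentage, while Airflow saves the most tokens outright.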

Absolute input tokens: the same query, two conditions

Representative session. Lower is better.

Codebase	Kerno-augmented	Baseline
Docmost (Medium TS)	6,708	~236,000
Keep (Medium PY)	2,200	~205,800
Airflow (Large PY)	25,300	~330,000

Average latency per prompt, small & medium codebases

Seconds. Lower is better. Large codebases are a deliberate trade-off (see below).

Codebase	Kerno-augmented	Baseline
Example-TS (Small TS)	24.2s	41.5s
Docmost (Medium TS)	31.9s	94.5s
Example-PY (Small PY)	15.3s	12.7s
Keep (Medium PY)	20.8s	37.2s

The large-codebase trade-off
On large codebases (Laudspeaker, Airflow) Kerno is slower on average. The pattern is consistent: Kerno spends more time generating comprehensive structured output rather than returning partial results quickly. Whether that trade-off is favourable depends on whether completeness or raw speed is the priority for the task at hand.

Tool calls per session

Every call avoided is a file read, grep, or shell execution that no longer consumes context.

Codebase	Kerno-augmented	Baseline	Reduction
Docmost (Medium TS)	24	190	87% fewer
Keep (Medium PY)	5	71	92% fewer

The cost picture is consistent: Kerno sessions cost between $0.007 and $0.076, while equivalent baseline sessions cost $0.10 to $0.99. The most expensive Kerno session (Airflow, $0.076) costs less than the cheapest baseline session (Example-Python, $0.10).

Cost per session range (Claude Sonnet, $3/MTok input)

Kerno's entire range sits below the baseline's floor.

Condition	Cost per session
Kerno	$0.007 to $0.076
Baseline	$0.10 to $0.99
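The session costs can be reproduced from input-token counts at the $3 per million input tokens stated above. The sketch below assumes cost is driven by input tokens only (no output or cache-write charges) and uses the lowest and highest representative session counts reported in this document; the ~34K baseline floor is taken from the ROI table in Section 8.

```python
PRICE_PER_MTOK = 3.00  # USD per million input tokens, as stated above


def session_cost(input_tokens: int) -> float:
    """Input-token cost of one session in USD (input tokens only, by assumption)."""
    return input_tokens / 1_000_000 * PRICE_PER_MTOK


kerno_low, kerno_high = session_cost(2_200), session_cost(25_300)
base_low, base_high = session_cost(34_000), session_cost(330_000)

print(f"Kerno:    ${kerno_low:.3f} to ${kerno_high:.3f}")
print(f"Baseline: ${base_low:.2f} to ${base_high:.2f}")
```

Under these assumptions the most expensive Kerno session still costs less than the cheapest baseline session, which is the claim the range comparison makes.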

05 / RESULTS: PER PROMPT

Prompt by prompt, where the advantage shows up.

This section presents per-prompt results, first for TypeScript (all 8 prompts), then Python (6 prompts). TypeScript token values use Bifrost session totals where available; Python values are summed from individual prompt logs. Baseline values come from Claude Code console logs in both cases. Where a metric shows no clear advantage for either condition, it is reported as-is.

Lightweight queries · TypeScript

L1 — Where does [framework] live in this codebase?

Tests the ability to locate a framework or library within a codebase, a query a developer runs before changing dependencies.

Codebase	Kerno latency	Baseline tokens	Baseline latency	Δ Tokens
Small	16s	0	9s	N/A
Medium	1s	34,900	27s	~100%
Large	47s	0	22s	N/A

Where the framework exists (medium, 34.9K tokens for baseline), Kerno serves the answer directly in 1 second from its index. The 47s on the large codebase reflects deeper graph traversal on a more complex dependency tree — not token-reading overhead.

L2 — Which [framework] symbols are used, and where?

Codebase	Kerno latency	Baseline tokens	Baseline latency	Δ Tokens
Small	19s	32,000	20s	~100%
Medium	26s	1,000	13s	~100%
Large	23s	1,000	43s	~100%

On the small codebase, baseline Claude read 32K tokens to resolve symbol references; Kerno answered in 19s with zero file reads. On the large codebase, Kerno saved tokens and was 20 seconds faster.

L3 — Show every file that calls [service]

Codebase	Kerno latency	Baseline tokens	Baseline latency	Δ Tokens
Small	25s	0	13s	N/A
Medium	16s	1,000	12s	~100%
Large	26s	1,000	31s	~100%

Kerno was slightly slower on small/medium here, but the accuracy difference is significant — see Section 6, where Kerno-augmented Claude identified a circular dependency via forwardRef and annotated dead test code, while baseline produced a flat file list without role context.

Heavy queries · TypeScript

H1 — Scan workspace

Codebase	Kerno tokens	Kerno latency	Baseline tokens	Baseline tool calls	Baseline latency	Δ Tokens
Small	1,133	21s	~39,000	31	58s	-97.1%
Medium	512	7s	~39,000	22	52s	-98.7%
Large	8,200	24s	~55,000	26	44s	-85.1%

The workspace scan shows the clearest speed advantage: 7s–24s versus 44s–58s for baseline, a 2–7× improvement. Baseline invoked 22–31 tools to reconstruct what Kerno's index already contains.
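The 2–7× range is the ratio of baseline latency to Kerno latency for each row of the H1 table above. A quick check:

```python
# (Kerno latency, baseline latency) in seconds, from the H1 TypeScript table.
h1_latency = {"Small": (21, 58), "Medium": (7, 52), "Large": (24, 44)}

# Speedup factor = baseline time divided by Kerno time.
speedups = {size: round(base / kerno, 1) for size, (kerno, base) in h1_latency.items()}

for size, factor in speedups.items():
    print(f"{size}: {factor}x faster")
```

This gives roughly 2.8×, 7.4×, and 1.8× for the small, medium, and large codebases, which the text summarises as 2–7×.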

H2 — List all endpoints

Codebase	Kerno tokens	Kerno latency	Baseline tokens	Baseline tool calls	Baseline latency	Δ Tokens
Small	1,122	16s	~28,000	6	10s	-96.0%
Medium	5,089	105s	~113,000	44	91s	-95.5%
Large	8,120	171s	~89,000	26	55s	-90.9%

A quality-versus-speed trade-off. Kerno was slower on all three codebases, markedly so on the large one (171s vs 55s), but the output was substantially more comprehensive: a detailed, grouped endpoint inventory. When completeness matters (documentation, auditing, security review), the latency increase purchases a categorically better answer.

H3 — Blast radius for a specific endpoint

Codebase	Kerno latency	Baseline tokens	Baseline tool calls	Baseline latency	Δ Tokens
Small	22s	0	0	44s	N/A
Medium	23s	~49,000	34	105s	-100%
Large	61s	~27,000	11	50s	N/A

On the medium codebase, the single most dramatic result: a full blast-radius analysis in 23s with zero file reads, while baseline spent 105s and 49K tokens for a less detailed output. The large codebase shows a slight reversal (61s vs 50s) attributable to tracing 13 eager-loaded relations across 140 endpoints — a finding Kerno surfaced explicitly as a key architectural risk.

H4 — Find all user stories for this project

Codebase	Kerno tokens	Kerno latency	Baseline tokens	Baseline tool calls	Baseline latency	Δ Tokens
Small	3,120	60s	~31,000	14	48s	-89.9%
Medium	3,000	28s	~31,600	25	65s	-90.5%
Large	~776	50s	~62,000	44	104s	-98.7%

Baseline produced a negative result on Docmost — correctly concluding no formal user stories exist, but missing that the endpoint structure itself encodes a complete user-story set. Kerno-augmented Claude used 2 API calls, found 90 endpoints, and produced 40+ user stories across 12 domains in 28s. The clearest accuracy gap in the benchmark.

H5 — Generate OpenAPI specifications

Codebase	Kerno tokens	Kerno latency	Baseline tokens	Baseline latency	Δ Tokens · Latency
Small	0	5s	~10,000	130s	-100% tok · 96% faster
Medium	0	49s	~126,000	391s	-100% tok · 87% faster
Large	2,850	55s	~57,800	67s	-95.1% tok · 18% faster

A strong, consistent outperformance: on the small and medium codebases an 87–96% latency reduction, and even on the large codebase Kerno was faster (55s vs 67s) while cutting tokens 95%. On the medium codebase, baseline made 65 tool calls over 391 seconds. Claude's own assessment (Section 8) noted that Kerno's deep static analysis let it spot a spec error file-reading alone would have missed: a 403 response on POST /users/login that the spec documented only as 422.

Python codebase results

The Python set covers six prompts: H1, H2, L1 (FastAPI location), L2 (route organisation), L3 (Auth Service files), and H3. Apache Airflow did not include L2. H4 and H5 were not run on Python.

H1 — Scan workspace (Python)

Codebase	Kerno tokens	Kerno latency	Baseline tokens	Baseline tool calls	Baseline latency	Token saving
Example-PY Small	468	5s	31,400	17	32s	-98.5%
Keep Medium	2,200	26s	40,000	17	41s	-94.5%
Airflow Large	10,000	52s	190,000	35	120s	-94.7%

Consistent across Python. Airflow shows the largest absolute saving: 10,000 vs 190,000 tokens, with latency improving from 120s to 52s.

H2 — List all endpoints

Codebase	Kerno tokens	Kerno latency	Baseline tokens	Baseline tool calls	Baseline latency	Token saving
Example-PY Small	1,759	21s	1,000	1	11s	N/A (base faster)
Keep Medium	0	28s	107,000	47	91s	-100%
Airflow Large	15,300	180s	120,000	24	45s	-87.3%

The small Python codebase is an anomaly: with only 14 endpoints, direct file exploration is efficient enough that baseline wins. Airflow repeats the large-codebase pattern — Kerno takes longer (180s vs 45s) because it generates a substantially more complete inventory (15,300 output tokens).

L1, L2, L3 — Lightweight queries (Python)

Codebase / Prompt	Kerno latency	Baseline tokens	Baseline latency	Outcome
Example-PY · L1	10s	0	2s	Baseline 8s faster
Example-PY · L2	2s	0	5s	Kerno 3s faster
Example-PY · L3	35s	1,000	10s	100% token saving
Keep · L1	16s	1,000	10s	100% token saving
Keep · L2	17s	1,000	21s	100% saving · Kerno 4s faster
Keep · L3	16s	1,800	16s	100% saving · equal latency
Airflow · L1	90s	10,000	32s	100% saving · baseline 58s faster
Airflow · L3	42s	5,000	15s	100% saving · baseline 27s faster

Token savings are consistent (Kerno uses zero file-reading tokens in nearly all cases), but latency is more mixed than TypeScript. On Airflow (large), baseline is notably faster on lightweight queries — likely the complexity of graph traversal across 500+ endpoints, where the index query itself becomes non-trivial.

H3 — Blast radius (Python)

Codebase	Kerno latency	Baseline tokens	Baseline tool calls	Baseline latency	Token saving · Latency
Example-PY Small	19s	1,000	1	16s	-100% · base 3s faster
Keep Medium	22s	55,000	4	44s	-100% · 50% faster
Airflow Large	58s	44,000	5	44s	-100% · base 14s faster

The blast-radius result on Keep is the strongest Python result in the benchmark: zero file-reading tokens, answered in 22s versus 44s, avoiding 4 tool calls that consumed 55,000 tokens.

06 / RESULTS: ACCURACY

Beyond tokens: quality of output.

A response that returns in 5 seconds with 500 tokens but answers the wrong question costs more than a 60-second response that answers correctly. This section compares what each condition actually produced, for three prompts where accuracy differences were most pronounced.

Comparison 1 — Finding user stories | Docmost, 80 endpoints

claude — without kerno

find all user stories for this project
reading README.md …
scanning .github/ …
checking GitHub Issues …
grep "user story" src/** → 0 matches
reading docs/ …
✗ No explicit user stories found.
Conclusion: this project does not document formal user stories.

claude + kerno

find all user stories for this project
kerno_list_endpoints → 90 endpoints
deriving stories from endpoint shapes…
✓ 40+ user stories across 12 domains:
Authentication · Sessions · Spaces · Pages · Transclusion · Search · Notifications · …

VERDICT
Baseline's negative result was technically accurate — there are no explicit user-story documents — but it missed the insight that an endpoint map is a user-story specification. The question a developer actually wants answered, "what does this product do, from a user's perspective?", was answered substantively only by Kerno-augmented Claude.
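To make the idea of reading an endpoint map as a user-story specification concrete, here is a deliberately naive sketch that turns HTTP method and path pairs into story sentences. The verb mapping, the sample routes, and the `story_for` and `group_stories` helpers are all hypothetical; this is not how Kerno or Claude derive stories, only an illustration of why endpoint shapes carry that information.

```python
# Hypothetical mapping from HTTP method to a user-facing verb.
VERBS = {"GET": "view", "POST": "create", "PUT": "update", "PATCH": "update", "DELETE": "delete"}


def story_for(method: str, path: str) -> tuple[str, str]:
    """Return (domain, story), e.g. ('pages', 'As a user, I can create pages')."""
    # Drop path parameters such as ':id' or '{id}' and a leading 'api' prefix.
    parts = [p for p in path.strip("/").split("/") if p and p[0] not in ":{" and p != "api"]
    domain = parts[0] if parts else "root"
    resource = " ".join(parts) if parts else "root"
    verb = VERBS.get(method.upper(), "access")
    return domain, f"As a user, I can {verb} {resource}"


def group_stories(endpoints):
    """Group generated stories by their first path segment (the 'domain')."""
    grouped: dict[str, list[str]] = {}
    for method, path in endpoints:
        domain, story = story_for(method, path)
        grouped.setdefault(domain, []).append(story)
    return grouped


# Sample routes invented for illustration; not taken from Docmost.
sample = [("POST", "/api/pages"), ("GET", "/api/pages/:id"), ("DELETE", "/api/spaces/{id}")]
stories = group_stories(sample)
```

Even this crude mapping yields a domain-grouped list of capabilities, which is the kind of raw material a model can refine into proper user stories once it has the full endpoint list.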

Comparison 2 — Blast radius analysis | Laudspeaker, 174 endpoints

claude — without kerno

blast radius: GET /api/workspaces/channels
⚠ Kerno's index is still down
falling back to grep…
searched 3 patterns · read 1 file
→ basic call chain: controller → service
(auth layer / serialization not traced)

claude + kerno

blast radius: GET /api/workspaces/channels
✓ full request pipeline:
middleware → Passport guard → auth helper → serializer → controller → service
13 entities serialized · every file annotated
▸ ARCHITECTURAL RISK
"fat auth query" auth.helper.ts:122-131
loads 13 relations on every request → affects all 140 endpoints

VERDICT
The baseline analysis was correct for what it found but incomplete: it traced the immediate call chain yet missed the auth layer, the serialization tree, and the architectural risk. Kerno-augmented Claude produced — in 61 seconds — an output that would take an experienced developer 15–30 minutes to construct manually.
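One common way to compute a blast radius over a dependency graph is a breadth-first walk along reverse edges: start from the changed symbol and collect everything that depends on it, directly or transitively. The sketch below assumes a simple adjacency map; the node names are hypothetical, and the source does not describe Kerno's actual traversal.

```python
from collections import deque


def blast_radius(depends_on: dict[str, set[str]], changed: str) -> set[str]:
    """Return every node that directly or transitively depends on `changed`."""
    # Invert the graph: for each dependency, record who depends on it.
    dependents: dict[str, set[str]] = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)

    affected: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected and parent != changed:
                affected.add(parent)
                queue.append(parent)
    return affected


# Hypothetical graph of "X depends on Y" edges.
graph = {
    "GET /channels": {"channels.controller"},
    "channels.controller": {"channels.service", "auth.guard"},
    "GET /accounts": {"accounts.controller"},
    "accounts.controller": {"auth.guard"},
    "auth.guard": {"auth.helper"},
}

affected = blast_radius(graph, "auth.helper")
```

A shared helper near the bottom of the graph, like the auth helper in the comparison above, reaches every endpoint routed through it, which is why a change there has a far larger radius than a change to a single service.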

Comparison 3 — Files calling Auth Service | Laudspeaker, 174 endpoints

claude — without kerno

every file that calls Auth Service
searched 3 patterns
→ 5 production files:
accounts.service.ts
customers.service.ts …
(flat list · brief descriptions)

claude + kerno

every file that calls Auth Service
✓ same 5 production files, plus:
per-method call sites + line numbers
▸ circular dependency: accounts.service.ts injects AuthService via forwardRef
▸ dead code: auth.service.spec.ts:156-209 (commented-out tests, never removed)
→ offers call_hierarchy / find_references

VERDICT
Both responses identified the same five production files — the factual outcome was equivalent. The difference is depth and actionability. For a developer performing impact analysis before a change, the Kerno response is directly actionable; the baseline requires additional follow-up queries.
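A circular dependency like the one surfaced here is a cycle in the injection or import graph. A minimal depth-first detection sketch follows; the graph and its node names are illustrative assumptions, not data extracted from Laudspeaker.

```python
from typing import Optional


def find_cycle(graph: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one dependency cycle as a list of nodes, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored
    state = {node: WHITE for node in graph}
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        state[node] = GREY
        path.append(node)
        for dep in graph.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # Back edge: the cycle runs from `dep` along the path and back to `dep`.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in graph:
        if state[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical injection graph with a two-service cycle.
deps = {
    "AccountsService": ["AuthService"],
    "AuthService": ["AccountsService"],
    "CustomersService": ["AuthService"],
}
cycle = find_cycle(deps)
```

A flat list of files cannot reveal this kind of relationship; it only becomes visible once the edges between files are available to traverse.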

THE UNDERLYING MECHANISM

Kerno is not just saving tokens by reducing what Claude reads — it is changing which tokens Claude sees. Given a structured endpoint map instead of raw file contents, Claude reasons about architecture rather than text. Given a dependency graph rather than a flat file tree, it identifies risk rather than listing files. The token savings are the visible metric; the accuracy improvement is the practical value.

07 / ABOUT KIT

A code-intelligence layer, delivered over MCP.

KIT replaces broad, expensive searches with precise, indexed lookups. Instead of reading entire files, agents jump directly to symbol definitions, find usages across a codebase, trace call chains, and list endpoints without touching unnecessary code.

KIT started as internal tooling — a lightweight code index we built to power Kerno's testing engine, not a standalone product. But as token costs climb and AI agents spend more time (and money) reading files they don't need, we felt it was worth sharing.

It's a stripped-down fork of SCIP with additional engineering to make it lightweight and fast. We have not designed it to replace deep indexing tools — others do that better. What it does is give your agent a smaller, faster path to what it actually needs: jump to a definition, find usages, trace a call chain, list endpoints without touching unnecessary code. Less noise, fewer tokens, faster results.
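The difference between a broad search and an indexed lookup can be sketched with a toy symbol index. Everything below (the `SymbolIndex` class, its methods, and the sample entries) is hypothetical and far simpler than a SCIP-style index; it only illustrates why a prebuilt map can answer "where is this defined?" and "who uses it?" without reading files.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Location:
    file: str
    line: int


@dataclass
class SymbolIndex:
    """Toy index: built once, then queried with dictionary lookups."""
    definitions: dict = field(default_factory=dict)
    references: dict = field(default_factory=lambda: defaultdict(list))

    def add_definition(self, symbol: str, file: str, line: int) -> None:
        self.definitions[symbol] = Location(file, line)

    def add_reference(self, symbol: str, file: str, line: int) -> None:
        self.references[symbol].append(Location(file, line))

    def go_to_definition(self, symbol: str):
        """Return the definition site, or None if the symbol is unknown."""
        return self.definitions.get(symbol)

    def find_usages(self, symbol: str):
        """Return all recorded reference sites (empty list if none)."""
        return list(self.references.get(symbol, []))


# Sample entries invented for illustration.
index = SymbolIndex()
index.add_definition("AuthService.login", "src/auth/auth.service.ts", 42)
index.add_reference("AuthService.login", "src/accounts/accounts.service.ts", 17)
index.add_reference("AuthService.login", "src/auth/auth.controller.ts", 30)

definition = index.go_to_definition("AuthService.login")
usages = index.find_usages("AuthService.login")
```

Each query is a constant-time dictionary lookup, so its cost does not grow with the number of files the agent would otherwise have to grep and read.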

Support for Ruby, Java, and Go is planned for Q3 2026.

Get started in less than a minute

STEP 1

Install the Kerno CLI

Download the Kerno agent to your machine.

npm i @kerno/cli

STEP 2

Initialize your project

Install the Kerno agent inside your workspace.

kerno -w /workspace

STEP 3

Configure the Kerno MCP

Add the Kerno MCP to your AI coding agent.

08 / RETURN ON INVESTMENT (ROI) SUMMARY

Return on Investment (ROI) Summary.

Using the benchmark's raw performance data, reduced token costs alone yield per-developer savings of $220–1,300 per year, rising to $500–4,000 once compound effects (smaller contexts producing better cache hits) are factored in. At 60 sessions per developer per week, that works out to $0.08–0.44 saved per session.

Scaled up, a 10-person team saves roughly $5,000–40,000 annually, while a 1,000-engineer organization saves a conservative $249,500–1,372,000 per year, plausibly reaching $4,000,000 at higher usage intensity once recovered developer time is included. The case is straightforward: token savings alone likely cover the cost of the tooling, before counting the time recovered by resolving queries in around 2 calls instead of 20 or more.
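The per-session and organization-level figures are linked by simple multiplication. The sketch below assumes 52 working weeks per year, which the source does not state; under that assumption, 60 sessions per week at $0.08 to $0.44 saved per session lands within a fraction of a percent of the quoted 1,000-engineer range.

```python
SESSIONS_PER_WEEK = 60             # from the text
WEEKS_PER_YEAR = 52                # assumption: not stated in the source
SAVING_PER_SESSION = (0.08, 0.44)  # USD, low and high, from the text


def annual_saving(per_session: float, developers: int = 1) -> float:
    """Annual token-cost saving in USD for a team of `developers`."""
    return per_session * SESSIONS_PER_WEEK * WEEKS_PER_YEAR * developers


low, high = (annual_saving(s) for s in SAVING_PER_SESSION)
org_low, org_high = (annual_saving(s, developers=1_000) for s in SAVING_PER_SESSION)

print(f"Per developer:   ${low:,.0f} to ${high:,.0f} per year")
print(f"1,000 engineers: ${org_low:,.0f} to ${org_high:,.0f} per year")
```

The 10-person estimate follows the same pattern, applied to the compound-effect range instead: 10 × $500 to 10 × $4,000 gives the quoted $5,000–40,000.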

ROI Summary

Metric	Without Kerno	With Kerno	Improvement
Input tokens (all 6 codebases, avg per session)	~34K to ~330K	2,200 to 25,300	88% to 99% reduction
Latency (small/medium codebases, avg per prompt)	~13 to 94 seconds	~15 to 32 seconds	Up to 66% faster
Latency (large codebases, avg per prompt)	~51 to 52 seconds	~57 to 84 seconds	Trade-off: richer output
Tool calls (total per session)	20 to 190 calls	5 to 48 calls	61% to 92% fewer
Est. cost per session (Claude Sonnet, $3/MTok)	$0.10 to $0.99	$0.007 to $0.076	>90% cost saving

Comparing Claude's grep capabilities versus Claude augmented with Kerno intelligence tooling

Give your agent a faster path to what it needs.