“Okay, it looks convincing… but does it actually work? And does it make a difference for us?”
This is the kind of conversation we imagine is happening at every company starting to adopt AI agents. Measuring how well an LLM or agent performs is a difficult and evolving challenge. It becomes even more complex when we try to measure whether a specific tool, built to support agents, is actually useful in practice.
At Kerno, we take this very seriously. We want to ensure that the tools we’re building for agents don’t just work, but matter: that they reduce cost, increase speed, and deliver meaningful assistance in real-world engineering workflows.
In this post, we’ll outline our approach to benchmarking tools in the Kerno MCP (Model Context Protocol), and walk through an example task that illustrates our methodology.
The Challenge of Benchmarking Agent Tools
There is a growing body of work on benchmarking LLMs, especially for AI code copilots. These benchmarks often rely on standardized tasks like LeetCode questions, which are well-defined and come with known-good answers. Models are run on these tasks, possibly with contextual code or repo input, and their responses are evaluated either by running the code (compilation and test results) or by comparing outputs to a reference answer, using human judgment or an LLM-as-judge approach.
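As a rough illustration of that last approach, an LLM-as-judge setup usually boils down to building a rubric prompt around the task, a reference answer, and the candidate output. The wording and scoring scale below are our own sketch, not any specific benchmark’s protocol:

```kotlin
// Sketch of an LLM-as-judge rubric prompt. The rubric wording and 1-5 scale are
// illustrative; real benchmarks define their own grading protocols.
fun judgePrompt(task: String, reference: String, candidate: String): String = """
    You are grading a coding solution.

    Task:
    $task

    Reference solution:
    $reference

    Candidate solution:
    $candidate

    Score the candidate from 1 (fails the task) to 5 (functionally equivalent to the
    reference), then justify the score in one short paragraph.
""".trimIndent()
```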
These are useful methods for benchmarking foundation models. But for evaluating tools that assist agents, especially tools designed for real software engineering tasks, these aren’t enough.
We need benchmarks that reflect actual developer workflows.
Our Approach: Engineering Tasks, Not Puzzle Problems
The tools we build into Kerno MCP are meant to give production context to AI agents. That means our benchmarks need to go beyond toy problems. We designed a set of realistic micro-service engineering tasks based on common work developers do when building, modifying, or extending backend systems.
To evaluate these tasks, we compare how well an agent performs with and without access to Kerno MCP. We focus on:
- Whether the task is successfully completed, judged by human evaluators;
- The workflow the agent follows to reach a solution;
- Token usage, which has implications for both latency and cost (a per-run record like the sketch below captures all three).
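A minimal sketch of such a per-run record; the field names are illustrative, not Kerno’s internal schema:

```kotlin
// Illustrative per-run record; field names are hypothetical, not Kerno's internal schema.
data class BenchmarkRun(
    val taskId: String,
    val mcpEnabled: Boolean,           // with or without access to Kerno MCP tools
    val taskCompleted: Boolean,        // judged by a human evaluator
    val toolInvocations: List<String>, // ordered tool calls, for workflow analysis
    val inputTokens: Long,
    val outputTokens: Long
) {
    val totalTokens: Long get() = inputTokens + outputTokens
}

// Comparing a with-MCP run against its baseline yields the headline efficiency number.
fun tokenSavings(baseline: BenchmarkRun, withMcp: BenchmarkRun): Double =
    1.0 - withMcp.totalTokens.toDouble() / baseline.totalTokens
```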
Token use is especially important for agent workflows. Unlike single-shot LLM queries, agents often make multiple iterations, invoke tools repeatedly, and parse large contexts. This can be very expensive if they don’t quickly converge on a good path.
Providing relevant, structured context early can significantly reduce that overhead, which is what Kerno MCP is designed to do.
Why Token Efficiency Matters (More Than You Think)
LLMs are cheap today, but there are no guarantees this will last forever.
Token usage maps directly to cost and latency. With foundation models continuing to grow in size, and as demand increases, token pricing could eventually become coupled to energy prices, especially as inference becomes more compute-heavy.
If your AI agent needs to crawl a 10,000-line repo just to find out how services connect, that’s not sustainable. Giving it an upfront, structured view of the system, like a dynamic service map, can save both time and money.
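To make the arithmetic concrete, here is a back-of-the-envelope cost comparison. The token counts and per-million-token prices are placeholders, not measurements or current vendor pricing:

```kotlin
// Back-of-the-envelope cost model. Prices and token counts below are placeholders,
// not measurements or current vendor pricing.
fun estimateCostUsd(
    inputTokens: Long,
    outputTokens: Long,
    inputPricePerMTok: Double = 3.0,
    outputPricePerMTok: Double = 15.0
): Double =
    inputTokens / 1_000_000.0 * inputPricePerMTok +
    outputTokens / 1_000_000.0 * outputPricePerMTok

fun main() {
    // Hypothetical: crawling a large repo vs. reading a compact, structured service map.
    println("repo crawl:  USD " + "%.2f".format(estimateCostUsd(250_000L, 8_000L)))
    println("service map: USD " + "%.2f".format(estimateCostUsd(40_000L, 8_000L)))
}
```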
An Example Benchmark: Cursor + Kerno MCP
To illustrate our approach, we tested a coding agent inside Cursor, using Claude 3 Sonnet in MAX mode (to get detailed token accounting). We gave it access to an example project, demo-corp, a real micro-service e-commerce system we use for internal testing and demos (see the overview image below). It includes synthetic traffic and a variety of services across multiple languages.

The Task
We removed the checkout service from the repo, then asked the agent to:
"Create a new endpoint for the checkout service, using Kotlin. Use the Kerno MCP tools to inspect the service map to guide the implementation."
We provided:
- Functional specifications for the endpoint.
- An instruction to use Kotlin, since demo-corp is polyglot.
- In the with-MCP run, we also provided access to the MCP service map, a JSON file dynamically generated by Kerno Core. It includes (modeled roughly in the sketch below):
  - Service definitions
  - API endpoints and request/response schemas
  - Known inter-service dependencies
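As a rough idea of the shape of that context, the map can be modeled with data classes like these; the field names are our sketch, and the actual schema generated by Kerno Core may differ:

```kotlin
// Rough sketch of a service-map shape. Field names are illustrative; the actual
// JSON schema generated by Kerno Core may differ.
data class ServiceMap(
    val services: List<Service>
)

data class Service(
    val name: String,                 // e.g. "checkout"
    val language: String,             // e.g. "kotlin"
    val endpoints: List<Endpoint>,
    val dependsOn: List<String>       // downstream services this one calls
)

data class Endpoint(
    val method: String,               // e.g. "POST"
    val path: String,                 // e.g. "/checkout"
    val requestSchema: String,        // schema reference or inline definition
    val responseSchema: String
)
```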
Two Experiments: With vs. Without Context

Result: Both runs produced valid, functional code that met the spec. But the run with access to Kerno MCP was significantly more efficient: 35% fewer tokens, less iteration, and faster convergence.
This shows the agent benefits not just from a better prompt, but from structured production context that lets it avoid costly reasoning and repo analysis.
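We aren’t publishing the agent’s output here, but to give a sense of the shape of the deliverable, a minimal checkout endpoint in Kotlin might look roughly like this. This sketch assumes a Ktor-style service with kotlinx.serialization, which may differ from demo-corp’s actual stack and from what the agent produced:

```kotlin
import io.ktor.serialization.kotlinx.json.*
import io.ktor.server.application.*
import io.ktor.server.engine.*
import io.ktor.server.netty.*
import io.ktor.server.plugins.contentnegotiation.*
import io.ktor.server.request.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import kotlinx.serialization.Serializable

// Hypothetical request/response shapes; the real spec lives in the benchmark task.
@Serializable
data class CheckoutRequest(val cartId: String, val paymentMethod: String)

@Serializable
data class CheckoutResponse(val orderId: String, val status: String)

fun main() {
    embeddedServer(Netty, port = 8080) {
        install(ContentNegotiation) { json() }
        routing {
            post("/checkout") {
                val req = call.receive<CheckoutRequest>()
                // In the real service this would call the downstream services
                // discovered via the service map (inventory, payment, etc.).
                call.respond(CheckoutResponse(orderId = "order-${req.cartId}", status = "CONFIRMED"))
            }
        }
    }.start(wait = true)
}
```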
Evaluation: Human + Structural
Each task is reviewed by human evaluators who assess:
- Whether the code meets functional and structural requirements;
- Whether it follows language idioms and best practices;
- How closely the agent adhered to the prompt’s instructions.
We also capture workflow diagnostics, such as which tools were invoked and when, as well as detailed token logs.
This lets us trace exactly how and why an agent reached a particular result, and what role the supporting tools played.
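A sketch of what those diagnostics can look like, with made-up tool names and fields:

```kotlin
import java.time.Instant

// Illustrative workflow-trace record; tool names and fields are made up for this sketch.
data class ToolCall(
    val tool: String,        // e.g. "kerno.get_service_map"
    val startedAt: Instant,
    val inputTokens: Long,
    val outputTokens: Long
)

// Aggregate token usage per tool; call order is preserved in the raw trace list.
fun tokensByTool(trace: List<ToolCall>): Map<String, Long> =
    trace.groupBy { it.tool }
        .mapValues { (_, calls) -> calls.sumOf { it.inputTokens + it.outputTokens } }
```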
What's Next
As we continue to build out Kerno MCP, our benchmarking will evolve in parallel. Upcoming work includes:
- Expanding the task suite to include infrastructure work, observability integration, CI/CD edits, and IaC refactors.
- Incorporating LLM-as-judge for broader automated evaluation at scale.
- Exploring energy-adjusted metrics that align token usage with compute sustainability goals.
We're committed to making AI agents more useful, more trustworthy, and more efficient in production settings. That means not just building smart tools, but proving they work.
Closing Thoughts
AI coding agents are already powerful. But to make them sustainable, scalable, and trustworthy, we need to give them better inputs, not just more compute.
Kerno MCP aims to close the gap between what the agent knows and what it needs to know to reason efficiently in real software systems. Benchmarks like the one described here help us hold ourselves accountable, and help you evaluate where AI agent tooling can deliver measurable value in your own engineering org.