Runtime evaluation of coding agents

February 10, 2026

Introduction

In this post we will discuss the two most robust evaluations for LLM output, particularly for agents that deal with code or code-related tasks: exact string matching and evaluation by execution.

Exact string matching

Exact string matching is the evaluation that scientists and engineers would all love to be able to do on model output. For a given input, you evaluate whether the model or agent generates the exact expected output. You apply a simple logical equality statement 

ActualOutput == ExpectedOutput

which will give you a logical 0 or 1. The result has no ambiguity. 

The canonical example in NLP (natural language processing, the field that LLMs emerged from) would be the following:

Input: “What is the capital of France?”

Expected output: “Paris”

For a coding task, an example would be:

Input: “What Python version does this project use?”

Expected output: “3.11.0”.

When LLM output is verbose, padded into full, idiomatic sentences rather than single- or multi-word answers, we can instead search for the expected output as a required phrase in the generated output, using a regex or the “in” operator in Python, for example. For the first case above, this might look like the following:

Generated output: “The capital of France is Paris”.

The above would fail the logical equals test, but pass the regex test.
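
As a minimal sketch of these two checks (the helper names and example strings here are purely illustrative, not from any particular library), the logic might look like this in Python:

import re

def exact_match(actual: str, expected: str) -> bool:
    # Strict equality: passes only if the two strings are identical
    # (whitespace-trimmed here to be slightly forgiving).
    return actual.strip() == expected.strip()

def contains_match(actual: str, expected: str) -> bool:
    # Looser check: passes if the expected answer appears anywhere in the
    # (possibly verbose) generated output. re.escape stops characters like
    # the dots in "3.11.0" being treated as regex metacharacters.
    return re.search(re.escape(expected), actual) is not None

generated = "The capital of France is Paris"
print(exact_match(generated, "Paris"))     # False
print(contains_match(generated, "Paris"))  # True

The substring/regex check trades a little rigour for robustness to verbose phrasing, while the result is still an unambiguous pass or fail.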

Unfortunately, the lack of ambiguity also applies to the task. As is probably clear, this evaluator can only be used on tasks where the builder can create a dataset of exact input/expected-output pairs and the agent can be expected to return the exact answer. Such tasks are common in recall and search settings.

Evaluation by Execution

As a form of text or data, code has the beautiful property that it can be executed to produce an output that can then be logically evaluated. If your agent or model produces output that can be framed as something executable with a specific, known behaviour, then execution can be a very powerful way to evaluate it. While we typically view tests as logical requirements, they also serve as a robust evaluation framework for coding agents. Integration tests, in particular, provide the runtime clarity necessary to verify intended behaviour - a core focus of our work at Kerno.
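
As a minimal, self-contained sketch of the idea (the function name add and the test cases are invented for this illustration), one might execute generated code in a scratch namespace and check it against known cases:

# generated_code stands in for text produced by the model.
generated_code = '''
def add(a, b):
    return a + b
'''

test_cases = [((2, 3), 5), ((-1, 1), 0)]  # (arguments, expected result)

def passes_tests(code: str) -> bool:
    namespace = {}
    try:
        # In practice you would run untrusted code in a sandbox or container.
        exec(code, namespace)
        fn = namespace["add"]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # any crash or wrong answer counts as a failure

print(passes_tests(generated_code))  # True

The result is again a clean pass/fail signal, but now it covers behaviour rather than surface text.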

Benchmarking and Performance

In the literature, this is a very popular way of building benchmarks for LLMs fine-tuned for coding tasks. Benchmarks built from LeetCode-style examples have been used for many different coding tasks.

These have exactly that nice set of features for evaluations: output that can be executed, and a known, expected behaviour on execution.

Implementation in Agentic Systems

In a larger agentic system, we find it useful to write appropriate flags to system logs to record whether the code generated by an LLM executes properly. For example, when our agent creates a docker-compose.yaml file and attempts to spin up services to run tests, it is straightforward to log whether or not all of the services are running.
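
A rough sketch of what that check could look like (not our production code; it assumes a Compose v2 CLI that supports these subcommands and flags, and the function and logger names are illustrative):

import logging
import subprocess

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.eval")

def services_running(compose_file: str = "docker-compose.yaml") -> bool:
    base = ["docker", "compose", "-f", compose_file]
    # Start the services defined in the generated compose file.
    subprocess.run(base + ["up", "-d"], check=True)
    # Services declared in the file vs. services actually running.
    declared = subprocess.run(
        base + ["config", "--services"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    running = subprocess.run(
        base + ["ps", "--services", "--filter", "status=running"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    ok = set(declared) <= set(running)
    # The log line is the evaluation signal for this step.
    log.info("compose services running: %s (%d/%d up)", ok, len(running), len(declared))
    return ok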

The Effectiveness of Evaluation by Execution

A comment made here by Prof. Narayanan of Princeton (of AI Snake Oil, AI as Normal Technology, and general hype-debunking fame) really emphasises just how good execution is as an evaluation.

The Objective Source of Truth

He argues that this is why coding agents are better than other agents. Execution makes them fundamentally different because it gives them an objective source of truth, or success criterion, for newly generated content (i.e. code). This, he argues, is why coding agents are so much more successful in multi-step processes than other agents: that truth can be checked at many stages along the way.
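
As a schematic sketch of that “check at every stage” loop (assuming pytest is the project's test runner; apply_change and revert_change are hypothetical caller-supplied callables standing in for whatever edit the agent proposes):

import subprocess

def tests_pass() -> bool:
    # The project's test suite acts as the objective pass/fail signal.
    result = subprocess.run(["pytest", "-q"], capture_output=True)
    return result.returncode == 0

def accept_step(apply_change, revert_change) -> bool:
    # Make one candidate edit, keep it only if the objective check still holds.
    apply_change()
    if tests_pass():
        return True        # keep the change and move to the next step
    revert_change()
    return False           # discard the step; the agent can try again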

Industry Impact and Maturity

This goes some way towards explaining why coding agents have been among the first applications of the LLM era to mature into something genuinely reliable, and why the field has been so impressed by them. It should also give us pause when thinking about other problems or fields in which we might consider using LLM-based agents.

  • Software professionals are much more impressed with the performance of agents than other white-collar workers are.
  • That may well continue to be the case for quite some time.
  • We would likely only see such impressive performance on tasks where such logical evaluation can occur mid-process. Multi-step conversation, for example, is not one of these.

Coding Agents as Neurosymbolic AI

Narayanan goes on to argue that this ability to logically evaluate intermediate outputs makes AI coding agents a form of neurosymbolic AI, distinct from other types of generative AI. Neurosymbolic AI is an old idea, and its symbolic half was the basis of one of the previous AI booms in the 80s. The core idea is to combine the two approaches of statistical (model) and logical (code) systems in a single pipeline.

A Practical Example

A simple example would be an image recognition model. Such a model might use a multi-step process to identify an image of a book on a table:

  1. Neural Network: Isolates and categorises the book and the table separately.
  2. Logical Model: Evaluates their relative positions as $y(book) > y(table)$ and $x(book) \approx x(table)$, so the book is “on” the table (a toy sketch of this step follows below).
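
Here is a toy, self-contained sketch of the symbolic step (the object centres are hypothetical values standing in for the output of a neural detector, with y increasing upwards to match the rule above; the function name is invented):

detections = {
    "book":  {"x": 0.51, "y": 0.70},
    "table": {"x": 0.50, "y": 0.40},
}

def is_on(top: str, bottom: str, objects: dict, x_tol: float = 0.1) -> bool:
    a, b = objects[top], objects[bottom]
    higher = a["y"] > b["y"]                 # y(book) > y(table)
    aligned = abs(a["x"] - b["x"]) <= x_tol  # x(book) approximately x(table)
    return higher and aligned

print(is_on("book", "table", detections))  # True: the book is "on" the table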

Neurosymbolic AI has been an increasingly popular topic at AI conferences over the past several years as a new direction for the field after scaling. It would seem that we already have plenty of examples of new approaches to it, and applications of it, hiding in plain sight.

Closing

The intersection of statistical generation and logical execution has quietly moved coding agents into a category of their own. By treating code not just as text, but as a verifiable hypothesis, these systems have become the closest implementation of neurosymbolic AI at scale. As the industry shifts its focus from pure scaling to architectural reliability, we are left with a compelling question: can we find "objective sources of truth" to build similar logical feedback loops for non-coding tasks, or will the neurosymbolic advantage remain exclusive to the world of software engineering?
