Teaching Agents to Write and Self-Correct End-to-End Test Scenarios

November 27, 2025

Introduction

Over the past few weeks, I’ve been experimenting with a new flow for generating test scenarios using agents — not just generating the text of scenarios, but having the agent iteratively write, run, diagnose, and repair them. What I demoed today is the first time I’ve seen the whole loop running cleanly end-to-end, and honestly, it felt a bit magical.

Let me walk you through what happened.

Starting With Almost Nothing

Each agent began with just a single scenario prompt, literally a couple of lines of English describing the behavior I wanted. No steps. No assertions. No IDs. Beyond that prompt, all the agent had was:

  • The ability to read and write one file (its scenario)
  • The ability to check Docker Compose state
  • The ability to dry-run or actually run the scenario
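
That whole toolbox fits in a handful of methods. Here's a minimal sketch of how I picture the interface; the names are illustrative, not the actual tool names:

from typing import Protocol

class ScenarioTools(Protocol):
    # The complete set of capabilities each agent starts with.
    def read_scenario(self) -> str: ...                 # read its one scenario file
    def write_scenario(self, text: str) -> None: ...    # overwrite that file
    def compose_status(self) -> dict: ...               # health of the Docker Compose services
    def run_scenario(self, dry_run: bool) -> dict: ...  # validate only, or run for real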

And then I launched three of these in parallel.

The first thing the agent did was read its scenario file. Seeing nothing meaningful there, it checked Docker Compose to understand the environment. Everything came back healthy. So with no prior structure to rely on, it made a bold move: it wrote the entire scenario from scratch.

Then came the dry run — which doesn’t hit Docker Compose, but does validate the scenario structure. It passed. So the agent went ahead with a real run.
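
Stitched together, the loop looks roughly like this. It's a sketch in Python, where tools is the toolbox sketched above, and draft and repair stand in for the model calls that write and fix the scenario text; none of these names or result keys come from the real implementation:

MAX_ATTEMPTS = 5  # arbitrary cap for the sketch

def author_scenario(tools, prompt, draft, repair):
    current = tools.read_scenario()                       # step 1: see what's already there
    if not tools.compose_status().get("healthy", False):  # step 2: is the environment up?
        raise RuntimeError("Docker Compose environment is not ready")

    scenario = draft(prompt, current)                     # step 3: write the scenario from scratch

    for _ in range(MAX_ATTEMPTS):
        tools.write_scenario(scenario)

        structure = tools.run_scenario(dry_run=True)      # validates structure, no Docker traffic
        if not structure["ok"]:
            scenario = repair(scenario, structure)
            continue

        result = tools.run_scenario(dry_run=False)        # the real run
        if result["passed"]:
            return scenario
        scenario = repair(scenario, result)               # diagnose, fix, rerun

    raise RuntimeError("did not converge on a passing scenario")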

Watching It Learn From Failure

The scenario started well: setup passed, the GET request passed… and then we hit the first failure.

The response's Content-Type header came back as application/json, not the expected application/json; charset=utf-8.

What did the agent do?

It rewrote the scenario. Specifically, it relaxed the expected value in the failing assertion to match what the service actually returns, then ran the whole thing again.
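
If the assertion were written as plain data, that one-line repair would look roughly like this (an illustrative format, not the real scenario schema):

# First draft: the agent guessed the charset parameter would be present.
expected_headers = {"Content-Type": "application/json; charset=utf-8"}

# Repaired draft: expect exactly what the service actually returns.
expected_headers = {"Content-Type": "application/json"}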

The next failure was a duplicate-key error, which shouldn't happen, but it makes a great test of how the agent responds. Instead of flailing around, it took a smarter path: it added a cleanup step to the scenario's database setup and retried.

This is the same diagnose–fix–rerun loop we’ve all done manually a hundred times, but now it’s fully encapsulated in the agent itself — no bouncing between frontend and backend, no hand-holding.
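
For illustration, the added setup step could be as small as a couple of DELETE statements that run before the inserts, so a rerun never collides with rows left over from the previous one. This is my reconstruction, reusing the table and column names from the SQL shown later in this post, not the agent's actual output:

# Hypothetical pre-clean statements prepended to the scenario's setup phase.
PRE_CLEAN_SQL = [
    "DELETE FROM articles WHERE author_id IN"
    " (SELECT id FROM users WHERE email = 'author@example.com')",
    "DELETE FROM users WHERE email = 'author@example.com'",
]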

And Then It Started to Impress Me

Here’s the moment where I actually laughed out loud.

Remember how in earlier versions we were hard-coding database IDs because the agent didn’t understand how to reference inserted rows?

Well, this time, the agent generated a SQL statement like:

INSERT INTO articles (title, content, author_id)
SELECT
   'Example title',
   'Example content',
   id
FROM users
WHERE email = 'author@example.com';

It selected the correct author ID from the user it had just inserted.

That’s not just avoiding a bug — that’s genuinely smarter than what I originally did in hand-written scenarios.

Handling Nondeterministic Fields Gracefully

Another painful area: nondeterministic values like timestamps or generated IDs. Historically, the agent would assert on exact values and fail for obvious reasons.

Now? It simply omits unstable fields from the expected body and treats the assertion as a subset check, not an equality check.

If a value turns out to be nondeterministic, it strips it out and reruns automatically.

The result is a scenario that asserts precisely what matters — nothing more.
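
That subset check is simple to picture. Here's a minimal sketch of the idea in Python, as my own illustration rather than the product's actual assertion engine:

def matches_subset(expected, actual):
    # True if every field in `expected` appears in `actual` with the same
    # value; extra fields in `actual` (ids, timestamps, ...) are ignored.
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and matches_subset(value, actual[key])
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        return (isinstance(actual, list)
                and len(expected) == len(actual)
                and all(matches_subset(e, a) for e, a in zip(expected, actual)))
    return expected == actual

# Dropping "id" and "created_at" from the expected body means this passes
# no matter what the database generates:
assert matches_subset(
    {"title": "Example title", "content": "Example content"},
    {"id": 42, "title": "Example title", "content": "Example content",
     "created_at": "2025-11-27T10:15:00Z"},
)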

Parallel Execution, Smart Locking

Even though I ran three agents at once, everything stayed efficient:

  • Reading, writing, and dry runs all happen in parallel.
  • Only real scenario execution is serialized — and those take milliseconds.
  • So the end-to-end flow for 10 scenarios takes about as long as for one: roughly a minute to a minute and a half.

Given how much is happening under the hood, that’s shockingly fast.
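
Under the hood, that concurrency model only needs a single shared lock around the one step that touches the live environment. A rough asyncio sketch of the idea, with runner standing in for the real validation and execution backends:

import asyncio

run_lock = asyncio.Lock()   # shared by every agent in the process

async def execute(scenario: dict, dry_run: bool, runner) -> dict:
    if dry_run:
        # Dry runs only validate scenario structure and never touch Docker
        # Compose, so any number of agents can do this concurrently.
        return await runner.validate(scenario)
    # Real runs hit the shared environment, so they queue behind one lock;
    # each run finishes in milliseconds, so the wait is negligible.
    async with run_lock:
        return await runner.run(scenario)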

Not Quite Released… But Close

This flow for defining scenarios — starting from natural language and letting the agent build and refine them — isn’t available yet. The UI work for defining new scenarios is still in progress.

But for the demo, I manually seeded three English descriptions, and the agent handled the rest.

Soon, you’ll just type something like:

“When retrieving articles for a user, return a 200 and the user’s articles.”

…hit Enter, and let the agent figure everything else out.

Final Thoughts

This is the closest I’ve seen to the original vision: a fully automated, iterative, self-correcting test-authoring agent that can learn from its mistakes, refine its expectations, and generate robust, clean scenarios — all starting from nothing but a sentence.

It’s fast.
It’s reliable.
And weirdly, it’s starting to write tests in ways even I wouldn’t have thought of.

Very soon, this will be in your hands. And I think it’s going to fundamentally change how teams write and maintain integration tests.
