Introduction
I’ve been working on transforming our manual tracking and testing process into a fully automated pipeline by converting the raw data from our tracking spreadsheet into a Langfuse dataset via a CLI tool. The goal is simple: take the repository metadata from that spreadsheet and automate the execution of each individual repository to test our agent's performance at scale.
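As a rough illustration of that conversion step, here is a minimal sketch using the Langfuse Python SDK. The dataset name and spreadsheet columns are placeholders rather than the actual ones used in the tool:

```python
import csv
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

# Hypothetical dataset/column names, for illustration only.
langfuse.create_dataset(name="compose-benchmark")

with open("repositories.csv") as f:
    for row in csv.DictReader(f):
        langfuse.create_dataset_item(
            dataset_name="compose-benchmark",
            input={"repo_url": row["repo_url"]},
            expected_output=row.get("expected_compose"),
            metadata={"size": row.get("size")},
        )
```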
Importance of Agent Testing Pipelines
Basically, we can't rely on manual spreadsheets or checking a single repository to know whether the agent is actually working. The reality is that LLMs are non-deterministic: they can work perfectly one time and fail the next, so we need volume to get any real confidence. Scaling this pipeline allows us to run experiments against fifty different projects at once, catching edge cases and regressions that a human would definitely miss. It’s about moving away from "hoping it works" to actually proving it, without spending hours manually verifying every single trace.
The Workflow
Previously, the tool only handled single-repository test generation. I have now expanded the logic to handle end-to-end execution based on trace metadata.
Here is the current flow of the CLI tool:
- It reads the repository metadata from the Langfuse dataset.
- It clones the specific repository into the benchmark_example folder.
- It starts the agent on that specific workspace.
- The agent kicks off its internal process: indexing the codebase and generating the Docker Compose file.
By having the CLI manage the repository setup and the agent handle the generation, we can now run full experiments on Docker Compose generation automatically.
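To make that flow concrete, here is a simplified sketch of the loop. The agent endpoint, dataset name, and helper names are illustrative assumptions, not the tool's actual API:

```python
import subprocess
from pathlib import Path

import httpx
from langfuse import Langfuse

WORKSPACE_ROOT = Path("benchmark_example")
AGENT_URL = "http://localhost:8000/generate-compose"  # assumed endpoint, for illustration

def run_item(item) -> None:
    """Clone the repository from a dataset item and hand the workspace to the agent."""
    repo_url = item.input["repo_url"]
    workspace = WORKSPACE_ROOT / repo_url.rstrip("/").split("/")[-1].removesuffix(".git")
    subprocess.run(["git", "clone", "--depth", "1", repo_url, str(workspace)], check=True)

    # The agent indexes the codebase and generates the Docker Compose file.
    httpx.post(AGENT_URL, json={"workspace": str(workspace)}, timeout=None)

langfuse = Langfuse()
dataset = langfuse.get_dataset("compose-benchmark")  # hypothetical dataset name
for item in dataset.items:
    run_item(item)
```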
Running an Experiment
To run an experiment, I’m using uv. The command is straightforward:
uv run benchmark run <experiment_config.yml>

The entire experiment is defined in a YAML configuration file. This allows us to be granular about what we test. In the YAML, you define (a sketch of such a config follows the list):
- Dataset & Run Name: To track it in Langfuse.
- Endpoint: The target URL.
- Input Mapping: How variables map to the endpoint.
- Filters: I introduced metadata filtering to control the scope. For example, right now, I’m filtering out "big projects" to prevent crashes while debugging, focusing only on smaller repositories.
- Evaluators: A list of automated checks to run on the output traces.
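Put together, a config could look something like the sketch below. The exact keys and values are assumptions based on the fields above, not the tool's real schema:

```yaml
# Hypothetical experiment config, for illustration only.
dataset: compose-benchmark
run_name: compose-generation-small-repos
endpoint: http://localhost:8000/generate-compose
input_mapping:
  repo_url: input.repo_url
filters:
  metadata:
    size: small          # skip "big projects" while debugging
evaluators:
  - schema_check
  - llm_judge
```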
Current Results and Challenges
When the experiment runs, the results are pushed directly to Langfuse. You can see the specific run and the traces corresponding to each repository tested.
However, I am currently facing a synchronization issue regarding the Docker Compose status.
The CLI waits for a "status up" signal. The problem is that the agent sometimes reports the status as "up" while it is still trying to complete the generation. Consequently, the CLI receives the success signal and shuts down the process before the agent has actually finished the work. This results in truncated runs (e.g., seeing only 4 out of 10 expected items) or infinite loops where the status never resolves.
To fix this, we likely need to decouple the "Compose status" from the "Agent status." I am discussing with Jordan the idea of adding a specific field or endpoint to poll the actual agent completion status, rather than just the Docker container status.
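One possible shape for that fix, assuming the agent exposes a dedicated completion field on a status endpoint (both hypothetical here), is to poll the two signals separately:

```python
import time

import httpx

STATUS_URL = "http://localhost:8000/status"  # hypothetical status endpoint

def wait_for_agent(timeout_s: int = 1800, interval_s: int = 10) -> dict:
    """Poll until the *agent* reports completion, not just until Compose is 'up'."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = httpx.get(STATUS_URL, timeout=30).json()
        # Two separate signals: container health vs. agent completion.
        if status.get("compose_status") == "up" and status.get("agent_status") == "finished":
            return status
        time.sleep(interval_s)
    raise TimeoutError("Agent did not report completion before the timeout")
```

The key point is that the loop only exits when both signals agree, so a premature "up" from Compose no longer terminates the run.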
Automated Evaluation (LLM-as-a-Judge)
The next major step is integrating automated evaluation. While we can use Python-based checks (validating output schema, checking string length, etc.), the real value comes from LLM-as-a-judge.
The concept is to take the agent’s input and output, compare it alongside the "expected output" from our dataset, and have an LLM score the quality. This will annotate the traces automatically, simulating a human reviewer.
There is a trade-off here: relying on an LLM judge means we essentially have to "evaluate the evaluator" to ensure it's accurate. However, once tuned, this will significantly speed up our feedback loop compared to manual annotation.
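As a rough sketch of what such a judge could look like (the model, prompt, and scoring scale are placeholders, and the resulting score would still need to be written back to the trace in Langfuse):

```python
import json

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are reviewing a generated Docker Compose file.
Given the repository input, the agent's output, and the expected output,
return JSON like {"score": <0.0-1.0>, "reasoning": "..."}."""

def judge(input_: str, output: str, expected: str) -> dict:
    """Score one trace with an LLM judge."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": json.dumps(
                {"input": input_, "output": output, "expected_output": expected}
            )},
        ],
    )
    return json.loads(response.choices[0].message.content)
```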
Q: Should we run this in CI/CD?
My stance is that we should not run this on every Pull Request. Calling live LLM providers (OpenAI, etc.) is slow, expensive, and flaky (random 400 errors or JSON parsing failures).
Instead, this pipeline is best suited for pre-release checks (e.g., nightly builds) to ensure no regression occurs before a deployment.
Regarding performance, the tool currently runs sequentially. This is because we do not yet have multi-workspace support in the product. Once we enable multi-instance support or multi-workspace capabilities in the backend, we can parallelize these runs to drastically reduce execution time.
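Once that lands, the sequential loop could be swapped for a bounded thread pool, reusing the hypothetical run_item helper from the sketch above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(items, max_workers: int = 5) -> None:
    """Run dataset items concurrently; each worker needs its own workspace/agent instance."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_item, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                future.result()
            except Exception as exc:
                print(f"Run failed for {item.input['repo_url']}: {exc}")
```

The max_workers cap would map to however many agent workspaces the backend can actually host at once.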
Next Steps
- Fix the Status Logic: Implement a robust check for agent completion to prevent premature CLI shutdown.
- Refine Evaluators: Work with Michael to finalize the LLM-as-a-judge implementation.
- Scale: Once the status bug is squashed, remove the filters and run the benchmark against the full 50+ repository dataset.
