Dataset curation
One of the major elements of working in an ML or research paradigm for building agents is making improvements or changes informed by experiment. We have already covered why you would take this approach, how you might structure experiments, a worked example and how to evaluate agents. Here, we cover dataset curation. Dataset creation will be covered in another post.
Disclaimer: This is just our opinion on creating and curating datasets, based on what we have needed internally. Everyone will have an opinion about the makeup of datasets, and we are sure that professional statisticians will have a lot to complain about in what follows. We are primarily writing here to present our thinking on benchmarking agents for automated code testing, where we hope we are at least beyond Mt. Stupid and partway through the trough of despair on the Dunning-Kruger curve.

Datasets
There are three main attributes to consider for a dataset here - size, composition and diversity. Size is simply the number of items. Composition refers to the different dimensions or features present in a dataset; for discrete features, these can be enumerated as categories. For the continuous case, which we won't cover here, the analogue of this count is an integral over a domain that need not be continuous or simply connected. Diversity refers to the distribution of items across the features and categories of the dataset - how many of each type you include.
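As a concrete illustration, these three attributes can be computed for a small toy dataset of discrete items. A minimal sketch - the feature names ("language", "test_framework") and values are invented purely for the example:

```python
from collections import Counter

# Invented toy dataset: each item is tagged with discrete features.
dataset = [
    {"language": "python", "test_framework": "pytest"},
    {"language": "python", "test_framework": "unittest"},
    {"language": "go", "test_framework": "testing"},
    {"language": "python", "test_framework": "pytest"},
]

# Size: the number of items.
size = len(dataset)  # 4

# Composition: the categories present along each feature dimension.
composition = {
    feature: sorted({item[feature] for item in dataset})
    for feature in dataset[0]
}  # composition["language"] == ["go", "python"]

# Diversity: how many items fall into each category.
diversity = {
    feature: Counter(item[feature] for item in dataset)
    for feature in dataset[0]
}  # diversity["language"] == {"python": 3, "go": 1}
```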
Maintaining a test and train split
This sounds obvious, or like it should just be taken for granted, but an example of AI companies violating this principle comes up every so often, so we discuss it here.
The makeup of the test and train split can depend on what exactly you are building. If you are training or fine-tuning a model, where you actually change weights in some way, then your training set will likely be much larger than, or a similar size to, your test set. Here the data is really doing the work in building the model. The model compresses the data and learns from it (fitting a high-dimensional function that maps the data to some label or reward or such). The test data checks whether the model generalises to unseen data or overfits to the training data.
In building agents, where you already have a well-trained model, the split might look somewhat different. The “training” dataset might be much smaller than the testing one, and will typically consist of a few specifically chosen or designed test cases (like an example repository). This small dataset is used to shape the workflow of the agent, including identifying subtasks, writing system prompts, deciding about self-healing loops and stopping conditions, deciding on output formats and choosing tools.
The test set can then be much larger, as it does two things:
- Benchmark how successfully the agent generalises to other cases.
- Identify areas of poor performance and indicate what changes should be made.
This may sound like the test set is being used as a training set, violating test/train split good practice, but that is not what we are doing here. The insights from the benchmark runs are used to make engineering changes to the workflow, in the same way that, in a model-training workflow, insights from a validation set inform feature engineering, hyper-parameter selection or training methods. The test data isn't encoded in the model weights because we are not tuning the model, only the architecture around it.
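Identifying areas of poor performance from benchmark runs can be as simple as breaking results down per feature category. A minimal sketch, using invented result data:

```python
from collections import defaultdict

# Invented benchmark results: (feature category, pass/fail) per item.
results = [
    ("python", True), ("python", True), ("python", False),
    ("go", True), ("go", False), ("go", False),
]

totals = defaultdict(lambda: [0, 0])  # category -> [passes, runs]
for category, passed in results:
    totals[category][0] += int(passed)
    totals[category][1] += 1

success_rates = {cat: passes / runs for cat, (passes, runs) in totals.items()}
# python ~0.67 vs go ~0.33: the go cases are where the workflow
# needs engineering attention next.
```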
That said, it can be worth curating yet another, final benchmark dataset. It can be smaller, but needs diversity comparable to what you expect to see in real-world data. The aim is to confirm that the agent's architecture hasn't been overfit to the testing data, though this is less of a risk than when training model weights.

Choosing size and composition
Dataset size depends on the purpose of the project, such as whether it is a major benchmark, prototyping or a targeted experiment.
Prototyping may need as few as a single item, or just a handful, depending on task complexity and how long the agent workflow takes to run. It helps to focus on 1-3 examples that you understand well, and to choose something simple: it will be easier to understand and more likely to generalise widely.
For targeted experiments, 10-20 items is ideal, though fewer than 10 can suffice, depending on the computational time the experiment takes. Either way, the dataset should be sufficient to:
- Answer your research question.
- Generate statistics so that you can quantify the uncertainty of results.
An example experiment is given in this blog post.
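For quantifying uncertainty at these small sample sizes, one option among many is a Wilson score interval on the success rate, which behaves better than the naive normal approximation when n is small. A sketch - `wilson_interval` is our own helper, not a library function:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Approximate 95% Wilson score interval for a success rate."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - half, centre + half)

# 14 passes out of 20 items: a point estimate of 0.7, but a wide
# interval of roughly (0.48, 0.85) - exactly the uncertainty a small
# targeted experiment carries.
lo, hi = wilson_interval(14, 20)
```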
Benchmarks should include tens to hundreds of items, in order to generate robust statistics and cover the diversity of situations the agent is expected to encounter in the real world. Cover as many of these situations as possible, with multiple cases of each, so that the statistics are robust not just for the dataset as a whole but for specific features or elements within it.
Diversity
Prototyping should focus on specific features. Keep it simple.
Targeted experiments should include the diversity of items needed to answer your research question. Use your judgement. Increase the size if you need a better measure of the uncertainty of results. Ensure the dataset includes the features you need to probe or test in your experiment, but don't overcomplicate it: you need to be able to interpret and understand the results.
For benchmarks, you want to try to include all of the features that will be present in the real world - or at least enough to demonstrate that the agent can generalise to working with these features. In a perfect scenario, the distribution would mimic the expected distribution of the real world, but that is rarely possible.
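One simple way to approximate a target real-world distribution is stratified sampling from a larger pool of candidate items. A sketch, with an invented pool and an assumed target distribution:

```python
import random

# Invented pool of candidate items, keyed by feature category only.
pool = ["python"] * 50 + ["go"] * 30 + ["rust"] * 20

# Assumed real-world distribution the benchmark should approximate.
target = {"python": 0.5, "go": 0.3, "rust": 0.2}
benchmark_size = 20

rng = random.Random(0)  # seeded so the draw is reproducible
benchmark = []
for category, fraction in target.items():
    candidates = [item for item in pool if item == category]
    benchmark.extend(rng.sample(candidates, round(fraction * benchmark_size)))

# The draw yields 10 python, 6 go and 4 rust items - the target
# proportions, scaled to the benchmark size.
```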
For experiments, a useful addition is to include failure modes in the dataset - cases where you expect a failure - to check that the agent fails as it should, and that the LLM doesn't do something unexpected. These cases can also be useful in benchmark datasets, but there you should be careful when interpreting headline results: failure-mode cases will skew them, especially any "success" measures.
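One way to stop failure-mode cases from skewing headline numbers is to flag them in the dataset and report them separately. A minimal sketch, with invented results and field names:

```python
# Invented results: "passed" records the agent's verdict, and
# "expect_failure" flags items curated as deliberate failure modes.
results = [
    {"expect_failure": False, "passed": True},
    {"expect_failure": False, "passed": True},
    {"expect_failure": False, "passed": False},
    {"expect_failure": True, "passed": False},  # failed, as designed
    {"expect_failure": True, "passed": False},  # failed, as designed
]

normal = [r for r in results if not r["expect_failure"]]
failure_modes = [r for r in results if r["expect_failure"]]

# Headline success rate over ordinary cases only, so the deliberate
# failure modes don't drag it down.
headline_success = sum(r["passed"] for r in normal) / len(normal)

# Reported separately: how often the agent failed where failure was expected.
failures_handled = sum(not r["passed"] for r in failure_modes) / len(failure_modes)
```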
Conclusion
To conclude, we will make gratuitous use of the "rule of three" in storytelling. One: there are three types of datasets that you will likely need when building agents, depending on your task: datasets for prototyping, datasets for targeted experiments, and datasets for benchmarking. Two: for any of these, there are three things to take into account when curating a dataset - size, composition and diversity. How you choose these will depend on the task. Three: the test/train split can work differently for building agents if you are not tuning or fitting model weights, so feel free to structure it differently to how you would for a more traditional ML project. Happy tinkering!
