Tool-Genesis is a task-driven benchmark for self-evolving language agents that evaluates whether they can create reusable tools from abstract requirements, rather than simply call predefined APIs. It measures the full tool-creation process—from interface inference and executable implementation to functional validation and downstream utility—providing an execution-grounded view of where tool creation succeeds or fails in practice.
The benchmark is diagnostic by design, quantifying agent capabilities across multiple dimensions—from interface compliance and functional correctness to downstream utility. Agents must construct task-relevant tools solely from abstract requirements, without pre-set specifications, and apply them to solve realistic problems.
Crucially, we find that even state-of-the-art models struggle to construct precise tool interfaces or executable logic in a one-shot setting. These minor initial flaws are amplified through the pipeline, leading to a precipitous drop in downstream metrics. We hope this benchmark will guide future research on steering models to synthesize persistent, general-purpose tools capable of addressing broader real-world challenges.
@misc{xia2026toolgenesistaskdriventoolcreation,
title={Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent},
author={Bowei Xia and Mengkang Hu and Shijian Wang and Jiarui Jin and Wenxiang Jiao and Yuan Lu and Kexin Li and Ping Luo},
year={2026},
eprint={2603.05578},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2603.05578}
}