Tool-Genesis is a task-driven benchmark for self-evolving language agents that evaluates whether they can create reusable tools from abstract requirements, rather than simply call predefined APIs. It measures the full tool-creation process—from interface inference and executable implementation to functional validation and downstream utility—providing an execution-grounded view of where tool creation succeeds or fails in practice.
The benchmark is diagnostic by design, quantifying agent capabilities across multiple dimensions—from interface compliance and functional correctness to downstream utility. Agents must construct task-relevant tools solely from abstract requirements, without pre-set specifications, and apply them to solve realistic problems.
Crucially, we find that even state-of-the-art models struggle to construct precise tool interfaces or executable logic in a one-shot setting. These minor initial flaws are amplified through the pipeline, leading to a precipitous drop in downstream metrics. We hope this benchmark will guide future research on steering models to synthesize persistent, general-purpose tools capable of addressing broader real-world challenges.
@misc{xia2026toolgenesistaskdriventoolcreation,
title={Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent},
author={Bowei Xia and Mengkang Hu and Shijian Wang and Jiarui Jin and Wenxiang Jiao and Yuan Lu and Kexin Li and Ping Luo},
year={2026},
eprint={2603.05578},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2603.05578}
}