Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agents

1UESTC, 2The University of Hong Kong, 3Xiaohongshu Inc.
*Equal contribution. Corresponding authors.

An overview of Tool-Genesis as a task-driven benchmark:
Interface Inference derives tool signatures and I/O schemas from abstract requirements. Executable Implementation builds runnable tool logic aligned with the inferred interface. Validation & Utility checks functional correctness and evaluates downstream usefulness to reveal failure modes.
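The three stages above can be illustrated with a minimal sketch. The tool name, schema format, and test cases below are hypothetical examples chosen for illustration, not the benchmark's actual task format:

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical I/O schema an agent might infer in Stage 1 (Interface
# Inference) from an abstract requirement such as
# "convert temperatures between units".
@dataclass
class ToolInterface:
    name: str
    inputs: dict   # parameter name -> type string
    output: str    # return type string

iface = ToolInterface(
    name="convert_temperature",
    inputs={"value": "float", "unit_from": "str", "unit_to": "str"},
    output="float",
)

# Stage 2: Executable Implementation aligned with the inferred interface.
def convert_temperature(value: float, unit_from: str, unit_to: str) -> float:
    to_c = {"C": lambda v: v,
            "F": lambda v: (v - 32) * 5 / 9,
            "K": lambda v: v - 273.15}
    from_c = {"C": lambda v: v,
              "F": lambda v: v * 9 / 5 + 32,
              "K": lambda v: v + 273.15}
    return from_c[unit_to](to_c[unit_from](value))

# Stage 3: Validation -- execute the tool on held-out cases to check
# functional correctness before measuring downstream utility.
def validate(tool: Callable, cases: list[tuple[tuple, Any]]) -> bool:
    return all(abs(tool(*args) - expected) < 1e-6
               for args, expected in cases)

ok = validate(convert_temperature,
              [((100.0, "C", "F"), 212.0), ((32.0, "F", "C"), 0.0)])
print(ok)  # True
```

The point of the sketch is the separation of concerns: an interface can be syntactically valid (Stage 1) yet semantically wrong, and an implementation can run (Stage 2) yet fail validation (Stage 3), which is exactly the failure decomposition the benchmark measures.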

Agent2World overall pipeline

Abstract

Tool-Genesis is a task-driven benchmark for self-evolving language agents that evaluates whether they can create reusable tools from abstract requirements, rather than simply call predefined APIs. It measures the full tool-creation process—from interface inference and executable implementation to functional validation and downstream utility—providing an execution-grounded view of where tool creation succeeds or fails in practice.

Concretely, Tool-Genesis is a diagnostic benchmark designed to quantify agent capabilities across multiple dimensions, from interface compliance and functional correctness to downstream utility. It evaluates whether agents can construct task-relevant tools solely from abstract requirements, without pre-set specifications, and use them to solve realistic problems.

Crucially, we find that even state-of-the-art models struggle to construct precise tool interfaces or executable logic in a one-shot setting. Minor initial flaws are amplified along the pipeline, producing a precipitous drop in downstream metrics. We hope this benchmark will guide future research on steering models to synthesize persistent, general-purpose tools for broader real-world challenges.

Leaderboard

Sorted by SR (descending) · Click any column header to sort
L1: Surface Compliance
L2: Semantic Interface Fidelity
L3: Functional Correctness
L4: Downstream Task Utility
Best in Column

BibTeX

The BibTeX entry will be made available once the paper is released.