Tool-Genesis is a task-driven benchmark for self-evolving language agents that evaluates whether they can create reusable tools from abstract requirements, rather than simply call predefined APIs. It measures the full tool-creation process—from interface inference and executable implementation to functional validation and downstream utility—providing an execution-grounded view of where tool creation succeeds or fails in practice.
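Concretely, an execution-grounded harness for these four stages might look like the sketch below. This is a minimal illustration under assumed names: `GenesisTask`, the agent methods `infer_interface`, `implement_tool`, and `solve_with_tool`, and the convention that the generated tool is a function named `tool` are all hypothetical, not the benchmark's actual API.

```python
from dataclasses import dataclass


@dataclass
class GenesisTask:
    requirement: str   # abstract natural-language requirement the agent sees
    tests: list        # held-out (args, expected) pairs for validation
    problem: str       # downstream task the created tool should help solve
    reference: str     # reference answer for the downstream task


def parses(source: str) -> bool:
    """Implementation check: does the generated code at least parse?"""
    try:
        compile(source, "<generated-tool>", "exec")
        return True
    except SyntaxError:
        return False


def run_tests(source: str, tests: list) -> bool:
    """Functional validation: run the tool on held-out test cases."""
    namespace = {}
    try:
        exec(source, namespace)   # assumes sandboxing; never exec untrusted code raw
        tool = namespace["tool"]  # assumed convention: entry point named `tool`
        return all(tool(*args) == expected for args, expected in tests)
    except Exception:
        return False


def evaluate(agent, task: GenesisTask) -> dict:
    """Score each stage separately so failures can be localized."""
    # 1. Interface inference: propose a signature from the requirement alone.
    interface = agent.infer_interface(task.requirement)
    # 2. Executable implementation: write code against that interface.
    source = agent.implement_tool(interface)
    results = {
        "interface": interface is not None,
        "implementation": parses(source),
    }
    # 3. Functional validation on held-out tests.
    results["validation"] = results["implementation"] and run_tests(source, task.tests)
    # 4. Downstream utility: solve the end task using the created tool.
    results["downstream"] = results["validation"] and (
        agent.solve_with_tool(task.problem, source) == task.reference
    )
    return results
```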
The benchmark is diagnostic by design: it quantifies agent capability along complementary dimensions, from interface compliance and functional correctness to downstream utility, and it tests whether agents can construct task-relevant tools from abstract requirements alone (without pre-set specifications) and then apply them to realistic problems.
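To make the input side concrete, a fabricated task instance under the same assumed schema might look like the following; the agent sees only `requirement` and `problem`, while `tests` and `reference` stay hidden until scoring.

```python
# Fabricated example, not an actual benchmark item. `agent` stands for any
# object exposing the three hypothetical methods used by evaluate() above.
task = GenesisTask(
    requirement="Create a tool that converts temperatures between C and F.",
    tests=[((100.0, "C", "F"), 212.0), ((32.0, "F", "C"), 0.0)],  # held out
    problem="What is 37 degrees Celsius in Fahrenheit?",
    reference="98.6",
)
per_stage = evaluate(agent, task)
print(per_stage)  # e.g. {'interface': True, 'implementation': True, ...}
```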
Crucially, we find that even state-of-the-art models struggle to construct precise tool interfaces or executable logic in a one-shot setting. These small initial flaws compound through the pipeline, producing a precipitous drop in downstream metrics. We hope this benchmark guides future research toward models that synthesize persistent, general-purpose tools for broader real-world challenges.
Paper coming soon.