June 4, 2026 · Open Access

Too Many Tools, Too Much Confusion? Navigating Agentic Tool Selection at Scale

Key Points

Addressed the scalability problems that large language model agents face when selecting from extensive tool repositories.
Proposed a Retrieval–Plan–Select (RPS) framework for tool selection.
Implemented context-aware query decomposition and synthetic tool description augmentation.
Evaluated the framework on the Ultratool, ToolLinkOS, and ToolRet datasets.
Increased end-to-end recall with contextual decomposition: from 0.340 to 0.494 on Ultratool, from 0.208 to 0.323 on ToolLinkOS, and from 0.300 to 0.347 on ToolRet.
Improved retrieval quality with description augmentation, raising Recall@10 from 0.288 to 0.403.
Reduced high-similarity semantic collisions by 41.9% at the 0.90 cosine-similarity threshold through description augmentation.
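The recall and collision figures above can be made concrete with a small sketch. It assumes that Recall@k is the share of gold tools found among the top k retrieved tools, that a "semantic collision" is a pair of distinct tools whose description embeddings have cosine similarity at or above a threshold, and that the 41.9% figure is a relative reduction in such pairs. These are illustrative readings; the paper's exact metric definitions may differ, and all function and tool names here are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int) -> float:
    """Share of gold tools that appear among the top-k retrieved tools (assumed definition)."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two dense vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def collision_count(embeddings: dict[str, Sequence[float]], threshold: float = 0.90) -> int:
    """Count unordered pairs of distinct tools whose embeddings meet the threshold (assumed definition)."""
    return sum(
        1
        for (_, u), (_, v) in combinations(embeddings.items(), 2)
        if cosine(u, v) >= threshold
    )


def relative_reduction(before: int, after: int) -> float:
    """Relative drop in collision count, e.g. 0.419 would correspond to 41.9%."""
    return (before - after) / before if before else 0.0


# Illustrative usage with made-up tool names and toy 2-D vectors.
print(recall_at_k(["search_flights", "book_hotel", "convert_currency"],
                  {"search_flights", "convert_currency", "send_email"}, k=2))
# 0.3333333333333333 (1 of 3 gold tools is in the top 2)
print(collision_count({"get_weather": [1.0, 0.0],
                       "get_forecast": [0.99, 0.1],
                       "send_email": [0.0, 1.0]}))
# 1 (only get_weather / get_forecast exceed 0.90)
```

Counting pairs rather than individual tools is one natural choice here, since a collision is a relation between two descriptions that a dense retriever cannot easily tell apart.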

Abstract

This paper addresses the critical scalability challenge that large language model agents face when operating over massive tool repositories. As tool catalogs expand to hundreds or thousands of functions, current architectures exhibit substantial performance degradation caused by semantic collisions between similar tools and ineffective handling of complex multi-tool scenarios. To address these bottlenecks, we propose a recall-first Retrieval–Plan–Select (RPS) framework that combines context-aware query decomposition with synthetic tool description augmentation. The proposed approach explicitly separates retrieval, planning, and final selection through step-local candidate generation, while augmented tool descriptions enriched with expanded summaries and synthetic user questions reduce representation collisions in dense embedding spaces. Evaluation across Ultratool, ToolLinkOS, and ToolRet demonstrates that contextual decomposition consistently improves end-to-end recall under large tool catalogs, increasing recall from 0.340 to 0.494 on Ultratool, from 0.208 to 0.323 on ToolLinkOS, and from 0.300 to 0.347 on ToolRet. Description augmentation further improves retrieval quality, increasing Recall@10 from 0.288 to 0.403 and reducing high-similarity semantic collisions by 41.9% at the 0.90 cosine-similarity threshold. The proposed framework highlights that scalable tool use should be approached primarily as a recall-oriented retrieval and planning problem rather than as a flat in-context selection task, providing practical guidance for building large-scale tool-augmented agents over modern API and MCP-based ecosystems.
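To make the separation of retrieval, planning, and selection concrete, here is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a toy bag-of-words vector in place of a dense embedding model, hand-written synthetic questions in place of generated ones, a hypothetical `decompose` that simply splits a request on "then" (the paper's decomposition is context-aware; this stub ignores context), and a hypothetical `select` that takes the top candidate where a real system would let the model choose. All tool names are illustrative.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    description: str
    # Hand-written stand-ins for synthetic user questions (assumption).
    synthetic_questions: list[str] = field(default_factory=list)

    def index_text(self) -> str:
        # Augmented representation: name, description, and synthetic questions together.
        return " ".join([self.name.replace("_", " "), self.description, *self.synthetic_questions])


def embed(text: str) -> Counter:
    # Toy bag-of-words vector standing in for a dense embedding model (assumption).
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def decompose(query: str) -> list[str]:
    # Hypothetical planner: split a multi-step request on "then" (no context carried over).
    return [part.strip(" ,.") for part in re.split(r"\bthen\b", query) if part.strip(" ,.")]


def retrieve(step: str, tools: list[Tool], k: int) -> list[Tool]:
    # Step-local candidate generation: each sub-query gets its own top-k shortlist.
    q = embed(step)
    return sorted(tools, key=lambda t: cosine(q, embed(t.index_text())), reverse=True)[:k]


def select(step: str, candidates: list[Tool]) -> Tool:
    # Hypothetical final selection: take the best-scoring candidate.
    return candidates[0]


def retrieve_plan_select(query: str, tools: list[Tool], k: int = 3) -> list[str]:
    plan = decompose(query)
    return [select(step, retrieve(step, tools, k)).name for step in plan]


TOOLS = [
    Tool("search_flights", "Search available flights between two cities.",
         ["find me a flight to Paris", "what flights leave tomorrow"]),
    Tool("convert_currency", "Convert an amount from one currency to another.",
         ["convert the price to euros", "how much is 100 dollars in yen"]),
    Tool("send_email", "Send an email message to a recipient.",
         ["email the itinerary to my colleague"]),
    Tool("get_weather", "Get the weather forecast for a city.",
         ["will it rain in Paris"]),
]

QUERY = "find flights to Paris then convert the price to euros then email the itinerary to Ana"
print(retrieve_plan_select(QUERY, TOOLS))
# ['search_flights', 'convert_currency', 'send_email']
```

Because each sub-query gets its own shortlist, a tool needed for only one step is not crowded out of a single flat top-k list by tools that match the other steps. This is one reading of why step-local candidate generation supports the recall-first framing.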
