What does this research mean for the field?

In this study, augmenting an LLM-based agent with a code graph analysis MCP server reduced its resolution rate by 14.9% in relative terms (from 49.8% to 42.4%) while improving efficiency: fewer tool calls, fewer tokens, and lower cost. The finding is classified as novel and as challenging the prevailing consensus.

What question did this study set out to answer?

The aim is to evaluate the effectiveness of Model Context Protocol servers on task completion by LLM-based agents.

February 16, 2026 | Open Access

mcpbr: Benchmarking Model Context Protocol Servers on Software Engineering Tasks

Key Points

The aim is to evaluate the effectiveness of Model Context Protocol servers on task completion by LLM-based agents.
Developed an open-source benchmark runner called mcpbr.
Conducted paired comparison experiments using 500 tasks from SWE-bench Verified.
Evaluated a specific code graph analysis MCP server with Claude Sonnet as the base agent.
MCP tool augmentation reduced the resolution rate by 14.9% in relative terms (from 49.8% to 42.4%).
Efficiency improved, with 42.3% fewer tool calls, 14.0% fewer tokens, and 15.2% lower cost.
The MCP server helped on only 1 of 12 repositories and hurt on 10.
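
The paired comparison mentioned in the key points can be illustrated with a minimal sketch. It assumes each run yields a mapping from task ID to a resolved/unresolved flag; the function name `paired_tally`, the task IDs, and the outcomes are hypothetical and are not taken from mcpbr or from the paper's data.

```python
from collections import Counter


def paired_tally(baseline, augmented):
    """Count per-task outcome pairs for two runs over the same task IDs.

    Hypothetical helper for illustration; not part of mcpbr.
    """
    if baseline.keys() != augmented.keys():
        raise ValueError("both runs must cover the same tasks")
    tally = Counter()
    for task_id, base_ok in baseline.items():
        mcp_ok = augmented[task_id]
        if base_ok and mcp_ok:
            tally["both"] += 1
        elif base_ok:
            tally["baseline_only"] += 1  # resolved only without MCP tools
        elif mcp_ok:
            tally["mcp_only"] += 1       # resolved only with MCP tools
        else:
            tally["neither"] += 1
    return tally


# Made-up outcomes for four tasks (True = task resolved).
baseline = {"task-1": True, "task-2": True, "task-3": False, "task-4": False}
augmented = {"task-1": True, "task-2": False, "task-3": True, "task-4": False}

tally = paired_tally(baseline, augmented)
print(dict(tally))
```

Because both conditions run on the same tasks, the informative cells are the discordant ones (`baseline_only` and `mcp_only`): they show where adding the MCP server changed the outcome.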

Abstract

The Model Context Protocol (MCP) lets developers expose tools and data sources to LLM-based agents through a standardized interface. Despite rapid ecosystem growth, no methodology exists for evaluating whether a given MCP server improves agent task completion. We present mcpbr, an open-source benchmark runner that isolates the effect of MCP tool augmentation through paired comparison experiments. We evaluate a code graph analysis MCP server on all 500 tasks from SWE-bench Verified using Claude Sonnet as the base agent. MCP augmentation reduced resolution rate by 14.9% (from 49.8% to 42.4%) while improving efficiency: 42.3% fewer tool calls, 14.0% fewer tokens, and 15.2% lower cost. Per-repository analysis shows the effect varies across codebases, with the server helping on 1 of 12 repositories and hurting on 10. We analyze this efficiency-resolution tradeoff and show that MCP tools alter the agent's exploration strategy, trading general-purpose search for opinionated shortcuts that can narrow the solution space.
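
The 14.9% figure is a relative reduction, not a drop in percentage points. The short calculation below reproduces it from the two rates reported in the abstract. The resolved-task counts are derived here from those rates and the 500-task total; they are an inference, not figures quoted from the paper.

```python
def relative_change(before, after):
    """Fractional change from `before` to `after` (negative means a drop)."""
    return (after - before) / before


baseline_rate = 0.498  # resolution rate without the MCP server (from the abstract)
mcp_rate = 0.424       # resolution rate with the MCP server (from the abstract)
total_tasks = 500      # SWE-bench Verified tasks used in the study

absolute_drop_pp = (baseline_rate - mcp_rate) * 100
relative_drop_pct = -relative_change(baseline_rate, mcp_rate) * 100

# Derived counts (an assumption: rates applied to all 500 tasks).
baseline_resolved = round(baseline_rate * total_tasks)
mcp_resolved = round(mcp_rate * total_tasks)

print(f"absolute drop: {absolute_drop_pp:.1f} percentage points")
print(f"relative drop: {relative_drop_pct:.1f}%")
print(f"resolved tasks: {baseline_resolved} vs {mcp_resolved}")
```

Under these assumptions the absolute drop is 7.4 percentage points, which corresponds to the 14.9% relative reduction reported.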

Cite This Study

Grey Newell studied this question.

synapsesocial.com/papers/6992b4ad9b75e639e9b09add
https://doi.org/10.5281/zenodo.18627369
