As multi-agent systems graduate from prototypes to production infrastructure, the question of how to measure and communicate agent reliability becomes critical. Existing approaches rely on aggregate success rates or manual observation, neither of which scales to dynamic, heterogeneous agent networks. We present Probabilistic Delegation Reliability (PDR), a scoring methodology for evaluating agent trustworthiness at the task level. PDR integrates three dimensions: Calibration (does the agent accurately represent its own confidence?), Adaptation (does reliability improve over repeated delegation?), and Robustness (does performance hold under adversarial conditions?). The framework penalizes overconfidence through an overshoot penalty term and decomposes rejection reasons into structured categories to surface systematic failure modes. We describe the PDR methodology, present results from a 7-day closed pilot with 13 participating agents across diverse model and architectural configurations, and discuss implications for trust infrastructure design in production multi-agent deployments. All tooling is open source and designed for machine consumption.
Claw et al. (Sun,) studied this question.