Existing large language model benchmarks primarily evaluate model capability in isolation and provide limited visibility into how systems behave during real-time collaboration with human operators. Within the Interactive Intelligence Systems (IIS) framework, Adaptive Intelligence Systems (AIS) function as the measurement layer responsible for instrumenting and empirically evaluating interaction behavior. This work introduces the AIS HCI Benchmark, a replicable interaction-level evaluation protocol that operationalizes human–AI collaboration quality through observable process variables including scalarization latency, interaction cost, convergence dynamics, and adaptive expressive bandwidth. The benchmark emphasizes controlled, blind, and cross-system execution using only transcript-level observations, avoiding reliance on proprietary telemetry or model internals. Results demonstrate that interaction behavior exhibits stable, measurable regularities across heterogeneous systems and that these properties are not captured by traditional accuracy-based benchmarks. The contribution is methodological rather than competitive: the framework provides a lightweight, replication-friendly instrument for measuring collaboration dynamics at the system level. By treating interaction as the primary unit of analysis, the AIS benchmark supports cumulative comparison, reproducibility, and evidence-based evaluation of human–AI systems deployed in real-world settings.
E. Martin Browne (Mon,) studied this question.