As artificial intelligence (AI) systems increasingly match human performance on standardized mental state recognition tasks, the question is no longer whether AI is human-level, but which humans define that level. This question remains underexplored. This study evaluates GPT-5 mini against the full spectrum of human ability, not just average performance, on standardized forced-choice emotion and mental state recognition tasks including the Reading the Mind in the Eyes Test (RMET) and the Multiracial Reading the Mind in the Eyes Test (MRMET). At the individual level, GPT-5 mini outperforms human participants across nearly all performance levels on both the RMET and MRMET. Yet this advantage is reversed when independent responses are aggregated. Using bootstrap-resampled plurality voting to aggregate independent responses, we find that human responses significantly outperform those of GPT-5 mini. This wisdom-of-crowds effect cannot be replicated through repeated sampling from AI models. Furthermore, an augmented approach that aggregates bootstrapped human and AI responses together outperforms either source alone. These findings suggest that evaluating AI against average human performance risks mistaking AI mediocrity for human excellence. We discuss the implications of these findings for combining human and machine intelligence to surpass what either achieves in isolation.
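The bootstrap-resampled plurality voting described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the crowd size, the number of resamples, and the random tie-breaking rule are all assumptions made for illustration, and the toy responses are invented rather than drawn from the RMET or MRMET.

```python
import random
from collections import Counter


def plurality_vote(votes, rng):
    """Return the most frequent option; ties are broken at random (an assumption)."""
    counts = Counter(votes)
    top = max(counts.values())
    winners = sorted(option for option, n in counts.items() if n == top)
    return rng.choice(winners)


def bootstrap_crowd_accuracy(responses, answer_key, crowd_size, n_boot=1000, seed=0):
    """Mean accuracy of plurality-vote crowds drawn by bootstrap resampling.

    responses  : list of respondents, each a list with one answer per item
    answer_key : list of correct answers, one per item
    crowd_size : respondents drawn (with replacement) per resample (hypothetical)
    n_boot     : number of bootstrap resamples (hypothetical)
    """
    rng = random.Random(seed)
    n_items = len(answer_key)
    total = 0.0
    for _ in range(n_boot):
        # Draw a crowd of respondents with replacement.
        crowd = [rng.choice(responses) for _ in range(crowd_size)]
        correct = 0
        for i in range(n_items):
            votes = [respondent[i] for respondent in crowd]
            if plurality_vote(votes, rng) == answer_key[i]:
                correct += 1
        total += correct / n_items
    return total / n_boot


# Invented toy data: three respondents, four forced-choice items.
answer_key = ["a", "b", "c", "d"]
toy_responses = [
    ["a", "b", "c", "a"],
    ["a", "c", "c", "d"],
    ["b", "b", "c", "d"],
]
crowd_accuracy = bootstrap_crowd_accuracy(toy_responses, answer_key, crowd_size=5, n_boot=200)
```

Under the same assumptions, an augmented crowd could be approximated by concatenating the human and model response lists before resampling, although the abstract does not state how the two sources were pooled.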
Akben et al. studied this question.