What type of study is this?

This is a quantitative study.

October 18, 2025 · Open Access

Evaluating the performance of general purpose large language models in identifying human facial emotions

Key Points

GPT-4o and Gemini 2.0 Experimental matched or exceeded human performance in recognizing facial emotions.
The study evaluated three leading LLMs (GPT-4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet) on the NimStim dataset, with especially strong results for the calm/neutral and surprise categories.
All models showed strong overall agreement with ground truth, but all of them frequently misclassified fear.
Findings emphasize the increasing socioemotional competence of LLMs, suggesting they could be valuable tools in healthcare settings.

Abstract

We evaluated the ability of three leading LLMs (GPT-4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet) to recognize human facial expression using the NimStim dataset. GPT and Gemini matched or exceeded human performance, especially for calm/neutral and surprise. All models showed strong agreement with ground truth, though fear was often misclassified. Findings underscore the growing socioemotional competence of LLMs and their potential for healthcare applications.
