Evaluating Theory of Mind (ToM) in Large Language Models (LLMs) is an important area of research for understanding the social intelligence of AI systems. Recent ToM benchmarks have significantly enhanced the complexity, comprehensiveness, and practicality of evaluations. However, while the focus has been on constructing “more difficult” or “more comprehensive” tasks, there has been insufficient systematic analysis of the structural factors that inherently determine the difficulty of ToM reasoning, i.e., of “what” makes reasoning difficult. Hence, we propose AnaToM, a new dataset generation framework for ToM evaluation. To realize an “anatomy of difficulty” in ToM reasoning, AnaToM strictly controls structural parameters of a story, such as the number of entities and the timeline. This parameter control makes it possible to isolate and identify the factors that affect ToM reasoning in LLMs, and thereby to examine their reasoning mechanisms more precisely. The proposed framework provides a systematic methodology for diagnosing the structural limits of LLM reasoning abilities, offering a foundational diagnostic baseline that complements the evaluation of broader sociocognitive capabilities in future benchmark designs.
Suzuki et al. (Thu,) studied this question.