What question did this study set out to answer?

This research aims to analyze structural factors determining the difficulty of Theory of Mind reasoning in language models.

June 17, 2026 · Open Access

AnaToM: A Dataset Generation Framework for Evaluating Theory of Mind Reasoning Toward Anatomy of Difficulty Through Structurally Controlled Story Generation

Key Points

This research aims to analyze structural factors determining the difficulty of Theory of Mind reasoning in language models.
Developed AnaToM, a dataset generation framework that controls structural parameters such as the number of entities and the timeline in a story.
Systematically analyzed the effects of these parameters on Theory of Mind evaluation in language models.
Identified specific structural factors that significantly impact the difficulty of ToM reasoning.
Established a foundational diagnostic baseline for evaluating sociocognitive capabilities in future benchmarks.

Abstract

Evaluating Theory of Mind (ToM) in Large Language Models (LLMs) is an important area of research for understanding the social intelligence of artificial intelligence. Recent ToM benchmarks have significantly enhanced the complexity, comprehensiveness, and practicality of evaluations. However, while the focus has been on constructing “more difficult” or “more comprehensive” tasks, systematic analysis of structural factors that inherently determine the difficulty of ToM reasoning, i.e., “what” makes reasoning difficult, is insufficient. Hence, we propose a new dataset generation framework for ToM evaluation, named AnaToM. To realize an “anatomy of difficulty” in ToM reasoning, AnaToM strictly controls structural parameters such as the number of entities and the timeline in a story. This parameter control enables the isolation and identification of factors affecting ToM in LLMs, thereby enabling a more precise examination of their reasoning mechanisms. The proposed framework provides a systematic methodology for diagnosing the structural limits of LLM reasoning abilities, thus offering a foundational diagnostic baseline that complements the evaluation of broader sociocognitive capabilities in future benchmark designs.
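
To make the idea of structurally controlled story generation concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names (`generate_story`, `belief_question`), the ball-and-container scenario, and the choice of two parameters (`num_agents` for the number of entities, `num_moves` for the length of the timeline) are all assumptions made for illustration. The sketch builds a simple false-belief story and tracks what each agent believes, so that a question and its gold answer can be derived from the structure alone.

```python
import random
from dataclasses import dataclass


@dataclass
class Story:
    sentences: list       # story text, one event per sentence
    beliefs: dict         # agent name -> where that agent thinks the ball is
    true_location: str    # where the ball actually is at the end


def generate_story(num_agents: int, num_moves: int, seed: int = 0) -> Story:
    """Build a false-belief story from two structural parameters (hypothetical)."""
    rng = random.Random(seed)
    agents = [f"Agent{i}" for i in range(1, num_agents + 1)]
    containers = ["box", "basket", "drawer", "bag"]

    location = containers[0]
    present = list(agents)                     # agents currently in the room
    beliefs = {a: location for a in agents}    # everyone sees the initial state
    verb = "is" if num_agents == 1 else "are"
    sentences = [
        f"{', '.join(agents)} {verb} in the room.",
        f"The ball is in the {location}.",
    ]

    for _ in range(num_moves):
        # An agent leaves before each move, as long as a witness remains.
        if len(present) > 1:
            leaver = present.pop(rng.randrange(len(present)))
            sentences.append(f"{leaver} leaves the room.")
        mover = rng.choice(present)
        location = rng.choice([c for c in containers if c != location])
        sentences.append(f"{mover} moves the ball to the {location}.")
        # Only agents still in the room update their belief.
        for a in present:
            beliefs[a] = location

    return Story(sentences, beliefs, location)


def belief_question(story: Story, agent: str):
    """Return a first-order belief question and its gold answer."""
    return f"Where does {agent} think the ball is?", story.beliefs[agent]
```

Under these assumptions, varying `num_agents` while holding `num_moves` fixed (or the reverse) yields stories that differ in exactly one structural factor, which is the kind of isolation the abstract describes.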

Cite This Study

Suzuki et al. studied this question.

synapsesocial.com/papers/6a323957d50b63ecad204d06 https://doi.org/10.5715/jnlp.33.630
