What question did this study set out to answer?

This research aims to evaluate how accurately large language models reflect human psychological ratings for early-acquired English words.

February 5, 2026 (Open Access)

How well do large language models mirror human cognition of word concepts?: A comparison of psychological ratings for early-acquired English words

Key Points

Examined four state-of-the-art large language models (LLMs), including GPT-4o and Meta-Llama-3.1.
Evaluated 21 static psychological features for 695 early-acquired English words.
Compared LLM estimates with human psychological norms for various features.
Assessed how LLM and human-derived features predicted words' age of acquisition.
LLMs aligned well with human ratings for features such as Concreteness and Bodily Interactiveness (rank correlations > .82).
LLMs diverged notably from human ratings for features such as Iconicity and Arousal (rank correlations < .48).
Function words showed larger discrepancies between human and LLM ratings than content words.
Differences between human- and LLM-derived correlations with age of acquisition ranged from −.27 to .28, depending on the model.

Abstract

This study examined how well large language models (LLMs) approximate human psychological ratings for early-acquired English words. We used four state-of-the-art LLMs, including GPT-4o and Meta-Llama-3.1, to evaluate 21 static psychological features for 695 words and compared these estimates with human norms. The results showed that LLMs aligned well with human ratings for some features (e.g., Concreteness, Bodily Interactiveness) in terms of rank correlations (r_s > .82) and distributional similarities but diverged notably for others (e.g., Iconicity, Arousal; r_s < .48). Compared with content words, function words showed more pronounced discrepancies between human and LLM ratings. We also assessed how similarly human- and LLM-derived psychological features predicted words’ age of acquisition (AoA), revealing both strong correspondences and systematic biases, depending on the model (differences in correlations ranged from −.27 to .28). Based on these analyses, we identified which features may be reliably estimated using LLMs, which require further refinement, and what methodological considerations are necessary for applying LLM-based measures in cognitive science. We discuss the implications of using LLMs as methodological tools in psychology and cognitive science, highlighting both their practical advantages (e.g., data coverage and data collection efficiency) and theoretical relevance. The present study provides a novel framework for evaluating the cognitive plausibility of LLMs by using lexical psychological features, complementing existing benchmarks.
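The two comparisons described in the abstract (rank correlation between human and LLM ratings, and the difference in how each set of ratings correlates with AoA) can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names, the toy ratings, the AoA values, and the sign convention of the difference (LLM minus human) are all assumptions made for illustration, and the sketch uses only the Python standard library.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data for five words (NOT values from the study).
human_concreteness = [4.9, 4.7, 1.8, 2.1, 3.5]
llm_concreteness = [4.5, 5.0, 1.5, 2.6, 3.9]
aoa_months = [16, 18, 30, 27, 22]

# Human-LLM alignment for one feature.
alignment = spearman(human_concreteness, llm_concreteness)

# Difference in feature-AoA correlations (assumed convention: LLM minus human).
aoa_difference = spearman(llm_concreteness, aoa_months) - spearman(
    human_concreteness, aoa_months
)

print(round(alignment, 2), round(aoa_difference, 2))
```

A difference near zero would mean the LLM-derived feature relates to AoA in much the same way as the human-derived one. Larger positive or negative values would point to the kind of systematic bias the abstract mentions.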

Authors

Hiromichi Hagihara

Kazuki Miyazawa

Journals

Behavior Research Methods

Institutions

The University of Tokyo

The University of Osaka

Toneyama National Hospital
