Abstract
This study examined how well large language models (LLMs) approximate human psychological ratings for early-acquired English words. We used four state-of-the-art LLMs, including GPT-4o and Meta-Llama-3.1, to evaluate 21 static psychological features for 695 words and compared these estimates with human norms. The results showed that LLMs aligned well with human ratings for some features (e.g., Concreteness, Bodily Interactiveness) in terms of rank correlations (rs > .82) and distributional similarities but diverged notably for others (e.g., Iconicity, Arousal; rs < .48). Compared with content words, function words showed more pronounced discrepancies between human and LLM ratings. We also assessed how similarly human- and LLM-derived psychological features predicted words’ age of acquisition (AoA), revealing both strong correspondences and systematic biases, depending on the model (differences in correlations ranged from −.27 to .28). Based on these analyses, we identified which features may be reliably estimated using LLMs, which require further refinement, and what methodological considerations are necessary for applying LLM-based measures in cognitive science. We discuss the implications of using LLMs as methodological tools in psychology and cognitive science, highlighting both their practical advantages (e.g., data coverage and data collection efficiency) and theoretical relevance. The present study provides a novel framework for evaluating the cognitive plausibility of LLMs by using lexical psychological features, complementing existing benchmarks.
Hiromichi Hagihara
Kazuki Miyazawa
Behavior Research Methods
The University of Tokyo
The University of Osaka
Toneyama National Hospital
www.synapsesocial.com/papers/69843405f1d9ada3c1fb1b1c — DOI: https://doi.org/10.3758/s13428-025-02938-2