ABSTRACT As large language models (LLMs) continue to advance, evaluation frameworks must move beyond narrow task accuracy toward structured assessment of higher-order cognitive performance patterns. This study presents and empirically validates RFC-EVAL-001 v1.1, a multidimensional benchmarking protocol designed to assess four cognitive dimensions in artificial intelligence (AI) systems: model complexity, temporal horizon, meta-modeling, and adaptive flexibility. Six state-of-the-art LLMs participated in a complete cross-evaluation design in which every system evaluated every system (6 × 6 = 36 evaluations). Inter-rater reliability was assessed using the single-rater intra-class correlation coefficient (ICC(2,1); Shrout & Fleiss, 1979; Koo & Li, 2016). All four dimensions demonstrated high reliability (ICC range: 0.79–0.88), and bootstrap resampling (10,000 iterations) supported the robustness of these estimates. Aggregated reliability (ICC(2,k)) exceeded 0.94 across dimensions. Results reveal differentiated cognitive profiles across systems and show that epistemic calibration (self-assessment accuracy) varies independently of overall performance. These findings provide preliminary psychometric evidence supporting RFC-EVAL-001 as a reproducible protocol for multidimensional cognitive profiling in artificial systems, pending replication with larger and more diverse samples. The present contribution validates the measurement instrument itself and provides a methodological foundation for cumulative research on cognitive profiling in AI.
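As an illustration of the reliability statistics named above, the following minimal sketch computes the single-rater and averaged two-way random-effects ICCs from a targets-by-raters score matrix, using the standard mean-square formulas for ICC(2,1) and ICC(2,k). The function name `icc2` and the placeholder ratings are assumptions introduced here for illustration. They are not the study's code or data, and the bootstrap step is not shown.

```python
def icc2(ratings):
    """Return (ICC(2,1), ICC(2,k)) for a list of n target rows with k rater scores each.

    Two-way random-effects model, absolute agreement.
    """
    n = len(ratings)        # number of rated targets
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for targets (rows), raters (columns), and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-targets mean square
    msc = ss_cols / (k - 1)               # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


# Placeholder ratings (6 targets x 4 raters), NOT data from this study.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
single, average = icc2(example)
print(f"ICC(2,1) = {single:.2f}, ICC(2,k) = {average:.2f}")
# prints "ICC(2,1) = 0.29, ICC(2,k) = 0.62"
```

The averaged coefficient is always at least as large as the single-rater coefficient when the latter is positive, which is consistent with the pattern reported above (0.79–0.88 for single ratings, above 0.94 when aggregated).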
Víctor Cristóbal Bernal Díaz studied this question.