October 16, 2025

GPT-based Creativity Assessment: Focusing on Comparison with Human Experts

Key Points

Under suitable settings, GPT-based creativity assessments were consistent with expert evaluations, indicating potential for broader use.
The study compared GPT-4.1 and GPT-4o evaluations of creative titles generated by 99 middle school students with ratings from six creativity experts.
Agreement among GPT evaluations was measured with percentage agreement and intraclass correlation (ICC); consistency with expert evaluations was measured with Pearson’s r, Spearman’s rho, and RMSE.
Identifying the optimal model, prompt, and temperature settings could make creativity evaluation more efficient without compromising quality.

Abstract

The development of generative artificial intelligence (AI) is significantly reshaping creativity research, particularly regarding its potential for assessing creative products effectively. Evaluations of creative outputs have traditionally relied on the expert-based Consensual Assessment Technique (CAT); however, CAT demands substantial time and resources to achieve high reliability. It has therefore become necessary to investigate AI-driven evaluation methods capable of complementing or substituting for expert assessments. This study used GPT-4.1 and GPT-4o, both multimodal large language models (LLMs), to evaluate creative titles generated by 99 middle school students, and compared those evaluations with CAT evaluations provided by six creativity experts. Specifically, evaluations were conducted repeatedly under varying conditions of model, prompt type, and temperature. Agreement among GPT evaluations was measured by percentage agreement and intraclass correlation (ICC), whereas consistency with CAT evaluations was examined using Pearson’s r, Spearman’s rho, and root mean square error (RMSE). The analyses identified the GPT model, prompt, and temperature settings most consistent with CAT, providing practical guidelines for GPT-based creativity assessments. This study contributes foundational insights for designing and implementing AI-based evaluations that align with CAT principles.
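The agreement measures named in the abstract can be illustrated with a minimal sketch. The summary does not give the study's rating scale, data, or analysis code, so the function names and the toy scores below are hypothetical and serve only to show how each statistic is computed; ICC is omitted because its variant is not specified here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def rmse(x, y):
    """Root mean square error between two equal-length score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def percent_agreement(run_a, run_b):
    """Percentage of items that received identical scores in two runs."""
    matches = sum(1 for a, b in zip(run_a, run_b) if a == b)
    return 100.0 * matches / len(run_a)

# Hypothetical toy data: mean expert (CAT) scores and two repeated GPT runs.
expert_mean = [2.0, 3.5, 4.0, 1.5, 5.0]
gpt_run1 = [2, 4, 4, 1, 5]
gpt_run2 = [2, 3, 4, 1, 5]

print("Pearson r:", pearson_r(expert_mean, gpt_run1))
print("Spearman rho:", spearman_rho(expert_mean, gpt_run1))
print("RMSE:", rmse(expert_mean, gpt_run1))
print("Percent agreement between runs:", percent_agreement(gpt_run1, gpt_run2))
```

In this sketch, the first three statistics compare one GPT run with the expert scores, while percentage agreement compares two GPT runs with each other, mirroring the two roles the abstract assigns to these measures.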

Cite This Study

Lee et al. studied this question.

synapsesocial.com/papers/68f04acce559138a1a06ea09 https://doi.org/10.36358/jce.2025.25.3.1
