What question did this study set out to answer?

The study evaluates how the sentiment characteristics and quality of feedback on programming assignments differ across large language models.

May 21, 2026 · Open Access

Multimodel sentiment analysis of feedback generated by large language models in programming assessment

Key points

Analyzed feedback from 18 large language models across over 6,000 programming assignments.
Utilized a RoBERTa-based classifier for automated sentiment analysis.
Employed hierarchical clustering to categorize sentiment patterns from the feedback.
Average comment length varied across models, from 42 words to more than 270 words.
Strong correlation found between feedback sentiment and assigned grades (r = 0.707).
Consistency across models in how feedback is formulated was limited (ICC = 0.061), with systematic differences in emotional tone.
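As a rough illustration of the two statistics reported above, the following Python sketch computes a Pearson correlation and an intraclass correlation coefficient from scratch. It rests on assumptions not stated in this summary: sentiment is treated as a numeric score per comment, and the ICC is the one-way random-effects form, ICC(1,1), which may differ from the variant the authors used. The function names and sample values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def icc_oneway(ratings):
    """One-way random-effects ICC(1,1) (assumed variant, not confirmed by the source).

    ratings: one row per student submission; each row holds k scores,
    one per model, for that same submission.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    # Between-submission and within-submission mean squares.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for row, m in zip(ratings, row_means) for v in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Illustrative values only, not data from the study.
sentiment = [1, 2, 3]        # hypothetical sentiment scores
grades = [2, 4, 6]           # hypothetical grades for the same comments
print(pearson_r(sentiment, grades))                # 1.0: perfectly linear

perfect_agreement = [[1, 1], [2, 2], [3, 3]]       # two models, three submissions
print(icc_oneway(perfect_agreement))               # 1.0: models always agree
```

An ICC near 1 means the models give nearly identical scores to the same submission; a value near 0, such as the reported 0.061, means most of the variance comes from disagreement between models and little from differences between submissions.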

Abstract

Despite growing claims about the enhanced capabilities of successive generations of Large Language Models (LLMs), empirical evidence regarding differences in feedback quality and sentiment characteristics remains limited. This study systematically analyzes the sentiment and stylistic features of feedback generated by 18 contemporary LLMs across more than 6,000 student programming assignments. The analyzed models were Anthropic’s claude-3-5-haiku, claude-opus-4-1, and claude-sonnet-4; DeepSeek’s deepseek-chat and deepseek-reasoner; Google’s gemini-2.0-flash-lite, gemini-2.0-flash, gemini-2.5-flash-lite, gemini-2.5-flash, and gemini-2.5-pro; and OpenAI’s gpt-4.1-mini, gpt-4.1-nano, gpt-4.1, gpt-4o-mini, gpt-4o, gpt-5-mini, gpt-5-nano, and gpt-5. Using automated sentiment analysis with a RoBERTa-based classifier, the study quantified emotional tone distributions and examined relationships between sentiment, feedback length, assigned grades, and task characteristics. Results revealed substantial heterogeneity in feedback properties across models. Average comment length ranged from 42 words (claude-3-5-haiku) to more than 270 words (gemini-2.5-flash), and sentiment distributions also differed markedly across models. Hierarchical clustering uncovered two distinct groups based on sentiment patterns, yet these did not align neatly with model architectures or vendor categories. Feedback sentiment correlated strongly with numerical grades (r = 0.707), while negative comments tended to be slightly longer than positive ones. Consistency across models was limited (ICC = 0.061), indicating wide variation in how different LLMs formulate evaluative judgments for identical student responses. Together, these findings show that automated feedback carries systematic emotional and stylistic differences across models, underscoring the importance of careful model selection and calibration in educational contexts.
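To make the clustering step more concrete, here is a minimal Python sketch of agglomerative hierarchical clustering that groups models by their sentiment distributions and stops at two groups. The abstract does not specify the linkage method, distance metric, or feature representation, so average linkage, Euclidean distance, and (negative, neutral, positive) proportion vectors are all assumptions. The model names and proportions are hypothetical placeholders, not results from the study.

```python
from itertools import combinations
from math import dist


def agglomerate(profiles, n_clusters=2):
    """Average-linkage agglomerative clustering (assumed linkage and metric).

    profiles: dict mapping an item name to its feature vector.
    Returns a list of clusters, each a list of item names.
    """
    clusters = [[name] for name in profiles]

    def avg_distance(a, b):
        # Mean Euclidean distance over all cross-cluster pairs.
        total = sum(dist(profiles[p], profiles[q]) for p in a for q in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters (i < j) and merge j into i.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: avg_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical (negative, neutral, positive) sentiment proportions per model.
profiles = {
    "model_a": (0.10, 0.30, 0.60),
    "model_b": (0.12, 0.28, 0.60),
    "model_c": (0.40, 0.40, 0.20),
    "model_d": (0.45, 0.35, 0.20),
}
print(agglomerate(profiles))
# [['model_a', 'model_b'], ['model_c', 'model_d']]
```

Because the grouping depends only on sentiment proportions, two models from the same vendor can land in different clusters, which is consistent with the finding that the clusters did not align neatly with vendor categories.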
