Despite growing claims about the enhanced capabilities of successive generations of Large Language Models (LLMs), empirical evidence regarding differences in feedback quality and sentiment characteristics remains limited. This study systematically analyzes the sentiment and stylistic features of feedback generated by 18 contemporary LLMs across more than 6,000 student programming assignments. The analyzed models were Anthropic’s claude-3-5-haiku, claude-opus-4-1, and claude-sonnet-4; DeepSeek’s deepseek-chat and deepseek-reasoner; Google’s gemini-2.0-flash-lite, gemini-2.0-flash, gemini-2.5-flash-lite, gemini-2.5-flash, and gemini-2.5-pro; and OpenAI’s gpt-4.1-mini, gpt-4.1-nano, gpt-4.1, gpt-4o-mini, gpt-4o, gpt-5-mini, gpt-5-nano, and gpt-5. Using automated sentiment analysis with a RoBERTa-based classifier, the study quantified emotional tone distributions and examined relationships between sentiment, feedback length, assigned grades, and task characteristics. Results revealed substantial heterogeneity in feedback properties across models. Average comment length ranged from 42 words (claude-haiku-3.5) to more than 270 words (gemini-2.5-flash), and sentiment distributions also differed markedly across models. Hierarchical clustering uncovered two distinct groups based on sentiment patterns, yet these did not align neatly with model architectures or vendor categories. Feedback sentiment correlated strongly with numerical grades (r = 0.707), while negative comments tended to be slightly longer than positive ones. Consistency across models was limited (ICC = 0.061), indicating wide variation in how different LLMs formulate evaluative judgments for identical student responses. Together, these findings show that automated feedback carries systematic emotional and stylistic differences across models, underscoring the importance of careful model selection and calibration in educational contexts.
Marcin Jukiewicz studied this question.
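To make the cross-model consistency figure (ICC = 0.061) more concrete, the sketch below shows one way an intraclass correlation coefficient could be computed from a matrix of sentiment scores, with one row per student submission and one column per model. The abstract does not state which ICC variant was used, so the one-way random-effects form ICC(1,1) here is an assumption, and the function name `icc_oneway` and the toy scores are purely illustrative, not the study's data or code.

```python
import numpy as np


def icc_oneway(scores):
    """One-way random-effects ICC(1,1) for an (n_items, n_raters) matrix.

    Assumption: rows are submissions, columns are models ("raters").
    This is one common ICC variant, not necessarily the one used in the study.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    row_means = x.mean(axis=1)
    grand_mean = x.mean()
    # Mean square between items: how much submissions differ from each other.
    ms_between = k * np.sum((row_means - grand_mean) ** 2) / (n - 1)
    # Mean square within items: how much models disagree on the same submission.
    ms_within = np.sum((x - row_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical sentiment scores for 4 submissions rated by 3 models.
toy_scores = np.array([
    [0.9, 0.2, 0.5],
    [0.1, 0.8, 0.4],
    [0.6, 0.3, 0.7],
    [0.4, 0.9, 0.1],
])
print(f"ICC(1,1) = {icc_oneway(toy_scores):.3f}")
```

A value near 1 means the models largely agree on which submissions receive more positive feedback; a value near 0, as reported in the abstract, means that differences between models on the same submission dominate the differences between submissions.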