What type of study is this?

This is a Cohort Study (also classified as: Quantitative Study).

October 1, 2025 | Open Access

Machines flunking an exam: Evaluating large language models on course-related open questions

Key Points

LLMs scored lower than students in responding to course-related open questions, especially glossary queries.
Expert- and machine-assigned grades were used to evaluate LLM responses, highlighting performance gaps.
Correlation analysis identified effective metrics for automated assessment of LLM answers.
Refinements to LLM operations are recommended to enhance their utility in educational settings.

Abstract

Large language models (LLMs) have grown rapidly in China since the rise of ChatGPT, touching many fields. These models offer a promising way to answer students’ questions as they learn. However, most relevant research has addressed English-language and multiple-choice questions, namely on general or medical topics; LLMs’ performance in languages such as Chinese and on course-related open questions is less clear. This study therefore evaluates how well LLMs respond to open-ended, course-specific questions in Chinese (glossary, short-answer, and essay questions). Answers from six LLMs were evaluated using expert- and machine-assigned grades. Correlation analysis revealed which metrics were most effective for automated assessment. We then compared the LLMs’ scores to identify the three best-performing models, which were in turn compared with students. Overall, the selected LLMs’ performance was unsatisfactory: these models scored lower than students, especially on glossary questions. We therefore recommend several ways to refine LLMs’ operation. These suggestions can help promote the spread of LLMs and serve as a reference for students and educators when asking course-related open questions. The implications, limitations, and research directions arising from our study are also discussed.
