Multiple-choice questions have become ubiquitous in educational measurement because the format allows for efficient and accurate scoring. Nonetheless, interest in constructed-response formats persists. This interest has driven efforts to develop computer-based scoring procedures that can score these items accurately and efficiently. Early procedures were typically based on surface features of the responses or on simple matching, but recent developments in natural language processing have allowed for much more sophisticated approaches. This paper reports on a state-of-the-art methodology for scoring short-answer questions that is supported by a large language model. Responses were collected in the context of a high-stakes test for medical students. More than 35,000 responses were collected across 71 studied items. Aggregated across all responses, the proportion of agreement with human scores ranged from .97 to .99, depending on specifics such as training sample size. In addition to reporting detailed results, the paper discusses practical issues that require consideration when adopting this type of scoring system.
Clauser et al. studied this question.