March 3, 2026 · Open Access

AI and Measurement Concerns: Dealing with Imbalanced Data in Autoscoring

Key Points

Mitigating bias in proficiency estimates enhances prediction accuracy for autoscoring systems.
All examined methods—resampling, active resampling, synonym replacement, and generative AI—effectively improved performance.
Comparing metrics such as accuracy, QWK, and F1 showed that performance varied with the data balance and the method used.
Findings highlight the importance of data representativeness, suggesting targeted data strategies can enhance outcomes.

Abstract

Unbiased proficiency estimates are important for autoscoring engines because the outcomes may be used for future learning or placement. Imbalanced training data may introduce bias and lower the prediction accuracy of classification algorithms. In this article, we investigated several data augmentation methods to reduce the negative effect of imbalanced data in measurement settings. Four approaches were examined: (1) resampling methods, either oversampling or undersampling; (2) active resampling methods, where the resampling weight is based on representativeness in the training set; (3) data expansion methods using synonym replacement, slightly changing the meaning or semantics of the original answers; and (4) a content recreation method using generative AI (e.g., ChatGPT) to create responses for less populated scores. We compared the performance (e.g., accuracy, QWK, F1) as well as the distance metric for different combinations of the methods. Two datasets with different imbalanced distributions were used. Results show that all four methods can help to mitigate the bias issue, and that their efficacy was influenced by the imbalance level, the representativeness of the original data, and the increase in the variety of the responses (i.e., lexical diversity). In general, resampling and GenAI with active resampling showed the best overall performance.
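To make the first approach and the QWK metric concrete, the following is a minimal sketch, not the authors' implementation. It assumes integer score labels and plain random oversampling up to the size of the largest score category. The function names `random_oversample` and `quadratic_weighted_kappa` are hypothetical and chosen only for illustration.

```python
import random


def random_oversample(responses, scores, seed=0):
    """Duplicate randomly chosen responses in less populated score
    categories until every category matches the largest one.
    (Illustrative assumption: balance to the majority-class size.)"""
    rng = random.Random(seed)
    by_score = {}
    for text, score in zip(responses, scores):
        by_score.setdefault(score, []).append(text)
    target = max(len(items) for items in by_score.values())

    out_texts, out_scores = [], []
    for score, items in sorted(by_score.items()):
        # Sample with replacement to fill the gap to the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_scores.append(score)
    return out_texts, out_scores


def quadratic_weighted_kappa(y_true, y_pred, min_score, max_score):
    """Quadratic weighted kappa (QWK) between two integer score lists.
    Assumes at least two score categories (max_score > min_score)."""
    n = max_score - min_score + 1
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_score][p - min_score] += 1

    total = len(y_true)
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            num += weight * observed[i][j]
            den += weight * row[i] * col[j] / total
    if den == 0:
        # Both raters used a single identical category; treat as agreement.
        return 1.0
    return 1.0 - num / den
```

In a setup like this, oversampling would be applied to the training split only, and QWK would be computed on an untouched test split so that the duplicated responses do not inflate the agreement estimate.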
