Personalized instruction remains a major bottleneck in higher education, especially in large classes where timely, individualized feedback is difficult to provide. Existing automation typically relies on rigid rule-based pipelines or computationally heavy deep learning models, which makes it difficult to achieve interpretability, instructional usability, and scalable deployment at the same time. In this study, we present CalcTutor, a generative-AI-based assessment and feedback system designed to support open-ended handwritten calculus problem solving. The system organizes instructional support through three coordinated components: (1) a multi-agent large language model (LLM) mechanism that evaluates solution processes and produces diagnostic feedback, (2) a retrieval-augmented generation (RAG) pipeline that links diagnosed difficulties to aligned instructional materials, and (3) real-time learner analytics for both students and instructors. Together, these components form an integrated instructional support workflow rather than an automated answer-checking tool. In offline evaluation and a pilot classroom deployment, the multi-agent grader achieved a weighted agreement accuracy of 0.931 and an F1-score of 0.934 on 1055 handwritten solutions. Participant feedback and workflow testing indicated that CalcTutor can be stably integrated into routine classroom use and that students can interpret and act upon the feedback it provides. These results indicate that automated assessment, diagnostic feedback, and targeted review can operate coherently within a single instructional process that supports instructor-led assessment practices. Using undergraduate calculus as an application domain for open-ended handwritten mathematical assessment, the study demonstrates the operational feasibility of a closed-loop assessment–feedback–revision workflow and provides a deployable infrastructure for formative instructional support in real classroom contexts.
Tan et al. studied this question.