March 26, 2026 | Open Access

CalcTutor: Multi-Agent LLM Grading of Handwritten Mathematics with RAG-Grounded Feedback for Adaptive Learning Support

Key points

The study aims to enhance personalized feedback for handwritten calculus assessments in large classroom settings.
Developed CalcTutor, an AI-based grading and feedback system for open-ended calculus problems.
Implemented a multi-agent large language model (LLM) mechanism that evaluates handwritten solution processes and produces diagnostic feedback.
Utilized a retrieval-augmented generation pipeline to link feedback to instructional materials.
Conducted an offline evaluation and a pilot classroom deployment.
Achieved a weighted agreement accuracy of 0.931 and F1-score of 0.934 on 1055 handwritten solutions.
Participant feedback and workflow testing indicated that CalcTutor can be stably integrated into routine classroom use.
Demonstrated the feasibility of a closed-loop assessment and feedback workflow.

Abstract

Personalized instruction remains a major bottleneck in higher education, especially in large classes where timely, individualized feedback is difficult to achieve. Existing automation typically relies on rigid rule-based pipelines or computationally heavy deep learning models, which makes it hard to achieve interpretability, instructional usability, and scalable deployment at the same time. In this study, we present CalcTutor, a generative-AI-based assessment and feedback system designed to support open-ended handwritten calculus problem solving. The system organizes instructional support through three coordinated components: (1) a multi-agent large language model (LLM) mechanism that evaluates solution processes and produces diagnostic feedback, (2) a retrieval-augmented generation (RAG) pipeline that links diagnosed difficulties to aligned instructional materials, and (3) real-time learner analytics for both students and instructors. Together, these components form an integrated instructional support workflow rather than an automated answer-checking tool. In offline evaluation and a pilot classroom deployment, the multi-agent grader achieved a weighted agreement accuracy of 0.931 and an F1-score of 0.934 on 1055 handwritten solutions. Participant feedback and workflow testing indicated that CalcTutor can be stably integrated into routine classroom use and enables students to interpret and act upon the provided feedback. These results indicate that automated assessment, diagnostic feedback, and targeted review can operate coherently within a single instructional process that supports instructor-led assessment practices. Using undergraduate calculus as an application domain for open-ended handwritten mathematical assessment, the study demonstrates the operational feasibility of a closed-loop assessment–feedback–revision workflow and provides deployable infrastructure for formative support in real classroom contexts.
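
To make the second component more concrete, the retrieval step that links a diagnosed difficulty to instructional material might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not describe how CalcTutor retrieves material, so the bag-of-words cosine similarity, the `MATERIALS` mini-corpus, and the function names `tokenize`, `cosine`, and `retrieve` are all hypothetical stand-ins, not the authors' implementation.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus of instructional snippets (not taken from the study).
MATERIALS = {
    "chain-rule": "chain rule derivative of composite function inner outer",
    "u-substitution": "integration by substitution change of variable integral",
    "limits": "limit of function approaching point continuity",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)  # a Counter returns 0 for missing terms
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(diagnosis, materials=MATERIALS, top_k=1):
    """Return the keys of the top_k materials most similar to the diagnosis.

    Materials with zero similarity are never returned.
    """
    query = Counter(tokenize(diagnosis))
    scored = [
        (cosine(query, Counter(tokenize(text))), key)
        for key, text in materials.items()
    ]
    scored.sort(reverse=True)
    return [key for score, key in scored[:top_k] if score > 0]


# Assumed example of a diagnostic message produced by a grader.
print(retrieve("Student forgot the inner derivative of a composite function"))
```

A production RAG pipeline would more likely use dense embeddings and pass the retrieved passages to an LLM to phrase the feedback; the sketch shows only the matching idea of mapping a diagnosis to the closest piece of course material.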
