This study introduces a large language model (LLM)–driven feedback system that assesses and enhances students' programming work through semantic comparison and pedagogically contextualized feedback. Unlike traditional grading systems, it compares each Python submission against a reference solution and generates feedback along three dimensions: logic, style, and performance. The system uses sentence-embedding-based semantic similarity to measure how closely a submission aligns with the reference and adapts the feedback to submission quality. Thirty-one student solutions, including both reference-level and imperfect submissions, were evaluated. The results show a mean similarity score of 0.56 (SD = 0.19) and a moderate inverse correlation (r = −0.65) between feedback length and similarity, which is consistent with adaptive behavior: submissions less similar to the reference tended to receive longer feedback. Visual analyses, including the category-based distribution of feedback, similarity patterns, and solution clustering, further support the validity and explainability of the system. The approach supports reproducibility through transparently defined reference tasks, embedding-based similarity scoring, and qualitative pattern analysis. The system has implications for AI-facilitated formative feedback, large-scale code assessment, and adaptive tutoring in computer science education.
Theeb et al. (Mon,) studied this question.
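The two quantities reported in the abstract, embedding similarity and the correlation between feedback length and similarity, can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and Pearson's r as the correlation statistic; the abstract names neither explicitly. The function names, the three-dimensional vectors, and the feedback lengths are hypothetical placeholders, not the study's data or implementation. In practice the vectors would come from a sentence-embedding model.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (assumed measure)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical toy embeddings; real ones would be model outputs.
    reference = [0.2, 0.7, 0.1]
    submissions = [
        [0.2, 0.7, 0.1],   # identical to the reference
        [0.4, 0.5, 0.3],   # partially aligned
        [0.9, 0.1, 0.6],   # weakly aligned
    ]
    # Hypothetical feedback lengths (in words) for the same submissions.
    feedback_lengths = [40, 95, 180]

    similarities = [cosine_similarity(reference, s) for s in submissions]
    print("similarities:", [round(s, 3) for s in similarities])
    print("r(length, similarity):", round(pearson_r(feedback_lengths, similarities), 3))
```

A negative coefficient on such data would indicate that longer feedback accompanies lower similarity, which is the direction of the relationship the abstract reports.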