What question did this study set out to answer?

This study explores how Large Language Models can automate the linking of assessment items to knowledge components, with the aim of improving efficiency and accuracy.

April 25, 2026 · Open Access

Large Language Models for Learning Technologies: Structural and Functional Evaluation of Automated Q-Matrix Generation for Educational Assessment

Key Points

Explored how Large Language Models (LLMs) can automate the linking of assessment items to knowledge components, with the aim of improving efficiency and accuracy.
Developed a dual-level computational evaluation framework assessing both structural alignment with expert-defined Q-matrices and functional predictive performance.
Conducted controlled experiments comparing zero-shot, few-shot, and chain-of-thought prompting for Q-matrix generation.
Used RMSE and MAE to evaluate the predictive performance of the generated Q-matrices within the DINA cognitive diagnostic model.
Few-shot prompting substantially improved structural alignment compared with zero-shot configurations.
Chain-of-thought prompting provided only marginal improvements in alignment.
Predictive performance remained stable and comparable to that of expert-defined Q-matrices, indicating that the diagnostic model is functionally robust to these structural variations.
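
To make the structural side of the evaluation concrete, the following is a minimal sketch of how a generated Q-matrix might be compared with an expert-defined one. The page does not name the specific similarity and discrepancy metrics used in the study, so cell-wise accuracy, Hamming distance, and Jaccard similarity are assumptions chosen for illustration. The function name `q_matrix_alignment` and the toy matrices are hypothetical.

```python
import numpy as np


def q_matrix_alignment(q_expert, q_generated):
    """Compare two binary Q-matrices of shape (items, knowledge components).

    The three metrics below are illustrative assumptions, not necessarily
    the metrics used in the study.
    """
    q_expert = np.asarray(q_expert, dtype=bool)
    q_generated = np.asarray(q_generated, dtype=bool)
    if q_expert.shape != q_generated.shape:
        raise ValueError("Q-matrices must have the same shape")

    agree = q_expert == q_generated
    intersection = np.logical_and(q_expert, q_generated).sum()
    union = np.logical_or(q_expert, q_generated).sum()

    return {
        # Share of item-KC cells on which both matrices agree.
        "cell_accuracy": float(agree.mean()),
        # Number of item-KC cells on which they disagree.
        "hamming_distance": int((~agree).sum()),
        # Overlap of the item-KC links (1-entries) only.
        "jaccard": float(intersection / union) if union else 1.0,
    }


# Hypothetical toy data: 4 items, 3 knowledge components.
q_expert = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
q_llm = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 1]]

metrics = q_matrix_alignment(q_expert, q_llm)
print(metrics)
```

Jaccard similarity is included alongside cell accuracy because Q-matrices are often sparse, and agreement on the many 0-entries can make cell accuracy look high even when the item-KC links themselves differ.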

Abstract

Linking assessment items to knowledge components (KCs) is essential for adaptive learning and cognitive diagnosis but remains a labor-intensive and expert-dependent process. This study investigates the use of Large Language Models (LLMs) to automate item–KC association through Q-matrix generation, using the Q-matrix as an interpretable representation of item–attribute mappings. In contrast to prior studies that focus solely on tagging accuracy or linguistic classification, we propose a dual-level computational evaluation framework that jointly assesses (i) structural alignment with expert-defined Q-matrices using formal similarity and discrepancy metrics, and (ii) functional impact on predictive performance within the DINA cognitive diagnostic model using RMSE and MAE under controlled experimental settings. The results show that few-shot prompting substantially improves structural alignment compared with zero-shot configurations, whereas chain-of-thought prompting provides only marginal refinements. Despite structural variations across prompting strategies, predictive performance remains stable and comparable to that obtained with expert-defined Q-matrices, indicating the functional robustness of the diagnostic model. The main contribution of this study is an integrated structural–functional evaluation protocol for LLM-based knowledge tagging, which offers empirical evidence that automated item–KC linking can support scalable assessment design without degrading diagnostic accuracy.
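
The functional side of the evaluation can be sketched in the same spirit. Under the standard DINA model, a learner answers an item correctly with probability 1 - slip if they have mastered every KC the item requires, and with probability guess otherwise. The sketch below applies that rule to a Q-matrix and scores the predictions with RMSE and MAE. It assumes fixed slip and guess values and known mastery profiles purely for illustration; in practice these would be estimated from response data, and the study's actual estimation procedure is not described here. All names and toy values are hypothetical.

```python
import numpy as np


def dina_probabilities(alpha, q, slip, guess):
    """Standard DINA response probabilities.

    alpha: (learners, KCs) binary mastery profiles.
    q:     (items, KCs) binary Q-matrix.
    slip, guess: per-item parameters of length `items`.
    """
    alpha = np.asarray(alpha)
    q = np.asarray(q)
    slip = np.asarray(slip, dtype=float)
    guess = np.asarray(guess, dtype=float)

    # missing[i, j] = number of KCs required by item j that learner i lacks.
    missing = (1 - alpha) @ q.T
    # eta[i, j] = 1 when learner i has mastered every KC item j requires.
    eta = (missing == 0).astype(float)
    return eta * (1 - slip) + (1 - eta) * guess


def rmse(observed, predicted):
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


def mae(observed, predicted):
    return float(np.mean(np.abs(observed - predicted)))


# Hypothetical toy data: 2 learners, 4 items, 3 knowledge components.
q = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
alpha = [[1, 1, 0], [1, 1, 1]]
slip = [0.1, 0.1, 0.1, 0.1]   # assumed, not estimated
guess = [0.2, 0.2, 0.2, 0.2]  # assumed, not estimated
responses = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])

probs = dina_probabilities(alpha, q, slip, guess)
print("RMSE:", rmse(responses, probs))
print("MAE:", mae(responses, probs))
```

Running the same scoring once with an expert-defined Q-matrix and once with an LLM-generated one, on the same responses, gives the kind of side-by-side RMSE and MAE comparison the abstract describes.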
