Large language models (LLMs) have begun to function as assistants or teammates in language learning, teaching, and research. However, what prerequisites LLMs must meet to play these roles reliably, and how those prerequisites should be measured, remain under-discussed. This study focuses on measuring Pedagogical Grammar Pattern Recognition (P-GPR) and establishes the Chinese Pedagogical Grammar Evaluation (CPG-EVAL), a multi-tiered benchmark designed to evaluate P-GPR within International Chinese Language Education. CPG-EVAL operationalizes grammar–instance correspondence through five task types that progressively increase contextual load and interference. We evaluate multiple proprietary and open-source LLMs as well as human participants. Results show a monotonic ordering across groups (humans > larger-scale models > semi-larger-scale models > smaller-scale models). LLM performance is more sensitive to task-format complexity than human performance. In addition, we identify a set of completely failed items that consistently mislead all evaluated LLMs, exposing shared and systematic weaknesses in current models’ pedagogical grammar recognition. Overall, this study provides an operational framework for diagnosing the capabilities and risks of LLMs deployed as assistants or teammates in grammar-related language-education tasks, and offers an empirical reference for safer and more syllabus-aligned use of LLMs in educational settings.
Dong Wang (Fri,) studied this question.