What question did this study set out to answer?

The study aims to measure the prerequisites for large language models to function effectively in language education as assistants and teammates.

February 9, 2026 · Open Access

CPG-EVAL: Evaluating the Readiness of Large Language Models as Assistants and Teammates in Language Teaching

Key Points

Development of the Chinese Pedagogical Grammar Evaluation (CPG-EVAL) benchmark
Assessment of pedagogical grammar pattern recognition (P-GPR) in multiple LLMs and human participants
Comparison of performance across different model sizes and task complexities
Human participants outperformed all evaluated large language models
Larger models performed better than medium and smaller models
LLMs showed sensitivity to task-format complexity with consistent failures in certain tasks

Abstract

Large language models (LLMs) have begun to function as assistants or teammates in language learning, teaching, and research. However, what prerequisites are required for LLMs to reliably play these roles, and how such prerequisites should be measured, remain under-discussed. This study focuses on measuring Pedagogical Grammar Pattern Recognition (P-GPR) and establishes the Chinese Pedagogical Grammar Evaluation (CPG-EVAL), a multi-tiered benchmark designed to evaluate P-GPR within International Chinese Language Education. CPG-EVAL operationalizes grammar–instance correspondence through five task types that progressively increase contextual load and interference. We evaluate multiple proprietary and open-source LLMs as well as human participants. Results show a monotonic ordering across groups (humans > larger-scale models > semi-larger-scale models > smaller-scale models). Compared with human participants, LLM performance is more sensitive to task-format complexity. In addition, we identify a set of completely failed items that consistently mislead all evaluated LLMs, exposing shared and systematic weaknesses in current models’ pedagogical grammar recognition. Overall, this study provides an operational framework for diagnosing the capabilities and risks of LLMs when they are deployed as assistants or teammates in grammar-related language-education tasks, and it offers an empirical reference for safer and more syllabus-aligned use of LLMs in educational settings.
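The abstract describes two kinds of analysis: accuracy compared across groups and task types, and a set of "completely failed items" that every evaluated LLM gets wrong. The sketch below illustrates how such quantities could be computed from per-item results. It is not the authors' evaluation code. The record layout, the model and task names, and the toy values are all assumptions made for illustration, and none of the numbers come from the study.

```python
from collections import defaultdict

# Hypothetical toy records: (model, task_type, item_id, is_correct).
# Illustrative values only; these are NOT results from the study.
records = [
    ("model_a", "task_1", "item_01", True),
    ("model_a", "task_1", "item_02", False),
    ("model_a", "task_2", "item_03", False),
    ("model_b", "task_1", "item_01", True),
    ("model_b", "task_1", "item_02", False),
    ("model_b", "task_2", "item_03", True),
]


def accuracy_by_model_and_task(records):
    """Return {(model, task_type): fraction of items answered correctly}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for model, task, _item, ok in records:
        totals[(model, task)] += 1
        correct[(model, task)] += int(ok)
    return {key: correct[key] / totals[key] for key in totals}


def completely_failed_items(records):
    """Return sorted item ids that every model attempted and every model got wrong."""
    models = {model for model, _task, _item, _ok in records}
    attempted_by = defaultdict(set)
    failed_by = defaultdict(set)
    for model, _task, item, ok in records:
        attempted_by[item].add(model)
        if not ok:
            failed_by[item].add(model)
    return sorted(
        item
        for item in attempted_by
        if attempted_by[item] == models and failed_by[item] == models
    )


accuracy = accuracy_by_model_and_task(records)
failed = completely_failed_items(records)
```

In this toy data, `item_02` is the only item that both hypothetical models answer incorrectly, so it is the only one the second function returns. Comparing the accuracy values across task types for one model is one simple way to look at the sensitivity to task format that the abstract mentions.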
