The explosive proliferation of Generative Artificial Intelligence (GenAI) poses an unprecedented challenge to traditional evaluation systems in higher education. In response, the higher education sector is undergoing a profound transformation, shifting from a focus on technological containment toward an assessment governance paradigm. Understanding the characteristics of AI-generated content (AIGC) detectors is a prerequisite for using such tools to optimize educational assessment; however, their effectiveness in authentic educational contexts remains insufficiently explored. To address this research gap, the study constructs three large-scale ecological datasets (StuTask, StuThesis, and DataCode) comprising over 280,000 authentic samples of student coursework, academic theses, and engineering code. It systematically evaluates 13 mainstream detectors, both commercial and open-source, across multiple dimensions: overall performance, task complexity, disciplinary variation, and adversarial robustness. The results indicate that while detectors achieve acceptable performance on long-form theses, they fail systematically on engineering code and short-form coursework. STEM disciplines are subject to significant algorithmic bias, which the study attributes to the formulaic nature of technical writing. Robustness tests further reveal that current detection tools are extremely vulnerable: a simple hybrid editing strategy allows 88% of AI-generated content to evade detection. These findings suggest that existing detection technologies are inadequate to support high-stakes educational assessment. In a future educational trajectory of ‘embracing AI,’ AIGC detectors should function as reference metrics within the assessment system, serving to quantify the depth of human-AI collaboration across distinct disciplinary logical frameworks. Detection technology must also evolve toward ‘logical innovation recognition,’ thereby establishing an academic integrity defense line that remains resilient within the future ecosystem of human-AI symbiosis.

• Constructed three datasets using authentic student assignments, theses, and code.
• Evaluation of 13 detectors reveals systematic failures across academic tasks and disciplines.
• Acceptable detector performance remains inadequate for high-stakes assessment.
• Hybrid adversarial edits allow 88% of AI-generated content to evade detection.
• Findings urge a shift from reliance on tools to process-oriented assessment.
Sun et al. (Sun,) studied this question.