What question did this study set out to answer?

This research aims to evaluate the reliability and effectiveness of AI-generated content detection tools in higher education.

March 21, 2026 · Open Access

Trusting AI to detect AI? A systematic evaluation of the reliability and robustness of current AIGC detection tools for student academic work

Key Points

Constructed three large-scale datasets (StuTask, StuThesis, DataCode) with over 280,000 samples.
Evaluated 13 mainstream detectors (commercial and open-source) on various dimensions.
Analyzed performance across different academic tasks and disciplines.
Detectors perform acceptably on long-form theses but fail systematically on short-form coursework and engineering code.
STEM disciplines are subject to significant algorithmic bias because technical writing is formulaic.
A simple hybrid editing strategy lets 88% of AI-generated content evade detection, showing how vulnerable current detectors are.

Abstract

The explosive proliferation of Generative Artificial Intelligence (GenAI) has posed an unprecedented existential challenge to traditional evaluation systems in higher education. In response to this wave, the higher education sector is undergoing a profound transformation, shifting from a focus on technological containment toward an assessment governance paradigm. Understanding the characteristics of detectors is the foundational prerequisite for utilizing such tools to optimize educational assessment; however, their effectiveness within authentic educational contexts remains insufficiently explored. To address this research gap, the study constructs three large-scale ecological datasets—StuTask, StuThesis, and DataCode—comprising over 280,000 authentic samples of student coursework, academic theses, and engineering code. A systematic evaluation is conducted on 13 mainstream detectors (encompassing both commercial and open-source models) across multiple dimensions, including overall performance, task complexity, disciplinary variations, and adversarial robustness. The results indicate that while detectors achieve acceptable performance on long-form theses, they exhibit systematic failures in engineering code and short-form coursework tasks. Due to the formulaic nature of technical writing, STEM disciplines are subject to significant algorithmic bias. Furthermore, robustness tests reveal extreme vulnerability of current detection tools: a simple hybrid editing strategy can enable 88% of AI-generated content to evade detection successfully. These findings suggest that existing detection technologies are inadequate to support high-stakes educational assessments. In the future educational trajectory of ‘embracing AI,’ AIGC detectors should function as reference metrics within the assessment system, serving to quantify the depth of human-AI collaboration across distinct disciplinary logical frameworks. 
Furthermore, detection technology must evolve toward the optimization of ‘logical innovation recognition,’ thereby establishing a robust academic integrity defense line that is truly resilient within the future ecosystem of human-AI symbiosis.

• Constructed three datasets using authentic student assignments, theses, and code.
• Evaluation of 13 detectors reveals systematic failures across academic tasks and disciplines.
• Acceptable detector performance remains inadequate for high-stakes assessment.
• Hybrid adversarial edits allow 88% of AI-generated content to evade detection.
• Findings urge a shift from reliance on tools to process-oriented assessment.
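The abstract reports an evasion figure (88%) and a disciplinary bias against STEM writing but does not define the underlying metrics. As a rough illustration only, the sketch below assumes one common reading: the evasion rate is the share of AI-generated samples a detector no longer flags after editing, and bias is compared through the per-discipline false positive rate on human-written samples. The record fields, function names, and toy data are all hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DetectionRecord:
    """One detector verdict on one sample (all field names are hypothetical)."""
    discipline: str            # e.g. "STEM" or "Humanities"
    is_ai_generated: bool      # ground-truth label
    flagged: bool              # detector verdict on the original text
    flagged_after_edit: bool   # detector verdict after adversarial editing


def evasion_rate(records: Sequence[DetectionRecord]) -> float:
    """Fraction of AI-generated samples that are NOT flagged after editing."""
    ai = [r for r in records if r.is_ai_generated]
    if not ai:
        raise ValueError("no AI-generated samples to evaluate")
    return sum(not r.flagged_after_edit for r in ai) / len(ai)


def false_positive_rate(records: Sequence[DetectionRecord], discipline: str) -> float:
    """Fraction of human-written samples in one discipline that are wrongly flagged."""
    human = [
        r for r in records
        if not r.is_ai_generated and r.discipline == discipline
    ]
    if not human:
        raise ValueError(f"no human-written samples for {discipline!r}")
    return sum(r.flagged for r in human) / len(human)


# Toy data, invented purely to exercise the functions above.
records = [
    DetectionRecord("STEM", True, True, False),
    DetectionRecord("STEM", True, True, False),
    DetectionRecord("Humanities", True, True, True),
    DetectionRecord("Humanities", True, False, False),
    DetectionRecord("STEM", False, True, False),
    DetectionRecord("STEM", False, False, False),
    DetectionRecord("Humanities", False, False, False),
    DetectionRecord("Humanities", False, False, False),
]

print(evasion_rate(records))                       # 0.75
print(false_positive_rate(records, "STEM"))        # 0.5
print(false_positive_rate(records, "Humanities"))  # 0.0
```

Under this reading, a higher false positive rate in one discipline than another would indicate the kind of disciplinary bias the abstract describes. The study itself may use different metric definitions.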
