This study presents a comparative benchmark analysis of eight AI text detection platforms using a multi-domain dataset comprising 125 documents, including both human-written and AI-generated texts. Each document was evaluated by WordBinary, QuillBot, Originality, Grammarly, Copyleaks, NoteGPT, GPTZero, and Turnitin, with platform outputs analysed as continuous AI confidence percentages rather than binary classifications. The study evaluates detector behaviour across multiple dimensions, including confidence scoring intensity, human-versus-AI separation, inter-platform agreement, calibration, stability, threshold robustness, domain sensitivity, and generator sensitivity. Results show that while all evaluated systems achieved perfect discrimination under conventional classification thresholds in this benchmark dataset, substantial differences emerged in confidence magnitude, calibration behaviour, and scoring consistency. These findings suggest that AI detector benchmarking should move beyond binary accuracy and instead evaluate comparative confidence behaviour across platforms. This Zenodo record contains the benchmark preprint manuscript.
Dr Charles Montgomery (Tue,) studied this question.
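The abstract's central observation, that detectors can all reach perfect accuracy at a conventional threshold while differing in confidence behaviour, can be illustrated with a minimal sketch. The function names, the 50% threshold, and the score values below are hypothetical and chosen only for illustration; they are not taken from the benchmark dataset or from any platform's output.

```python
from statistics import mean


def accuracy_at(threshold, human_scores, ai_scores):
    """Fraction of documents classified correctly when scores at or above
    `threshold` are labelled AI and scores below it are labelled human."""
    correct = sum(s < threshold for s in human_scores)
    correct += sum(s >= threshold for s in ai_scores)
    return correct / (len(human_scores) + len(ai_scores))


def separation_summary(human_scores, ai_scores):
    """Summarise how strongly one detector separates the two classes.

    Scores are AI-confidence percentages in the range 0 to 100.
    """
    max_human = max(human_scores)
    min_ai = min(ai_scores)
    return {
        # Difference between the average AI score and the average human score.
        "mean_gap": mean(ai_scores) - mean(human_scores),
        # Width of the threshold range (max_human, min_ai] in which every
        # document is still classified correctly; 0 if the classes overlap.
        "band_width": max(0, min_ai - max_human),
    }


# Hypothetical scores for two imaginary detectors (illustrative only).
detector_a = {"human": [1, 3, 2, 5], "ai": [97, 99, 95, 100]}
detector_b = {"human": [20, 35, 28, 40], "ai": [60, 72, 55, 81]}

for name, scores in (("A", detector_a), ("B", detector_b)):
    acc = accuracy_at(50, scores["human"], scores["ai"])
    summary = separation_summary(scores["human"], scores["ai"])
    print(name, acc, summary)
```

In this toy example both detectors classify every document correctly at a 50% threshold, yet detector A leaves a much wider margin between its highest human score and its lowest AI score than detector B does. Differences of this kind are invisible to binary accuracy, which is why the comparison is made on continuous confidence scores.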