What question did this study set out to answer?

The study set out to evaluate and compare the performance of eight AI text detection platforms on a multi-domain document dataset.

May 21, 2026 · Open Access

Comparative Confidence Benchmarking of AI Detection Platforms Across Human and AI-Generated Multi-Domain Documents

Key points

Benchmark analysis of eight AI text detection platforms across 125 documents.
Platform outputs analyzed as continuous confidence percentages.
Evaluation across multiple dimensions including calibration and sensitivity.
All eight systems achieved perfect classification under conventional thresholds on this benchmark dataset.
Substantial differences emerged in confidence magnitude, calibration behaviour, and scoring consistency.
The findings suggest that detector benchmarking should move beyond binary accuracy to include comparative confidence evaluation.

Abstract

This study presents a comparative benchmark analysis of eight AI text detection platforms using a multi-domain dataset comprising 125 documents, including both human-written and AI-generated texts. Each document was evaluated by WordBinary, QuillBot, Originality, Grammarly, Copyleaks, NoteGPT, GPTZero, and Turnitin, with platform outputs analysed as continuous AI confidence percentages rather than binary classifications. The study evaluates detector behaviour across multiple dimensions, including confidence scoring intensity, human-versus-AI separation, inter-platform agreement, calibration, stability, threshold robustness, domain sensitivity, and generator sensitivity. Results show that while all evaluated systems achieved perfect discrimination under conventional classification thresholds in this benchmark dataset, substantial differences emerged in confidence magnitude, calibration behaviour, and scoring consistency. These findings suggest that AI detector benchmarking should move beyond binary accuracy and instead evaluate comparative confidence behaviour across platforms. This Zenodo record contains the benchmark preprint manuscript.
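To make the distinction between binary accuracy and confidence behaviour concrete, here is a minimal sketch in Python. It is not the study's method or data: the detector names, the scores, and the two helper functions (`separation_gap`, `perfect_threshold_band`) are hypothetical, and the rule "flag as AI when the score is at or above the threshold" is an assumption made for illustration.

```python
from statistics import mean


def separation_gap(human_scores, ai_scores):
    """Mean AI-confidence on AI texts minus mean AI-confidence on human texts."""
    return mean(ai_scores) - mean(human_scores)


def perfect_threshold_band(human_scores, ai_scores):
    """Return (low, high) such that any threshold t with low < t <= high
    classifies every document correctly, assuming a document is flagged
    as AI when its score is >= t. Return None if no such threshold exists.
    """
    low, high = max(human_scores), min(ai_scores)
    return (low, high) if low < high else None


# Hypothetical AI-confidence percentages; NOT values from the benchmark.
toy_scores = {
    "detector_a": {"human": [2.0, 5.0, 1.0], "ai": [98.0, 95.0, 99.0]},
    "detector_b": {"human": [20.0, 30.0, 25.0], "ai": [70.0, 60.0, 65.0]},
}

for name, s in toy_scores.items():
    gap = separation_gap(s["human"], s["ai"])
    band = perfect_threshold_band(s["human"], s["ai"])
    print(f"{name}: separation gap = {gap:.1f}, perfect-threshold band = {band}")
```

Both toy detectors classify every document correctly at a 50% threshold, yet the second has a much smaller separation gap and a narrower band of thresholds that keep it perfect. That is the kind of difference a binary accuracy figure hides.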
