BACKGROUND: Robust screening for AI-generated scientific text is increasingly required by journals, yet detector performance on full-length manuscripts remains unclear. Six widely used detectors were benchmarked on dentistry manuscripts, and performance was compared across state-of-the-art generators.

METHODS: A total of 120 manuscripts was assembled in four groups (GPT-4.5, GPT-4o, DeepSeek-R2, and human-written; n = 30 each). Aidetector, GPTZero, Copyleaks, Originality.AI, Turnitin, and DetectingAI produced 0-100% scores. Discrimination was assessed with ROC/AUC and DeLong tests with Bonferroni correction. A pre-specified 60% threshold yielded document-level classifications. Inter-detector agreement was quantified with Cohen's κ.

RESULTS: Four detectors showed high discrimination; pairwise AUC differences among these were not significant after Bonferroni correction and each outperformed Turnitin and DetectingAI. Using the 60% cut-off, sensitivities and specificities were high, with no false positives observed among the 30 human texts included in this sample. Agreement was almost perfect among the high performers, slight for Turnitin, and none for DetectingAI. Across tools, DeepSeek-R2 texts were the easiest to detect.

CONCLUSIONS: On full dentistry manuscripts, four detectors showed high discrimination, whereas Turnitin showed moderate performance and DetectingAI was ineffective. A percentage-based 60% decision threshold provided reproducible, manuscript-level calls.

CLINICAL RELEVANCE: These results may help dental editors and reviewers compare detectors for preliminary screening of AI-generated text under controlled conditions.
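As a rough illustration of the metrics named in METHODS, the sketch below thresholds detector scores at 60%, then computes sensitivity, specificity, a rank-based AUC, and Cohen's κ between two detectors. It is not the authors' analysis code. The scores are invented toy values, the function names are hypothetical, and treating a score of exactly 60 as "AI-generated" is an assumption the abstract does not settle. The DeLong test and the Bonferroni correction are not shown.

```python
import math
from typing import Sequence

THRESHOLD = 60.0  # percent; the abstract's pre-specified cut-off


def classify(scores: Sequence[float], threshold: float = THRESHOLD) -> list[int]:
    """Return 1 (flagged as AI) or 0 (human) per document.

    Using >= rather than > at the boundary is an assumption.
    """
    return [1 if s >= threshold else 0 for s in scores]


def sensitivity_specificity(truth: Sequence[int], calls: Sequence[int]) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, c in zip(truth, calls) if t == 1 and c == 1)
    fn = sum(1 for t, c in zip(truth, calls) if t == 1 and c == 0)
    tn = sum(1 for t, c in zip(truth, calls) if t == 0 and c == 0)
    fp = sum(1 for t, c in zip(truth, calls) if t == 0 and c == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(truth: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: share of (AI, human) pairs where the AI text scores higher.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    pos = [s for t, s in zip(truth, scores) if t == 1]
    neg = [s for t, s in zip(truth, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two binary raters: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = sum(1 for x, y in zip(a, b) if x == y) / n
    pa, pb = sum(a) / n, sum(b) / n
    p_e = pa * pb + (1 - pa) * (1 - pb)
    if p_e == 1.0:
        return math.nan  # kappa is undefined when chance agreement is total
    return (p_o - p_e) / (1 - p_e)


# Toy data only (NOT from the study): 1 = AI-generated, 0 = human-written.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
scores_a = [95, 88, 72, 40, 10, 5, 30, 20]   # hypothetical detector A
scores_b = [99, 61, 55, 35, 12, 70, 8, 15]   # hypothetical detector B

calls_a = classify(scores_a)
calls_b = classify(scores_b)
sens_a, spec_a = sensitivity_specificity(truth, calls_a)
auc_a = auc(truth, scores_a)
auc_b = auc(truth, scores_b)
kappa_ab = cohen_kappa(calls_a, calls_b)
print(sens_a, spec_a, auc_a, auc_b, kappa_ab)
```

Keeping the threshold step separate from the AUC step reflects the distinction the abstract draws: AUC summarizes discrimination across all possible cut-offs, while sensitivity, specificity, and κ depend on the document-level calls produced at the single 60% cut-off.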
Villa et al. studied this question.