What question did this study set out to answer?

The aim is to evaluate the performance of AI text detectors on full-length dental manuscripts.

June 11, 2026 · Open Access

AI text detection in dentistry: a comparative analysis across generative models

Key Points

The aim is to evaluate the performance of AI text detectors on full-length dental manuscripts.
120 manuscripts were analyzed, divided into four groups (30 each): GPT-4.5, GPT-4o, DeepSeek-R2, and human-written.
Six detectors (Aidetector, GPTZero, Copyleaks, Originality.AI, Turnitin, DetectingAI) were benchmarked using ROC/AUC analysis, with DeLong tests and Bonferroni correction for comparisons between detectors.
A pre-specified 60% threshold was used for document-level classification, and inter-detector agreement was assessed with Cohen's κ.
Four detectors (Aidetector, GPTZero, Copyleaks, and Originality.AI) showed high discrimination, and each outperformed Turnitin, whose performance was moderate, and DetectingAI, which was ineffective.
No false positives were found among the 30 human-written texts, indicating high specificity.
DeepSeek-R2 texts were the easiest to detect across the tools, and nearly perfect agreement was noted among the top-performing detectors.
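
As an illustration of the document-level classification and agreement steps named above, the following is a minimal sketch, not the authors' code. It assumes a document is called AI-generated when its score is at or above 60% (the source does not say whether the cut-off is inclusive), and all scores are invented toy values, not study data. The function names are hypothetical.

```python
def classify(scores, threshold=60.0):
    """Return 1 (AI-generated) when a 0-100 score meets the threshold, else 0.

    Assumption: the cut-off is inclusive (score >= threshold).
    """
    return [1 if s >= threshold else 0 for s in scores]


def sensitivity_specificity(labels, preds):
    """Sensitivity and specificity for binary labels (1 = AI, 0 = human)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def cohen_kappa(a, b):
    """Cohen's kappa for two raters giving binary calls on the same items."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)  # chance agreement
    if expected == 1:
        return 1.0  # both raters gave one identical constant call
    return (observed - expected) / (1 - expected)


# Toy example only: 4 AI-generated and 4 human-written documents.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
detector_a = [95, 88, 72, 40, 10, 5, 30, 20]   # hypothetical scores
detector_b = [90, 65, 55, 35, 15, 8, 25, 61]   # hypothetical scores

calls_a = classify(detector_a)
calls_b = classify(detector_b)
print(sensitivity_specificity(labels, calls_a))
print(sensitivity_specificity(labels, calls_b))
print(cohen_kappa(calls_a, calls_b))
```

A specificity of 1.0 in this sketch corresponds to the "no false positives" case: none of the human-written documents is scored at or above the threshold.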

Abstract

BACKGROUND: Robust screening for AI-generated scientific text is increasingly required by journals, yet detector performance on full-length manuscripts remains unclear. Six widely used detectors were benchmarked on dentistry manuscripts, and performance was compared across state-of-the-art generators.

METHODS: A total of 120 manuscripts was assembled in four groups (GPT-4.5, GPT-4o, DeepSeek-R2, and human-written; n = 30 each). Aidetector, GPTZero, Copyleaks, Originality.AI, Turnitin, and DetectingAI produced 0-100% scores. Discrimination was assessed with ROC/AUC and DeLong tests with Bonferroni correction. A pre-specified 60% threshold yielded document-level classifications. Inter-detector agreement was quantified with Cohen's κ.

RESULTS: Four detectors showed high discrimination; pairwise AUC differences among these were not significant after Bonferroni correction and each outperformed Turnitin and DetectingAI. Using the 60% cut-off, sensitivities and specificities were high, with no false positives observed among the 30 human texts included in this sample. Agreement was almost perfect among the high performers, slight for Turnitin, and none for DetectingAI. Across tools, DeepSeek-R2 texts were the easiest to detect.

CONCLUSIONS: On full dentistry manuscripts, four detectors showed high discrimination, whereas Turnitin showed moderate performance and DetectingAI was ineffective. A percentage-based 60% decision threshold provided reproducible, manuscript-level calls.

CLINICAL RELEVANCE: These results may help dental editors and reviewers compare detectors for preliminary screening of AI-generated text under controlled conditions.
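
To make the discrimination metric concrete, here is a minimal sketch of how an AUC can be computed from detector scores, using the standard equivalence between the area under the ROC curve and the probability that a randomly chosen AI-generated document scores higher than a randomly chosen human-written one (ties count as one half). It is not the study's analysis code: the DeLong test is not implemented here, the scores are invented toy values, and the number of comparisons passed to the Bonferroni helper is an assumption, since the abstract does not state how many tests were corrected for.

```python
from math import comb


def auc(pos_scores, neg_scores):
    """AUC as the fraction of (AI, human) pairs ranked correctly.

    Each pair where the AI-generated document scores higher counts 1,
    each tie counts 0.5.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def bonferroni_alpha(alpha, n_comparisons):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_comparisons


# Toy example only: hypothetical 0-100 scores from two detectors.
ai_a, human_a = [95, 88, 72, 40], [10, 5, 30, 20]
ai_b, human_b = [90, 65, 55, 35], [15, 8, 25, 61]

print(auc(ai_a, human_a))
print(auc(ai_b, human_b))

# Assumption: if all pairs of six detectors were compared, there would be
# comb(6, 2) = 15 tests; the paper's actual comparison count may differ.
print(bonferroni_alpha(0.05, comb(6, 2)))
```

Unlike sensitivity and specificity at the 60% cut-off, the AUC does not depend on any single threshold, which is why the two kinds of result are reported separately.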

Cite This Study

Villa et al. studied this question.

synapsesocial.com/papers/6a2a4ff180c8f91e7f39ca4c https://doi.org/10.1186/s41073-026-00228-9
