What question did this study set out to answer?

The study aims to assess how well radiologists and AI models can differentiate between ChatGPT-generated synthetic radiographs and real images.

March 26, 2026

The Rise of Deepfake Medical Imaging: Radiologists’ Diagnostic Accuracy in Detecting ChatGPT-generated Radiographs

Key Points

The study assessed how well radiologists and multimodal LLMs can distinguish ChatGPT-generated synthetic radiographs from authentic clinical images.
This retrospective diagnostic accuracy study included 17 practicing radiologists from six countries.
In phase 1, the radiologists, blinded to the study's purpose, assessed image quality and provided diagnoses for 154 radiographs (77 synthetic, 77 authentic).
In phase 2, after being informed of the study's purpose, the radiologists classified each radiograph as GPT-4o-generated or authentic.
In phase 3, readers and LLMs classified an additional 110 chest radiographs (55 synthetic images generated using RoentGen, 55 authentic).
The McNemar test and t test were used for comparisons.
Forty-one percent (seven of 17) of purpose-blinded radiologists spontaneously identified AI-generated radiographs as being present in the dataset.
There was no evidence of a difference in overall radiologist accuracy between the GPT-4o dataset (75%) and the RoentGen dataset (70%; P = .07).
LLM performance varied: on GPT-4o-generated radiographs, GPT-4o (85%) and GPT-5 (83%) were more accurate than Llama 4 Maverick (59%) and Gemini 2.5 Pro (56%).
Common synthetic radiograph features included bilateral symmetry and overly smooth textures.
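
The key points name the McNemar test for comparisons. As a rough illustration only, the sketch below shows how an exact two-sided McNemar test could be computed for two classifiers scored on the same images. The function name `mcnemar_exact` and the per-image correctness values are hypothetical and are not taken from the study; the study's actual analysis pipeline is not described here.

```python
from math import comb


def mcnemar_exact(first_correct, second_correct):
    """Exact two-sided McNemar test for paired binary outcomes.

    Each input holds one boolean per image: True if that rater
    classified the image correctly. Returns (b, c, p_value).
    """
    if len(first_correct) != len(second_correct):
        raise ValueError("paired inputs must have the same length")
    # Discordant pairs: exactly one of the two raters was correct.
    b = sum(1 for x, y in zip(first_correct, second_correct) if x and not y)
    c = sum(1 for x, y in zip(first_correct, second_correct) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical correctness of two models on the same 10 images.
model_a = [True] * 8 + [False] * 2
model_b = [True, True, True, False, False, False, False, False, True, True]
b, c, p_value = mcnemar_exact(model_a, model_b)
print(f"discordant pairs: b={b}, c={c}, p={p_value:.4f}")
```

Only the discordant pairs enter the test, which is why it suits designs like this one, where every rater sees the same set of images.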

Abstract

Background: Large language models (LLMs) can generate realistic synthetic medical images (deepfakes), which raise concerns about potential misuse.

Purpose: To assess the ability of radiologists and multimodal LLMs to distinguish ChatGPT-generated synthetic radiographs from authentic clinical images.

Materials and Methods: This retrospective diagnostic accuracy study conducted between April and August 2025 included 17 practicing radiologists from six countries with varying experience levels. In phase 1, the radiologists, blinded to the purpose of the study, assessed image quality and provided diagnoses for 154 radiographs from multiple anatomic regions (77 synthetic images generated using ChatGPT [GPT-4o; OpenAI] and 77 authentic images). In phase 2, after being informed of the study's purpose, the radiologists determined whether randomly presented radiographs were GPT-4o-generated or authentic. The same classification task was performed by four LLMs: GPT-4o, GPT-5 (OpenAI), Gemini 2.5 Pro (Google), and Llama 4 Maverick (Meta). In phase 3, an additional set of 110 chest radiographs (55 synthetic images generated using RoentGen and 55 authentic images) was analyzed to evaluate the performance of readers and LLMs in distinguishing synthetic versus authentic images. The McNemar test and t test were used for comparisons.

Results: Forty-one percent (seven of 17) of purpose-blinded radiologists spontaneously identified artificial intelligence-generated radiographs as being present in the dataset. After being informed that some radiographs were synthetic, there was no evidence of a difference in overall accuracy among all 17 radiologists in distinguishing synthetic images in the GPT-4o dataset (75%; 95% CI: 68, 81) versus in the RoentGen dataset (70%; 95% CI: 62, 78; P = .07). No tested LLM detected all synthetic radiographs in either dataset; however, GPT-4o-generated radiographs were more accurately differentiated from authentic ones by GPT-4o (accuracy, 85%) and GPT-5 (accuracy, 83%) compared with Llama 4 Maverick (accuracy, 59%) and Gemini 2.5 Pro (accuracy, 56%) (all P […] https://noneedanick.github.io/DeepFakeXRay/.

© RSNA, 2026

Supplemental material is available for this article.

See also the editorial by Bhayana and Krishna in this issue.
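
The accuracy figures reported above come from a binary task: each image is either synthetic or authentic, and each reader or model calls it one or the other. As a minimal sketch, the code below computes accuracy, sensitivity, and specificity from such calls and attaches a Wilson score interval to a proportion. The function names and the example labels are hypothetical. The abstract does not say how its 95% CIs were computed, so the Wilson interval here is an assumed stand-in and not the authors' method.

```python
from math import sqrt


def detection_metrics(is_synthetic, called_synthetic):
    """Accuracy, sensitivity, and specificity for synthetic-image detection."""
    pairs = list(zip(is_synthetic, called_synthetic))
    tp = sum(1 for truth, call in pairs if truth and call)
    fn = sum(1 for truth, call in pairs if truth and not call)
    tn = sum(1 for truth, call in pairs if not truth and not call)
    fp = sum(1 for truth, call in pairs if not truth and call)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # synthetic images correctly flagged
        "specificity": tn / (tn + fp),  # authentic images correctly passed
    }


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


# Hypothetical ground truth and reader calls for 10 images.
truth = [True] * 5 + [False] * 5
calls = [True, True, True, True, False, False, False, False, True, True]
metrics = detection_metrics(truth, calls)
low, high = wilson_interval(successes=7, total=10)
print(metrics, (round(low, 3), round(high, 3)))
```

Reporting sensitivity and specificity alongside accuracy shows whether errors come mainly from missed synthetic images or from authentic images flagged as synthetic.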
