Background Large language models (LLMs) can generate realistic synthetic medical images (deepfakes), which raise concerns about potential misuse.
Purpose To assess the ability of radiologists and multimodal LLMs to distinguish ChatGPT-generated synthetic radiographs from authentic clinical images.
Materials and Methods This retrospective diagnostic accuracy study conducted between April and August 2025 included 17 practicing radiologists from six countries with varying experience levels. In phase 1, the radiologists, blinded to the purpose of the study, assessed image quality and provided diagnoses for 154 radiographs from multiple anatomic regions (77 synthetic images generated using ChatGPT [GPT-4o; OpenAI] and 77 authentic images). In phase 2, after being informed of the study's purpose, the radiologists determined whether randomly presented radiographs were GPT-4o-generated or authentic. The same classification task was performed by four LLMs: GPT-4o, GPT-5 (OpenAI), Gemini 2.5 Pro (Google), and Llama 4 Maverick (Meta). In phase 3, an additional set of 110 chest radiographs (55 synthetic images generated using RoentGen and 55 authentic images) was analyzed to evaluate the performance of readers and LLMs in distinguishing synthetic versus authentic images. The McNemar test and t test were used for comparisons.
Results Forty-one percent (seven of 17) of purpose-blinded radiologists spontaneously identified artificial intelligence-generated radiographs as being present in the dataset. After being informed that some radiographs were synthetic, there was no evidence of a difference in overall accuracy among all 17 radiologists in distinguishing synthetic images in the GPT-4o dataset (75% [95% CI: 68, 81]) versus in the RoentGen dataset (70% [95% CI: 62, 78]; P = .07).
No tested LLM detected all synthetic radiographs in either dataset; however, GPT-4o-generated radiographs were more accurately differentiated from authentic ones by GPT-4o (accuracy, 85%) and GPT-5 (accuracy, 83%) compared with Llama 4 Maverick (accuracy, 59%) and Gemini 2.5 Pro (accuracy, 56%) (all P https://noneedanick.github.io/DeepFakeXRay/. © RSNA, 2026 Supplemental material is available for this article. See also the editorial by Bhayana and Krishna in this issue.
Tordjman et al. (Sun,) studied this question.
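The abstract names the McNemar test for paired comparisons but does not say which variant was used or how the data were tabulated. As a rough illustration only, the sketch below assumes the exact (binomial) variant applied to two readers who classify the same images. The function names `mcnemar_exact` and `accuracy` and the toy labels are hypothetical and are not taken from the study.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: images reader A classified correctly and reader B incorrectly
    c: images reader A classified incorrectly and reader B correctly
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, each discordant pair is equally likely
    # to favor either reader, so the smaller count is Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def accuracy(calls, truth):
    """Fraction of images labeled correctly."""
    return sum(call == t for call, t in zip(calls, truth)) / len(truth)


# Hypothetical toy data (NOT study data): 1 = synthetic, 0 = authentic.
truth    = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
reader_a = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
reader_b = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

# Count only the images on which the two readers disagree in correctness.
b = sum(a == t and r != t for a, r, t in zip(reader_a, reader_b, truth))
c = sum(a != t and r == t for a, r, t in zip(reader_a, reader_b, truth))

print(f"accuracy A = {accuracy(reader_a, truth):.2f}")
print(f"accuracy B = {accuracy(reader_b, truth):.2f}")
print(f"discordant pairs: b = {b}, c = {c}")
print(f"exact McNemar p = {mcnemar_exact(b, c):.3f}")
```

The test depends only on the discordant pairs, which is why it suits a design in which every reader or model sees the same set of radiographs. With only a handful of discordant pairs, as in this toy example, even a large gap in accuracy may not reach conventional significance.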