As deepfake generation technologies advance rapidly, robust detection mechanisms have become crucial. This study investigates how well state-of-the-art Large Language Models (LLMs) detect deepfake images when the task is framed as binary classification. We evaluate four prominent multimodal LLMs (GPT-4, Gemini 2.5 Pro, DeepSeek R1, and Sonar) on a balanced dataset of 100 real and 100 fake images sourced from the CelebHQ-FM and FFHQ-FM Face Manipulation datasets. Each model labels a facial image as either authentic or manipulated, and we assess performance using confusion matrices and ROC curve analysis. GPT-4 achieves the highest overall performance with 87.5% accuracy and 91.4% precision, followed by Gemini 2.5 Pro (84.0% accuracy), while DeepSeek R1 and Sonar perform comparably at approximately 74% accuracy. This research contributes to the growing field of AI-based forensic analysis by providing empirical evidence of LLMs' potential as deepfake detection tools and by establishing baseline performance metrics for future comparative studies.
Kafle et al. studied this question.
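The accuracy and precision figures reported above are derived from confusion-matrix counts. As a minimal sketch of that calculation, the Python below tallies a confusion matrix from ground-truth labels and model verdicts, then computes the two metrics. It assumes "fake" is the positive class; the names `ConfusionMatrix` and `tally` and the four-image example are hypothetical illustrations, not the study's code or data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for binary classification, with 'fake' assumed as the positive class."""
    tp: int  # fake images labeled fake
    fp: int  # real images labeled fake
    tn: int  # real images labeled real
    fn: int  # fake images labeled real

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0

    @property
    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0


def tally(labels: Sequence[str], predictions: Sequence[str]) -> ConfusionMatrix:
    """Build a confusion matrix from parallel lists of 'real'/'fake' strings."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    tp = fp = tn = fn = 0
    for truth, guess in zip(labels, predictions):
        if guess == "fake":
            if truth == "fake":
                tp += 1
            else:
                fp += 1
        else:
            if truth == "real":
                tn += 1
            else:
                fn += 1
    return ConfusionMatrix(tp=tp, fp=fp, tn=tn, fn=fn)


# Hypothetical toy data (not results from the study).
example_labels = ["fake", "fake", "real", "real"]
example_predictions = ["fake", "real", "real", "fake"]
example_cm = tally(example_labels, example_predictions)
print(f"accuracy={example_cm.accuracy:.3f} precision={example_cm.precision:.3f}")
```

With a balanced 100/100 split like the one described, accuracy is the fraction of all 200 images labeled correctly, while precision considers only the images a model flagged as fake.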