The rapid progress of generative models has made detecting realistic forgeries a critical challenge for security and trust. Existing image-based and frequency-based methods rely on dataset-specific artifacts and therefore generalize poorly, while Vision-Language Model (VLM)-based methods remain limited by coarse prompts and underused cross-modal alignment. To address these issues, we propose a Fine-grained Text-driven Generative Image Detection (FTGID) framework that detects generated images using complementary multi-modal cues. First, we design a Layer-wise Adaptive Global Extractor (LAGE) that stabilizes multi-level global representations through adaptive CLS token fusion with lightweight calibration and parameter-efficient tuning. Second, we propose a Fine-grained Text-guided Local Enhancer (FTLE) that performs patch-level text-visual interaction to better localize forgery-relevant regions. Third, we introduce a High-frequency Artifact Feature Extractor (HAFE) that adaptively captures discriminative high-frequency cues, enabling more reliable detection of subtle generative artifacts. Extensive experiments demonstrate that FTGID consistently outperforms state-of-the-art generative image detection (GID) methods across diverse generative models and unseen datasets, improving both robustness and interpretability in open-world generative image detection. Our code will be made publicly available after the peer review process.
Huang et al. studied this question.
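
To make the idea of adaptive CLS token fusion more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that one CLS token is taken from each of several encoder layers and that the tokens are combined by a softmax-weighted sum over per-layer scores. The names `fuse_cls_tokens` and `layer_logits` are hypothetical, the scores would be learned in practice, and the calibration and parameter-efficient tuning mentioned in the abstract are omitted.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def fuse_cls_tokens(cls_tokens, layer_logits):
    """Fuse per-layer CLS tokens into one global feature vector.

    cls_tokens:   array of shape (num_layers, dim), one CLS token per layer.
    layer_logits: array of shape (num_layers,), hypothetical per-layer scores
                  (assumed learnable in a real model).
    Returns the fused vector of shape (dim,) and the layer weights.
    """
    weights = softmax(layer_logits)   # non-negative, sums to 1
    fused = weights @ cls_tokens      # weighted sum across layers
    return fused, weights


# Toy example: 4 layers, 8-dimensional CLS tokens, equal layer scores.
rng = np.random.default_rng(0)
cls_tokens = rng.normal(size=(4, 8))
layer_logits = np.zeros(4)
fused, weights = fuse_cls_tokens(cls_tokens, layer_logits)
```

With equal scores the fusion reduces to a plain average of the layer-wise CLS tokens. Unequal scores let the model emphasize whichever layers carry the most forgery-relevant global information.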