October 5, 2025 | Open Access

Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability

Key Points

GPT-4o fails to consistently apply domain knowledge during image generation and editing tasks.
Evaluation metrics indicate persistent limitations, especially in conditional reasoning and instruction fidelity.
The systematic study across three dimensions reveals significant gaps in GPT-4o's multimodal generation capabilities.
Findings suggest a need for enhanced benchmarks and training strategies to improve context-aware generation.

Abstract

OpenAI's multimodal GPT-4o has demonstrated remarkable capabilities in image generation and editing, yet its ability to achieve world knowledge-informed semantic synthesis--seamlessly integrating domain knowledge, contextual reasoning, and instruction adherence--remains unproven. In this study, we systematically evaluate these capabilities across three critical dimensions: (1) Global Instruction Adherence, (2) Fine-Grained Editing Precision, and (3) Post-Generation Reasoning. While existing benchmarks highlight GPT-4o's strong capabilities in image generation and editing, our evaluation reveals GPT-4o's persistent limitations: the model frequently defaults to literal interpretations of instructions, inconsistently applies knowledge constraints, and struggles with conditional reasoning tasks. These findings challenge prevailing assumptions about GPT-4o's unified understanding and generation capabilities, exposing significant gaps in its dynamic knowledge integration. Our study calls for the development of more robust benchmarks and training strategies that go beyond surface-level alignment, emphasizing context-aware and reasoning-grounded multimodal generation.
