What type of study is this?

This is a Literature Review study.

August 19, 2025 | Open Access

Generative and Discriminative Models in Multimodal AI: An Analysis of Vision-Language Tasks

Key Points

Generative and discriminative models are the two main modeling paradigms for vision-language tasks, and their approaches and architectures have evolved along different paths.
Generative models, like GPT, enable flexible output generation, while discriminative models, such as BERT, focus on classification.
The shift from distributed paradigms to unified models raises challenges such as model hallucination, limited evaluation capability, and scalability.
Insights from this analysis can guide researchers and practitioners in addressing scalability and evaluation limitations.

Abstract

The transformer architecture has triggered groundbreaking work in multimodal vision and language (V+L). This article offers a brief look at the two main modeling paradigms, generative and discriminative, tracing them to their roots in natural language processing (NLP): the generative pre-trained transformer (GPT) and bidirectional encoder representations from transformers (BERT), respectively. The core ideas of the two paradigms are then examined to show how they have been adapted to V+L tasks, resulting in different architectural paths and pre-training methods. The paradigms are also surveyed along core dimensions, with an analysis of the challenges on the path from distributed paradigms to unified models (e.g., model hallucination, limited evaluation capability, and scalability). This work aims to give researchers and practitioners a clear, well-organized view of how V+L modeling has evolved and where it may be heading.

Authors

F. He

Journals

Transactions on Computer Science and Intelligent Systems Research
