The transformer architecture has triggered groundbreaking work in multimodal vision and language (V+L). This article offers a brief look into the two main modeling paradigms, generative and discriminative, starting from their roots in natural language processing (NLP): the generative pre-trained transformer (GPT) and bidirectional encoder representations from transformers (BERT), respectively. The core ideas of these two paradigms are then examined to show how they have been modified to handle V+L tasks, resulting in different architectural paths and pre-training methods. The paradigms are also surveyed along core dimensions, with an analysis of the challenges on the path from distributed paradigms to unified models (e.g., model hallucination, limited evaluation capability, and scalability). This work aims to give researchers and practitioners a well-organized, clear view of how V+L modeling has evolved and where it may be heading.
F. He
Transactions on Computer Science and Intelligent Systems Research
F. He (Tue,) studied this question.
www.synapsesocial.com/papers/68af55d1ad7bf08b1eadc307 — DOI: https://doi.org/10.62051/hdjsgp39