August 28, 2024 | Open Access

LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models

Abstract

Recent large vision-language models (VLMs) typically employ vision encoders based on the Vision Transformer (ViT) architecture. Because ViT divides each image into patches, its perception is fragmented, which hinders the visual understanding capabilities of VLMs. In this paper, we address this limitation by introducing a Scene Graph Expression (SGE) module into VLMs. This module extracts and structurally expresses the complex semantic information within images, thereby improving the foundational perception and understanding abilities of VLMs. Extensive experiments demonstrate that integrating our SGE module significantly enhances VLM performance on vision-language tasks, indicating its effectiveness in preserving intricate semantic details and facilitating better visual understanding.

Cite this study

Wang et al. (2024) studied this question.

www.synapsesocial.com/papers/68e5aa67b6db643587544b15 — DOI: https://doi.org/10.48550/arxiv.2408.16224

Authors

Jingyi Wang

Jianzhong Ju

Jian Luan
