March 3, 2024 | Open Access

InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding


Abstract

Multimodal Large Language Models (MLLMs) have advanced significantly in recent years. Nevertheless, challenges persist in accurately recognizing and comprehending intricate details within high-resolution images. Although this capability is indispensable for developing robust MLLMs, the area remains underinvestigated. To tackle this challenge, our work introduces InfiMM-HD, a novel architecture specifically designed for processing images of different resolutions with low computational overhead. This design makes it practical to scale MLLMs to higher input resolutions. InfiMM-HD incorporates a cross-attention module and visual windows to reduce computation costs. By integrating this architectural design with a four-stage training pipeline, our model attains improved visual perception efficiently and cost-effectively. An empirical study underscores the robustness and effectiveness of InfiMM-HD, opening new avenues for exploration in related areas. Code and models are available at https://huggingface.co/Infi-MM/infimm-hd
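
To give a rough sense of why visual windows combined with cross-attention can keep costs down, the sketch below counts windows and pairwise attention scores for a few image sizes. It is a back-of-the-envelope illustration, not the authors' implementation: the window size, tokens per window, prompt length, and the helper names `count_windows` and `attention_score_counts` are all assumptions chosen for the example.

```python
import math


def count_windows(height, width, window):
    """Number of window x window tiles needed to cover an image (edges padded)."""
    return math.ceil(height / window) * math.ceil(width / window)


def attention_score_counts(text_tokens, visual_tokens):
    """Pairwise attention scores per layer and head for two fusion styles.

    Returns (concat, cross):
      concat: self-attention over text and visual tokens in one sequence.
      cross:  text self-attention plus text-to-visual cross-attention.
    """
    concat = (text_tokens + visual_tokens) ** 2
    cross = text_tokens * text_tokens + text_tokens * visual_tokens
    return concat, cross


# Hypothetical values for illustration only; none are taken from the paper.
WINDOW = 448             # assumed window side length in pixels
TOKENS_PER_WINDOW = 256  # assumed visual tokens produced per window
TEXT_TOKENS = 128        # assumed prompt length

for side in (448, 896, 1344):
    n_win = count_windows(side, side, WINDOW)
    visual = n_win * TOKENS_PER_WINDOW
    concat, cross = attention_score_counts(TEXT_TOKENS, visual)
    print(f"{side}x{side}: {n_win} windows, {visual} visual tokens, "
          f"concat={concat:,} scores, cross={cross:,} scores")
```

Under these assumptions the concatenated sequence grows quadratically with the number of visual tokens, while the cross-attention count grows only linearly in them, which is one plausible reading of the efficiency claim in the abstract.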


Cite This Study

Liu et al. (2024) studied this question.

synapsesocial.com/papers/68e75efdb6db6435876d60a0 https://doi.org/10.48550/arxiv.2403.01487
