What question did this study set out to answer?

The aim is to improve fine-grained reasoning in biomedical AI through enhanced multimodal models.

April 19, 2026

From Image to Pixels: towards Fine-Grained Medical Vision-Language Models

Key Points

The aim is to improve fine-grained reasoning in biomedical AI through enhanced multimodal models.
Developed MeCoVQA as a visual-language benchmark covering eight medical imaging modalities.
Introduced MedPLIB, an end-to-end biomedical MLLM focusing on pixel-level visual understanding.
Implemented a Mixture-of-Experts architecture for specialized task optimization.
Utilized retrieval-augmented generation and in-context learning for better generalization.
MedPLIB sets a new state of the art on biomedical vision-language tasks.
It outperforms the best existing small and large models by 19.7 and 15.6 mDice, respectively, in zero-shot pixel-level grounding.
Demonstrated robust generalization abilities on out-of-distribution medical image segmentation tasks.

Abstract

Multimodal large language models (MLLMs) offer immense potential for biomedical AI, yet current applications remain limited to coarse-grained image understanding and basic textual queries, falling short of the fine-grained reasoning required in clinical contexts. In this work, we present a comprehensive solution spanning data, model, and training innovations to advance pixel-level multimodal intelligence in biomedicine. First, we construct MeCoVQA, a new visual-language benchmark that spans eight medical imaging modalities and four core tasks, supporting both spatially grounded reasoning and fine-grained diagnostic comprehension. Building on this, we introduce MedPLIB, an end-to-end biomedical MLLM equipped with pixel-level visual understanding. MedPLIB supports diverse multimodal tasks (including VQA, point- and region-based querying, grounding, and segmentation) through unified modeling. To further accommodate the heterogeneous nature of biomedical tasks, we design a task-specialized Mixture-of-Experts (MoE) architecture, where each expert is tailored to a specific task and jointly optimized via unified fine-tuning. This modular design accommodates diverse biomedical tasks while maintaining a unified and efficient architecture. By integrating retrieval-augmented generation (RAG) and in-context learning (ICL), MedPLIB also demonstrates strong generalization on out-of-distribution (OOD) medical image segmentation. Experiments across multiple benchmarks show that MedPLIB sets a new state of the art on biomedical vision-language tasks; notably, it outperforms the best existing small and large models by 19.7 and 15.6 mDice, respectively, in zero-shot pixel-level grounding, highlighting its clinical utility and generalization strength. Code and data are publicly available on GitHub: https://github.com/ShawnHuang497/MedPLIB.
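
For readers unfamiliar with the mDice metric quoted above, the following is a minimal sketch of how a mean Dice score is commonly computed over binary segmentation masks. It is not taken from the MedPLIB code base: the function names `dice` and `mean_dice`, the list-of-rows mask representation, and the convention that two empty masks score 1.0 are all assumptions made for illustration. The abstract's figures appear to be on a 0-100 scale, which would correspond to multiplying the value below by 100; that reading is also an assumption.

```python
from typing import Sequence

# A binary mask as rows of 0/1 pixel labels (illustrative representation).
Mask = Sequence[Sequence[int]]


def dice(pred: Mask, target: Mask) -> float:
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for two same-shape binary masks."""
    # Count pixels that are foreground in both masks.
    intersection = sum(
        p & t
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    # Total foreground pixels across both masks.
    total = sum(map(sum, pred)) + sum(map(sum, target))
    if total == 0:
        # Both masks are empty; scoring this as 1.0 is an assumed convention.
        return 1.0
    return 2.0 * intersection / total


def mean_dice(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average the per-sample Dice scores; the result lies in [0, 1]."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal in length")
    scores = [dice(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[0, 1], [1, 1]]]
    targets = [[[1, 0], [0, 0]], [[0, 1], [1, 1]]]
    print(f"mean Dice: {mean_dice(preds, targets):.4f}")
```

Under this definition, a gain of 19.7 mDice on a 0-100 scale would mean the average overlap score rose by 0.197 in absolute terms.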
