Multimodal large language models (MLLMs) offer immense potential for biomedical AI, yet current applications remain limited to coarse-grained image understanding and basic textual queries, falling short of the fine-grained reasoning required in clinical contexts. In this work, we present a comprehensive solution spanning data, model, and training innovations to advance pixel-level multimodal intelligence in biomedicine. First, we construct MeCoVQA, a new vision-language benchmark that spans eight medical imaging modalities and four core tasks, supporting both spatially grounded reasoning and fine-grained diagnostic comprehension. Building on this, we introduce MedPLIB, an end-to-end biomedical MLLM equipped with pixel-level visual understanding. MedPLIB supports diverse multimodal tasks (VQA, point- and region-based querying, grounding, and segmentation) through unified modeling. To accommodate the heterogeneous nature of biomedical tasks, we design a task-specialized Mixture-of-Experts (MoE) architecture in which each expert is tailored to a specific task and all experts are jointly optimized via unified fine-tuning. This modular design handles diverse biomedical tasks while keeping the architecture unified and efficient. By integrating retrieval-augmented generation (RAG) and in-context learning (ICL), MedPLIB also demonstrates strong generalization on out-of-distribution (OOD) medical image segmentation. Experiments across multiple benchmarks show that MedPLIB sets a new state of the art on biomedical vision-language tasks; notably, it outperforms the best existing small and large models by 19.7 and 15.6 mDice, respectively, in zero-shot pixel-level grounding, highlighting its clinical utility and generalization strength. Code and data are publicly available on GitHub: https://github.com/ShawnHuang497/MedPLIB.
Shen et al. studied this question.