Recent advances in large language models (LLMs) and vision-language models (LVLMs) have shown promise across many tasks, yet their scientific reasoning capabilities remain untested, particularly in multimodal settings. We present MMSciBench, a benchmark for evaluating mathematical and physical reasoning through text-only and text-image formats, with human-annotated difficulty levels, solutions with detailed explanations, and taxonomic mappings. Evaluation of state-of-the-art models reveals significant limitations: even the best model achieves only 63.77% accuracy and struggles particularly with visual reasoning tasks. Our analysis exposes critical gaps in complex reasoning and visual-textual integration, establishing MMSciBench as a rigorous standard for measuring progress in multimodal scientific understanding. The code for MMSciBench is open-sourced on GitHub, and the dataset is available on Hugging Face.
Xinwu Ye
Chengfan Li
Siming Chen
Ye et al. studied this question.
www.synapsesocial.com/papers/68e6a0f4718ef0a556b33d66 — DOI: https://doi.org/10.48550/arxiv.2503.01891