October 8, 2025 | Open Access

MMSciBench: Benchmarking Language Models on Chinese Multimodal Scientific Problems

Key Points

On MMSciBench, even the best-performing model reached only 63.77% accuracy, indicating significant limitations in scientific reasoning.
Leading models struggled most with visual reasoning tasks, exposing gaps in visual-textual integration.
The analysis positions MMSciBench as a rigorous benchmark for measuring progress in multimodal scientific reasoning.
With open-source code and a publicly available dataset, MMSciBench aims to support better evaluation methodology in the field.

Abstract

Recent advances in large language models (LLMs) and vision-language models (LVLMs) have shown promise across many tasks, yet their scientific reasoning capabilities remain untested, particularly in multimodal settings. We present MMSciBench, a benchmark for evaluating mathematical and physical reasoning through text-only and text-image formats, with human-annotated difficulty levels, solutions with detailed explanations, and taxonomic mappings. Evaluation of state-of-the-art models reveals significant limitations, with even the best model achieving only 63.77% accuracy and particularly struggling with visual reasoning tasks. Our analysis exposes critical gaps in complex reasoning and visual-textual integration, establishing MMSciBench as a rigorous standard for measuring progress in multimodal scientific understanding. The code for MMSciBench is open-sourced at GitHub, and the dataset is available at Hugging Face.


Cite This Study

Ye et al. studied this question.

synapsesocial.com/papers/68e6a0f4718ef0a556b33d66 https://doi.org/10.48550/arxiv.2503.01891
