Integrating Multimodal Large Language Models (MLLMs) into K–12 science education is promising, yet existing evaluations mainly focus on answer accuracy and overlook explanation quality, cognitive appropriateness, and response efficiency. This study systematically evaluates ten state-of-the-art MLLMs to examine whether they can answer correctly, explain pedagogically, and operate efficiently in K–12 science contexts. Using the ScienceQA benchmark, we conduct a zero-shot chain-of-thought evaluation across three dimensions: accuracy, Pedagogical Explanation Score, and reasoning latency, with results interpreted through cognitive load theory. The findings reveal a “visual redundancy paradox”: adding images can degrade performance by increasing extraneous cognitive load. Although most models generate fluent explanations, they still lack adaptive scaffolding for learners. Fine-tuning experiments further show that the locally deployed Qwen2-VL model achieves substantial accuracy gains while maintaining low latency, indicating the value of lightweight domain adaptation for real-time educational applications. Based on these results, we propose a latency-aware model selection strategy: locally deployed Qwen2-VL and low-latency API services are suitable for high-frequency formative assessment, while higher-performing flagship models are better reserved for deep inquiry or batch-style instructional support.
Jiang et al. (Mon,) studied this question.
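The latency-aware selection strategy described in the abstract can be illustrated with a minimal sketch: given a latency budget for a usage scenario, pick the most accurate model that fits within it. Everything below is hypothetical and for illustration only. The `ModelProfile` type, the `select_model` function, the model names, and the accuracy and latency figures are placeholders, not the study's implementation or its measured results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical summary of one candidate model."""
    name: str
    accuracy: float   # fraction of correct answers (placeholder value)
    latency_s: float  # mean seconds per response (placeholder value)


def select_model(candidates: Sequence[ModelProfile],
                 latency_budget_s: float) -> ModelProfile:
    """Return the most accurate candidate whose latency fits the budget.

    Ties on accuracy are broken in favour of the faster model.
    """
    eligible = [m for m in candidates if m.latency_s <= latency_budget_s]
    if not eligible:
        raise ValueError(
            f"no candidate responds within {latency_budget_s} s"
        )
    return max(eligible, key=lambda m: (m.accuracy, -m.latency_s))


# Placeholder profiles: invented numbers, NOT results from the study.
PROFILES = [
    ModelProfile("local-finetuned", accuracy=0.80, latency_s=1.0),
    ModelProfile("fast-api", accuracy=0.78, latency_s=2.0),
    ModelProfile("flagship", accuracy=0.90, latency_s=12.0),
]

# Tight budget, e.g. high-frequency formative assessment.
formative_choice = select_model(PROFILES, latency_budget_s=3.0)

# Loose budget, e.g. batch-style instructional support.
batch_choice = select_model(PROFILES, latency_budget_s=60.0)
```

With these placeholder profiles, the tight budget selects the low-latency local model and the loose budget selects the flagship model, mirroring the split between formative assessment and deep inquiry. A real deployment would replace the invented figures with measured accuracy and latency, and might add further criteria such as the Pedagogical Explanation Score.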