What question did this study set out to answer?

This study evaluates the effectiveness of multimodal large language models in K-12 science education by examining their accuracy, explanation quality, and operational efficiency.

June 18, 2026

Beyond accuracy: cognitive, pedagogical, and practical readiness of multimodal LLMs for K-12 science education

Key Points

Evaluated ten state-of-the-art MLLMs using the ScienceQA benchmark in a zero-shot chain-of-thought format.
Assessed models across three dimensions: accuracy, Pedagogical Explanation Score, and reasoning latency.
Conducted fine-tuning experiments on the Qwen2-VL model to measure performance improvements.
Most models generated fluent explanations but lacked adaptive scaffolding for learners.
After fine-tuning, the locally deployed Qwen2-VL achieved substantial accuracy gains while keeping latency low, making it suitable for real-time educational use.
Revealed a visual redundancy paradox where added images may increase cognitive load and degrade performance.

Abstract

Integrating Multimodal Large Language Models (MLLMs) into K–12 science education is promising, yet existing evaluations mainly focus on answer accuracy and overlook explanation quality, cognitive appropriateness, and response efficiency. This study systematically evaluates ten state-of-the-art MLLMs to examine whether they can answer correctly, explain pedagogically, and operate efficiently in K–12 science contexts. Using the ScienceQA benchmark, we conduct a zero-shot chain-of-thought evaluation across three dimensions: accuracy, Pedagogical Explanation Score, and reasoning latency, with results interpreted through cognitive load theory. The findings reveal a “visual redundancy paradox”: adding images can degrade performance by increasing extraneous cognitive load. Although most models generate fluent explanations, they still lack adaptive scaffolding for learners. Fine-tuning experiments further show that the locally deployed Qwen2-VL model achieves substantial accuracy gains while maintaining low latency, indicating the value of lightweight domain adaptation for real-time educational applications. Based on these results, we propose a latency-aware model selection strategy: locally deployed Qwen2-VL and low-latency API services are suitable for high-frequency formative assessment, while higher-performing flagship models are better reserved for deep inquiry or batch-style instructional support.
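The latency-aware selection strategy described above can be illustrated with a minimal sketch: pick the most accurate model whose typical response time fits the latency budget of the task. Everything in it is an assumption made for illustration. The `ModelProfile` type, the `select_model` function, the model labels, and all accuracy and latency numbers are hypothetical placeholders, not values reported by the study.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical summary of one candidate model."""
    name: str
    accuracy: float   # fraction of benchmark items answered correctly (placeholder)
    latency_s: float  # mean seconds per response (placeholder)


def select_model(candidates: Sequence[ModelProfile],
                 latency_budget_s: float) -> Optional[ModelProfile]:
    """Return the most accurate candidate within the latency budget.

    Ties on accuracy are broken in favour of the faster model.
    Returns None when no candidate fits the budget.
    """
    eligible = [m for m in candidates if m.latency_s <= latency_budget_s]
    if not eligible:
        return None
    return max(eligible, key=lambda m: (m.accuracy, -m.latency_s))


# Placeholder profiles: illustrative numbers only, NOT results from the study.
CANDIDATES = [
    ModelProfile("local-finetuned", accuracy=0.90, latency_s=1.0),
    ModelProfile("low-latency-api", accuracy=0.85, latency_s=2.0),
    ModelProfile("flagship", accuracy=0.95, latency_s=12.0),
]

# Tight budget, as in high-frequency formative assessment.
quick = select_model(CANDIDATES, latency_budget_s=3.0)

# Loose budget, as in deep inquiry or batch-style support.
deep = select_model(CANDIDATES, latency_budget_s=30.0)
```

With these placeholder profiles, the tight budget excludes the flagship model and the loose budget admits it, which mirrors the split between real-time and batch-style use described in the abstract. A real deployment would replace the placeholder numbers with measured accuracy and latency for each candidate.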

Cite This Study

Jiang et al. studied this question.

synapsesocial.com/papers/6a338d85630953a74978e701 https://doi.org/10.1080/10494820.2026.2679730
