Abstract
As Large Language Models (LLMs) become increasingly prevalent in science education, it is important to understand their capabilities compared to human learners with respect to authentic learning tasks. Such understanding is crucial for designing AI-resilient assessments and developing AI tutors that can guide students in problem solving. Using standardized assessments as benchmarks allows these comparisons to be based on widely accepted educational criteria. To date, most educational benchmarks have been developed and evaluated in English, with other languages receiving far less attention. The present study addresses this gap by introducing the first Hebrew science education benchmark, based on the national high-school matriculation exam in chemistry. We evaluated three LLMs – ChatGPT 4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro – on 120 multiple-choice questions and compared their performance to that of over 139,000 high-school students. We found that all three LLMs significantly underperformed relative to human learners. To investigate characteristics that render questions more challenging for LLMs, we conducted a regression analysis and found that visual elements and multi-step reasoning tasks negatively impacted their performance. Finally, chemistry education experts analyzed the items that were most difficult for LLMs and characterized their domain-specific failures. This study makes three contributions: (1) it extends LLM evaluation to an underrepresented linguistic context; (2) it advances the methodological landscape of LLM benchmarking by directly comparing multiple models with human students on authentic, curriculum-aligned national examinations; and (3) it provides a mixed-methods analysis of LLM performance, offering a more educationally grounded characterization of current model capabilities.
Elad Yacobson
Yael Schleifer
Ziva Bar-Dov
Journal of Science Education and Technology
Yacobson et al. studied this question.
www.synapsesocial.com/papers/69c9c5a4f8fdd13afe0bd92c — DOI: https://doi.org/10.1007/s10956-026-10310-y