What question did this study set out to answer?

This research compares the performance of Large Language Models with that of high-school students on chemistry exams.

March 30, 2026 · Open Access

Benchmarking AI on Standard Chemistry Exams: LLMs Still Underperform Compared to High School Students

Key Points

Evaluated three LLMs on standardized multiple-choice chemistry questions.
Conducted a regression analysis to identify the question characteristics that make items more challenging for LLMs.
Analyzed exam items with chemistry education experts to characterize LLM failures.
LLMs significantly underperformed compared to over 139,000 high-school students.
Visual elements and multi-step reasoning tasks were identified as challenging for LLMs.

Abstract

As Large Language Models (LLMs) become increasingly prevalent in science education, it is important to understand their capabilities compared to human learners with respect to authentic learning tasks. Such understanding is crucial for designing AI-resilient assessments and developing AI tutors that can guide students in problem solving. Using standardized assessments as benchmarks allows these comparisons to be based on widely accepted educational criteria. To date, most educational benchmarks have been developed and evaluated in English, with other languages receiving far less attention. The present study addresses this gap by introducing the first Hebrew science education benchmark, based on the national high-school matriculation exam in chemistry. We evaluated three LLMs – ChatGPT 4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro – on 120 multiple-choice questions and compared their performance to that of over 139,000 high-school students. We found that all three LLMs significantly underperformed relative to human learners. To investigate characteristics that render questions more challenging for LLMs, we conducted a regression analysis and found that visual elements and multi-step reasoning tasks negatively impacted their performance. Finally, chemistry education experts analyzed the items that were most difficult for LLMs and characterized their domain-specific failures. This study makes three contributions: (1) it extends LLM evaluation to an underrepresented linguistic context; (2) it advances the methodological landscape of LLM benchmarking by directly comparing multiple models with human students on authentic, curriculum-aligned national examinations; and (3) it provides a mixed-methods analysis of LLM performance, offering a more educationally grounded characterization of current model capabilities.
