What question did this study set out to answer?

The aim was to evaluate the performance of the AI model Claude Sonnet 4 on a standardized physics test.

April 27, 2026 | Open Access

Testing the capability and performance of Claude Sonnet 4 on the GRE physics test

Key Points

Claude Sonnet 4 was tested on a 70-question GRE Physics practice exam.
Each question was submitted to the model individually.
Responses were scored using the official answer key from ETS.
Claude Sonnet 4 answered 65 of 70 questions correctly.
This corresponds to a scaled score of 990, the maximum possible score.
Results indicate strong capabilities in scientific problem-solving tasks.

Abstract

As artificial intelligence (AI) has become increasingly prevalent in K–12 and post-secondary education, we sought to benchmark the ability of large language models (LLMs) to solve complex problems and demonstrate critical thinking. In this manuscript, we tested the performance of Claude Sonnet 4, an LLM developed by Anthropic, on a publicly available 70-question GRE Physics practice examination. We hypothesized that Claude would correctly answer more than 90% of the questions. To test this hypothesis, each question was submitted individually to the model, and responses were recorded and scored using the official answer key provided by the Educational Testing Service (ETS). Claude answered 65 of 70 questions correctly (approximately 92.9%), exceeding the hypothesized 90% threshold and corresponding to a scaled score of 990, the maximum possible score. These results suggest strong performance of large language models in complex scientific problem-solving tasks.
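The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the dictionary layout, and the toy answer key below are assumptions made for illustration only. The sketch compares recorded responses against an answer key, then checks the reported raw score (65 of 70) against the hypothesized 90% threshold. It does not attempt the conversion from raw score to scaled score.

```python
from typing import Mapping


def score_responses(responses: Mapping[int, str], answer_key: Mapping[int, str]) -> int:
    """Count the questions whose recorded response matches the answer key.

    Both mappings go from question number to answer letter (hypothetical layout).
    A question with no recorded response counts as incorrect.
    """
    return sum(
        1
        for question, correct in answer_key.items()
        if responses.get(question, "").strip().upper() == correct.strip().upper()
    )


def accuracy(num_correct: int, num_questions: int) -> float:
    """Return the fraction of questions answered correctly."""
    if num_questions <= 0:
        raise ValueError("num_questions must be positive")
    return num_correct / num_questions


# Toy data for illustration only; these are not the real exam answers.
toy_key = {1: "A", 2: "C", 3: "E"}
toy_responses = {1: "a", 2: "B", 3: "E"}
toy_correct = score_responses(toy_responses, toy_key)

# Raw score reported in the abstract: 65 correct out of 70 questions.
reported_accuracy = accuracy(65, 70)
hypothesis_supported = reported_accuracy > 0.90

print(f"Toy example: {toy_correct} of {len(toy_key)} correct")
print(f"Reported accuracy: {reported_accuracy:.1%}")
print(f"Exceeds 90% threshold: {hypothesis_supported}")
```

Normalizing case and whitespace before comparison is a design choice for this sketch; the source does not say how answer letters were extracted from the model's responses.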
