As artificial intelligence (AI) becomes increasingly prevalent in K–12 and post-secondary education, we sought to benchmark the ability of large language models (LLMs) to solve complex problems and demonstrate critical thinking. We tested the performance of Claude Sonnet 4, an LLM developed by Anthropic, on a publicly available 70-question GRE Physics practice examination. We hypothesized that Claude would correctly answer more than 90% of the questions. To test this, we submitted each question individually to the model, then recorded the responses and scored them using the official answer key provided by the Educational Testing Service (ETS). Claude answered 65 of 70 questions correctly (92.9%), supporting our hypothesis and corresponding to a scaled score of 990, the maximum possible score. These results suggest that LLMs can perform strongly on complex scientific problem-solving tasks.
Theodore Hamilton studied this question.