April 8, 2026 · Open Access

PIR-Bench: A Reproducible Benchmark Suite for Automated Physical Law Discovery — Version 3

Key Points

The study aims to establish a comprehensive benchmark suite for evaluating methods that discover physical laws.
The suite integrates 20 physics discovery tasks across five difficulty levels (L1–L5).
Sweeps are parameterized through YAML configuration, and experiments are tracked in SQLite.
The suite defines a formal Discovery Confidence Score (DCS) and generates automated paper-ready reports.
All 20 tasks reach a 100% discovery rate (DR) using the OT hybrid loss with a flow matching prior.
The mean DCS is 0.818 at n=200 and noise=0.01 over 5 seeds.
For the four robot Jacobian tasks (J11–J22), 100% DR is confirmed at every tested noise level, σ ∈ {0.0, 0.01, 0.05, 0.10, 0.20}, as of Version 3.1.
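
As a rough illustration of how SQLite tracking can feed the discovery rate (DR) and mean DCS figures, the sketch below stores one row per run and aggregates per task. The table name, column names, task names, and rows are hypothetical placeholders chosen for this example; they are not the PIR-Bench schema and not results from the paper.

```python
import sqlite3

# Hypothetical schema: the real PIR-Bench tables and columns are not given in the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE runs (
        task       TEXT    NOT NULL,
        seed       INTEGER NOT NULL,
        n_samples  INTEGER NOT NULL,
        noise      REAL    NOT NULL,
        discovered INTEGER NOT NULL,  -- 1 if the target law was recovered, else 0
        dcs        REAL    NOT NULL   -- Discovery Confidence Score for this run
    )
    """
)

# Made-up rows for illustration only.
rows = [
    ("task_a", 0, 200, 0.01, 1, 0.80),
    ("task_a", 1, 200, 0.01, 1, 0.90),
    ("task_b", 0, 200, 0.01, 1, 0.70),
    ("task_b", 1, 200, 0.01, 0, 0.30),
]
conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?)", rows)


def summarize(connection):
    """Return per-task discovery rate, mean DCS, and run count."""
    query = """
        SELECT task, AVG(discovered), AVG(dcs), COUNT(*)
        FROM runs
        GROUP BY task
        ORDER BY task
    """
    return {
        task: {"dr": dr, "mean_dcs": mean_dcs, "n_runs": n_runs}
        for task, dr, mean_dcs, n_runs in connection.execute(query)
    }


summary = summarize(conn)
for task, stats in summary.items():
    print(task, stats)
```

Keeping one row per run means that mixing incompatible run sets shows up directly in the run count per task, which is one way to spot an aggregate that no longer matches a single sweep.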

Abstract

PIR-Bench is an integrated, reproducible benchmark suite of 20+ physics discovery tasks across five difficulty levels, providing parameterized YAML-configured sweeps, SQLite experiment tracking, a formal Discovery Confidence Score, and automated paper-ready reporting for systematic evaluation of physical law discovery methods. Version 3 expands PIR-Bench from 9 to 20 tasks across five difficulty levels (L1–L5), with 100% discovery rate confirmed for all tasks using the OT hybrid loss with flow matching prior.

New tasks added (Version 3):
L3: harmonic_oscillator_lagrangian, harmonic_oscillator_structured_lagrangian, harmonic_oscillator_xdot, harmonic_oscillator_vdot
L4: planar_robotfk_x, planar_robotfk_y, double_pendulum_theta1dot, double_pendulum_theta2dot
L5: planar_robot_j11, planar_robot_j12, planar_robot_j21, planar_robot_j22 (full Jacobian matrix)

Benchmark results:
All 20 tasks: 100% DR, mean DCS 0.818 (n=200, noise=0.01, 5 seeds)
Tables 5–6 (noise/size robustness): 100% DR confirmed for 4-task core suite across σ ∈ {0.001, 0.01, 0.05, 0.10, 0.20} and n ∈ {50, 100, 200, 500, 1000}
Robot Jacobian robustness: 100% DR at σ ∈ {0.0, 0.01, 0.05}, 15 runs per task (13 hours total runtime). High-noise sweep (σ = 0.10, 0.20) is future work
Mean model fit confidence for Jacobian tasks: 0.971–0.973 across J11–J22

Triple-stack clarification: The score-based diffusion and JEPA components are implemented in the PIR codebase but are not activated in any benchmark run reported in this paper. All tables reflect the OT+flow-prior configuration (--hybrid-ot --use-flow-prior). The Diffusion+JEPA ablation will be reported in the PIR-JEPA paper.

Infrastructure note: The read_results.py reporter aggregates all JSON artifacts in the results/ directory. Stale artifacts from prior runs must be archived before running the reporter to avoid spurious DR reductions from incompatible run mixing. The benchmark runner summary is the authoritative source for any individual run set.
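
The archiving step in the infrastructure note could be scripted along the following lines. This is a minimal sketch, not part of PIR-Bench: the helper name `archive_stale_artifacts`, the `results_archive` folder, and the choice to move only top-level `*.json` files are assumptions made for illustration.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def archive_stale_artifacts(results_dir, archive_root=None):
    """Move top-level *.json files out of results_dir into a timestamped folder.

    Hypothetical helper: names and layout are assumptions, not the PIR-Bench API.
    Returns the list of archived file paths (empty if nothing was moved).
    """
    results_dir = Path(results_dir)
    if archive_root is None:
        # Assumed default: a sibling folder next to the results directory.
        archive_root = results_dir.parent / "results_archive"
    archive_root = Path(archive_root)

    stale = sorted(p for p in results_dir.glob("*.json") if p.is_file())
    if not stale:
        return []

    # One folder per archiving call, named by UTC timestamp.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = archive_root / stamp
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in stale:
        destination = target / path.name
        shutil.move(str(path), str(destination))
        moved.append(destination)
    return moved
```

Calling `archive_stale_artifacts("results")` before a new sweep would leave the results directory free of older JSON files, so a reporter that aggregates everything in that directory only sees the current run set. Non-JSON files are left in place.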
Reproducibility:

    python rundiscoverybenchmark.py --hybrid-ot --experiments all \
        --base-seed 0 --repeats 5 --dataset-sizes 200 --noise-levels 0.01 \
        --alpha 0.7 --beta 0.3 --output-dir results_v3

Robot Jacobian:

    python rundiscoverybenchmark.py --hybrid-ot --no-dim-filter \
        --experiments planar_robot_j11 planar_robot_j12 \
        planar_robot_j21 planar_robot_j22 \
        --dataset-sizes 200 --repeats 5

Version history:
V1 (2026-03-22): 9 tasks, Kepler 20%, gravity 0%.
V2 (2026-03-26): 9/9 tasks, Fix A–C, numerical equivalence criterion.
V3 (2026-04-05): 20/20 tasks, Jacobian + Lagrangian tasks added, triple-stack status clarified.

Version 3.1 (April 2026): Robot Jacobian noise robustness extended to full sweep σ ∈ {0.0, 0.01, 0.05, 0.10, 0.20}. 100% DR confirmed at all noise levels for all four Jacobian entries (J11–J22). Table 7 regenerated with per-noise-level DCS and MAE. Limitations section updated. No other changes from v3.
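
One way to script the Version 3.1 Jacobian noise sweep is to issue one runner invocation per noise level. The sketch below only builds and prints the command lines. It reuses flags that appear in the reproducibility commands, but combining `--no-dim-filter` with `--noise-levels` in a single call is an assumption, as is running one noise level per invocation; the script name is copied as printed and the task identifiers are written with underscores.

```python
import shlex

# Script name as printed in the reproducibility commands.
RUNNER = "rundiscoverybenchmark.py"

JACOBIAN_TASKS = [
    "planar_robot_j11",
    "planar_robot_j12",
    "planar_robot_j21",
    "planar_robot_j22",
]
NOISE_LEVELS = [0.0, 0.01, 0.05, 0.10, 0.20]
REPEATS = 5


def build_sweep_commands(tasks, noise_levels, dataset_size=200, repeats=REPEATS):
    """Return one argument list per noise level (assumed invocation layout)."""
    commands = []
    for sigma in noise_levels:
        commands.append([
            "python", RUNNER, "--hybrid-ot", "--no-dim-filter",
            "--experiments", *tasks,
            "--dataset-sizes", str(dataset_size),
            "--noise-levels", str(sigma),
            "--repeats", str(repeats),
        ])
    return commands


commands = build_sweep_commands(JACOBIAN_TASKS, NOISE_LEVELS)

# Each task is run `repeats` times at every noise level.
runs_per_task = len(NOISE_LEVELS) * REPEATS

for cmd in commands:
    print(shlex.join(cmd))
```

Building argument lists rather than shell strings keeps quoting explicit; each list could be handed to `subprocess.run(cmd, check=True)` once the flag combination has been checked against the actual runner.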
