PIR-Bench is an integrated, reproducible benchmark suite of 20+ physics discovery tasks across five difficulty levels. It provides parameterized YAML-configured sweeps, SQLite experiment tracking, a formal Discovery Confidence Score (DCS), and automated paper-ready reporting for systematic evaluation of physical law discovery methods.

Version 3 expands PIR-Bench from 9 to 20 tasks across five difficulty levels (L1–L5), with a 100% discovery rate (DR) confirmed for all tasks using the OT hybrid loss with a flow matching prior.

New tasks added (Version 3):
  L3: harmonic_oscillator_lagrangian, harmonic_oscillator_structured_lagrangian, harmonic_oscillator_xdot, harmonic_oscillator_vdot
  L4: planar_robotfk_x, planar_robotfk_y, double_pendulum_theta1dot, double_pendulum_theta2dot
  L5: planar_robot_j11, planar_robot_j12, planar_robot_j21, planar_robot_j22 (full Jacobian matrix)

Benchmark results:
  All 20 tasks: 100% DR, mean DCS 0.818 (n=200, noise=0.01, 5 seeds).
  Tables 5–6 (noise/size robustness): 100% DR confirmed for the 4-task core suite across σ ∈ {0.001, 0.01, 0.05, 0.10, 0.20} and n ∈ {50, 100, 200, 500, 1000}.
  Robot Jacobian robustness: 100% DR at σ ∈ {0.0, 0.01, 0.05}, 15 runs per task (13 hours total runtime). The high-noise sweep (σ = 0.10, 0.20) is future work.
  Mean model fit confidence for Jacobian tasks: 0.971–0.973 across J11–J22.

Triple-stack clarification: The score-based diffusion and JEPA components are implemented in the PIR codebase but are not activated in any benchmark run reported in this paper. All tables reflect the OT+flow-prior configuration (--hybrid-ot --use-flow-prior). The Diffusion+JEPA ablation will be reported in the PIR-JEPA paper.

Infrastructure note: The read_results.py reporter aggregates all JSON artifacts in the results/ directory. Stale artifacts from prior runs must be archived before running the reporter; otherwise, mixing incompatible runs can produce spuriously reduced DR values. The benchmark runner summary is the authoritative source for any individual run set.
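The archiving step in the infrastructure note can be scripted. The following is a minimal sketch, not part of PIR-Bench: the function name `archive_stale_artifacts` is hypothetical, and it assumes the artifacts are `*.json` files sitting directly in the results directory. It moves them into a timestamped folder next to that directory (a layout chosen here for illustration), so that a reporter scanning results/ would not see them.

```python
import shutil
import time
from pathlib import Path


def archive_stale_artifacts(results_dir, archive_root=None):
    """Move every *.json file directly under results_dir into a timestamped folder.

    Hypothetical helper (not part of PIR-Bench). By default the archive lives
    next to results_dir rather than inside it, so that a reporter which scans
    the results directory does not pick the archived files up again.
    Returns the list of new file locations (empty if nothing was moved).
    """
    results = Path(results_dir)
    stale = sorted(results.glob("*.json"))
    if not stale:
        return []

    # Assumption: a sibling "<name>_archive" directory is an acceptable location.
    root = Path(archive_root) if archive_root else results.parent / (results.name + "_archive")
    dest = root / time.strftime("%Y%m%d-%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in stale:
        target = dest / path.name
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

Calling `archive_stale_artifacts("results")` before a new run set leaves non-JSON files in place and returns the new locations of the moved artifacts, which can be logged for traceability.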
Reproducibility:

    python rundiscoverybenchmark.py --hybrid-ot --experiments all \
      --base-seed 0 --repeats 5 --dataset-sizes 200 --noise-levels 0.01 \
      --alpha 0.7 --beta 0.3 --output-dir results_v3

Robot Jacobian:

    python rundiscoverybenchmark.py --hybrid-ot --no-dim-filter \
      --experiments planar_robot_j11 planar_robot_j12 \
      planar_robot_j21 planar_robot_j22 \
      --dataset-sizes 200 --repeats 5

Version history:
  V1 (2026-03-22): 9 tasks, Kepler 20%, gravity 0%.
  V2 (2026-03-26): 9/9 tasks, Fix A–C, numerical equivalence criterion.
  V3 (2026-04-05): 20/20 tasks, Jacobian + Lagrangian tasks added, triple-stack status clarified.
  V3.1 (April 2026): Robot Jacobian noise robustness extended to the full sweep σ ∈ {0.0, 0.01, 0.05, 0.10, 0.20}. 100% DR confirmed at all noise levels for all four Jacobian entries (J11–J22). Table 7 regenerated with per-noise-level DCS and MAE. Limitations section updated. No other changes from V3.
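The run counts quoted for the Robot Jacobian sweep are consistent with a full grid of noise levels times repeats: three noise levels with five repeats give the stated 15 runs per task. The sketch below enumerates such a grid. It is an illustration only: the helper names `sweep_grid` and `runs_per_task` are hypothetical, and reading `--base-seed 0 --repeats 5` as seeds 0 to 4 is an assumption about the runner, not something the source confirms.

```python
from itertools import product

# Task names and noise levels are taken from the text above.
JACOBIAN_TASKS = [
    "planar_robot_j11",
    "planar_robot_j12",
    "planar_robot_j21",
    "planar_robot_j22",
]
NOISE_LEVELS_V3 = [0.0, 0.01, 0.05]
NOISE_LEVELS_V3_1 = [0.0, 0.01, 0.05, 0.10, 0.20]


def sweep_grid(tasks, noise_levels, base_seed=0, repeats=5):
    """Enumerate one run per (task, noise level, seed) combination.

    Assumption: seeds are base_seed, base_seed + 1, ..., base_seed + repeats - 1.
    """
    seeds = range(base_seed, base_seed + repeats)
    return [
        {"task": task, "noise": noise, "seed": seed}
        for task, noise, seed in product(tasks, noise_levels, seeds)
    ]


def runs_per_task(grid):
    """Count how many runs in the grid belong to each task."""
    counts = {}
    for run in grid:
        counts[run["task"]] = counts.get(run["task"], 0) + 1
    return counts
```

Under these assumptions, `runs_per_task(sweep_grid(JACOBIAN_TASKS, NOISE_LEVELS_V3))` yields 15 for each Jacobian entry, matching the figure reported for V3. The same grid over the five V3.1 noise levels would contain 25 runs per task if the repeat count stayed at 5, which the source does not state.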
Muhammad Hanif
Muhammad Hanif (Mon,) studied this question.
www.synapsesocial.com/papers/69d5f17974eaea4b11a7af8e — DOI: https://doi.org/10.5281/zenodo.19435081