What question did this study set out to answer?

This research aims to evaluate the effectiveness of multi-dimensional scoring in process reward models for domain-specific reasoning tasks.

April 28, 2026 · Open Access

Cheap PRMs: Multi-Dimensional Process Reward Modeling for Domain-Specialized Reasoning

Key Points

Developed NPC Fin-PRM 7B, a 7B QLoRA-trained Process Reward Model, and evaluated it on DeFi/crypto financial analysis.
Scored each reasoning step on four dimensions: factual accuracy, logical validity, completeness, and risk awareness.
Used a stratified held-out validation split of 200 examples to assess agreement with judge labels.
Achieved a Spearman correlation of 0.92 with judge labels, a rating accuracy of 88.5%, and an error-detection F1 of 0.84.
In out-of-distribution evaluation on 307 gold-correct math steps from GSM8K and MATH-500, only 5.2% were mis-flagged as flawed, with a mean overall score of 0.856.
The model is poorly calibrated as a probability (ECE = 0.21): it over-flags in the 0.1-0.5 score band and under-flags around 0.5-0.7.
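
The two headline metrics above measure different things: Spearman correlation depends only on how steps are ranked, while expected calibration error (ECE) depends on whether a score of, say, 0.9 corresponds to roughly 90% of steps being correct. The following is a minimal sketch of both, not the authors' evaluation code. It assumes binary labels (1 = the judge considers the step correct), 10 equal-width bins, and hypothetical function names; the paper's actual binning and label definition are not stated here.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: size-weighted mean of |accuracy - confidence|."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - confidence)
    return ece
```

Because any strictly increasing transformation of the scores leaves the ranks unchanged, a model can agree closely with the judge in rank order and still have a large ECE, which is the pattern reported in the key points.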

Abstract

Process Reward Models (PRMs) score individual reasoning steps for correctness and have become central to recent reasoning-model training pipelines. Most published PRMs are math-only and use a single scalar correctness signal; we examine whether multi-dimensional scoring on a domain-specialized reasoning task (DeFi/crypto financial analysis) is worth the extra modeling surface. We describe NPC Fin-PRM 7B, a 7B QLoRA-trained PRM scored on four dimensions (factual_accuracy, logical_validity, completeness, and risk_awareness), trained in 17.4 hours on a single H100 from approximately 80,000 step-level judge labels generated by Qwen2.5-72B over 4,866 reasoning trees. The judge is served locally on the same H100 via vLLM. On a stratified 200-example held-out validation split, the model achieves Spearman 0.92 against judge labels (rating accuracy 88.5%, error-detection F1 0.84) at MAE 0.04 on the 0-1 score scale. Out-of-distribution evaluation on 307 gold-correct math-reasoning steps from GSM8K and MATH-500 finds only 5.2% mis-flagged as flawed and a mean overall_score of 0.856. This is substantially better cross-domain transfer than expected for a DeFi-only training corpus, with a side effect: the model extrapolates beyond its training labels by emitting EXCELLENT and PERFECT ratings on 3.9% of OOD math steps despite never being trained to produce them. Two findings stand out. First, three of the four dimensions form a tightly correlated cluster (pairwise Spearman 0.85-0.92) that is largely captured by a single axis; the judge's overall score is 95% explained by logical_validity alone. Second, despite Spearman 0.92 the model is poorly calibrated as a probability (ECE = 0.21): it over-flags in the 0.1-0.5 score band and under-flags around 0.5-0.7, so the score scale should not be used as a calibrated probability without a Platt-scaling or isotonic-regression sidecar.
The contribution is recipe-level: a complete pipeline for domain-specialized process reward modeling on accessible hardware, with the limitations and unmet experiments reported alongside the wins.
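
The abstract recommends a Platt-scaling or isotonic-regression sidecar before treating scores as probabilities. As an illustration of the isotonic option only, here is a minimal pool-adjacent-violators sketch in plain Python. The function names, the binary label convention (1 = step judged correct), and the step-function lookup are assumptions made for this example and are not taken from the paper's pipeline.

```python
from bisect import bisect_left
from typing import Sequence


def fit_isotonic(
    scores: Sequence[float], labels: Sequence[int]
) -> tuple[list[float], list[float]]:
    """Fit a non-decreasing step map from raw score to P(step is correct).

    Returns (thresholds, values): the largest raw score in each pooled
    block and that block's calibrated value (mean label in the block).
    """
    # Each block holds [sum of labels, count, largest raw score in block].
    blocks: list[list[float]] = []
    for s, y in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == s:
            # Identical raw scores must share one calibrated value.
            blocks[-1][0] += y
            blocks[-1][1] += 1
        else:
            blocks.append([float(y), 1, s])
        # Pool adjacent blocks while their means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def calibrate(
    score: float, thresholds: Sequence[float], values: Sequence[float]
) -> float:
    """Look up the first block whose upper score is >= the raw score."""
    idx = bisect_left(thresholds, score)
    return values[min(idx, len(values) - 1)]  # clamp scores above the range


# Toy data for illustration only; not results from the paper.
raw_scores = [0.2, 0.3, 0.4, 0.6, 0.7, 0.9]
judge_ok = [0, 1, 0, 0, 1, 1]
thresholds, values = fit_isotonic(raw_scores, judge_ok)
calibrated = [calibrate(s, thresholds, values) for s in raw_scores]
```

Isotonic regression preserves the ordering of scores up to ties, so the ranking quality of the model is not degraded by the sidecar. The map should be fitted on a calibration split that is separate from the data used to report ECE, since fitting and evaluating on the same examples would overstate the improvement.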
