As the reasoning capabilities of large language models (LLMs) have improved, their applications have rapidly expanded across a wide range of tasks. In question answering, recent performance gains have come from Self-Consistency, in which an LLM generates multiple reasoning paths and the final answer is chosen by majority voting. However, this approach can fail when the correct answer is generated but does not appear frequently enough to win the vote, which makes it vulnerable to inconsistent generations. To address this, we propose Adaptive Confidence Re-scoring (ACR), a method that adaptively evaluates and re-scores candidate answers to select the most trustworthy one when LLMs fail to generate consistent reasoning. Experiments on arithmetic and logical reasoning benchmarks show that ACR maintains or improves answer accuracy while significantly reducing inference cost. Compared to existing verification methods such as FOBAR, ACR reduces the number of inference calls by up to 95% and improves inference efficiency, measured as accuracy gain per inference call, by a factor of 2× to 17×, depending on the dataset and model.
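The majority-voting step and the efficiency metric described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the vote-share threshold used to flag inconsistent generations, and the sample answers are all assumptions made for illustration, and the abstract does not specify how ACR detects inconsistency or re-scores candidates.

```python
from collections import Counter


def majority_vote(answers):
    """Self-Consistency style selection: return (top answer, its vote share)."""
    if not answers:
        raise ValueError("answers must not be empty")
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def needs_rescoring(answers, threshold=0.5):
    """Hypothetical trigger (an assumption, not from the paper): treat the
    sampled answers as inconsistent when the top vote share is below threshold."""
    _, share = majority_vote(answers)
    return share < threshold


def inference_efficiency(accuracy, baseline_accuracy, num_calls):
    """Accuracy gain per inference call, as the abstract defines efficiency."""
    return (accuracy - baseline_accuracy) / num_calls


# Consistent sample: the vote is decisive, so no extra verification is needed.
consistent = ["20", "18", "20", "24", "20"]
# Inconsistent sample: if the correct answer were "18", it was generated
# but loses the vote, which is the failure case the abstract describes.
inconsistent = ["18", "20", "24", "20", "22"]

print(majority_vote(consistent), needs_rescoring(consistent))
print(majority_vote(inconsistent), needs_rescoring(inconsistent))
```

Under these assumptions, only the inconsistent sample would be passed on to a more expensive re-scoring step, which is one plausible way an adaptive method could save inference calls relative to verifying every question.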
Eunhye Jeong
Yong Suk Choi
Applied Sciences
Hanyang University
Jeong et al. (Sat,) studied this question.
www.synapsesocial.com/papers/68bb5f586d6d5674bcd039d2 — DOI: https://doi.org/10.3390/app15179587