What question did this study set out to answer?

This study aims to develop a mechanism that improves the effectiveness of question answering systems by adaptively adjusting model contributions based on question features.

June 5, 2026 | Open Access

Adaptive confidence ensemble reranking for reliable knowledge-intensive question answering

Key Points

Proposed an adaptive confidence ensemble reranking framework using dual-encoder retrieval and cross-encoder models (BERT, RoBERTa, DeBERTa).
Implemented an instance-aware adaptive confidence weighting mechanism driven by cross-entropy evaluation.
Conducted experiments on multiple benchmark datasets including MS MARCO, WikiQA, and TREC-QA.
Achieved a 9.3% increase in mean average precision on MS MARCO.
Reported a 2.1% gain on WikiQA.
Demonstrated a 1.2% improvement on TREC-QA.
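
For readers unfamiliar with the metric behind the MS MARCO figure, the following is a minimal sketch of how mean average precision (MAP) is commonly computed for answer reranking. The function names are illustrative and not taken from the paper, and the sketch assumes every relevant answer for a question appears somewhere in its ranked candidate list.

```python
def average_precision(ranked_labels):
    """Average precision for one question.

    ranked_labels[i] is 1 if the answer at rank i + 1 is relevant, else 0.
    Assumption: all relevant answers are present in the ranked list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked_labels):
    """Mean of per-question average precision over a set of questions."""
    if not all_ranked_labels:
        return 0.0
    total = sum(average_precision(labels) for labels in all_ranked_labels)
    return total / len(all_ranked_labels)
```

A reranker that moves relevant answers toward the top of each list raises the per-question average precision, which is what the reported MAP gain measures.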

Abstract

Modern question answering systems rely on advanced neural architectures to bridge the semantic gap between natural language queries and relevant textual evidence. However, single-model approaches often struggle to capture diverse linguistic variations, while traditional ensemble methods suffer from high computational complexity and static model combination strategies. Existing answer reranking systems lack adaptive mechanisms to dynamically adjust model contributions based on question characteristics, limiting their effectiveness across knowledge-intensive tasks. This study proposes an adaptive confidence ensemble reranking framework designed to improve reliability and efficiency in knowledge-intensive question answering. The proposed approach integrates dual-encoder retrieval with a cross-encoder ensemble of BERT, RoBERTa, and DeBERTa models, combined through an instance-aware adaptive confidence weighting mechanism. The framework dynamically adjusts model contributions using cross-entropy-based evaluation to optimize answer ranking performance while maintaining computational feasibility. Experimental results demonstrate significant performance improvements across multiple benchmark datasets, including a 9.3% increase in mean average precision on MS MARCO, a 2.1% gain on WikiQA, and a 1.2% improvement on TREC-QA. These findings highlight the effectiveness of the proposed method in enhancing reliability and accuracy across diverse knowledge-intensive question answering scenarios.
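
The abstract does not give the weighting formula, so the following is only a rough sketch of what instance-aware confidence weighting over a cross-encoder ensemble could look like. It makes several assumptions that are not from the paper: each model returns a relevance probability per candidate answer, a model's confidence on a question is approximated by the negative mean binary entropy of those probabilities (a stand-in for the cross-entropy-based evaluation the abstract mentions), and the weights come from a softmax over those confidences. The names `instance_weights`, `rerank`, and `temperature` are hypothetical.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli prediction with probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def instance_weights(model_probs, temperature=1.0):
    """Per-question softmax weights over models.

    model_probs maps a model name to its relevance probabilities for the
    candidates of ONE question. Assumption: lower mean entropy means the
    model is more confident on this question and earns a larger weight.
    """
    confidences = {
        name: -sum(binary_entropy(p) for p in probs) / len(probs)
        for name, probs in model_probs.items()
    }
    max_conf = max(confidences.values())  # subtract max for numerical stability
    exps = {
        name: math.exp((conf - max_conf) / temperature)
        for name, conf in confidences.items()
    }
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


def rerank(candidates, model_probs, temperature=1.0):
    """Fuse model scores with per-question weights and sort best-first."""
    weights = instance_weights(model_probs, temperature)
    fused = [
        sum(weights[name] * probs[i] for name, probs in model_probs.items())
        for i in range(len(candidates))
    ]
    order = sorted(range(len(candidates)), key=lambda i: fused[i], reverse=True)
    return [(candidates[i], fused[i]) for i in order]


# Illustrative scores only; these are not results from the paper.
scores = {
    "bert": [0.9, 0.1, 0.2],
    "roberta": [0.5, 0.5, 0.5],
    "deberta": [0.6, 0.7, 0.4],
}
ranking = rerank(["a0", "a1", "a2"], scores)
```

Because the weights are recomputed for every question, a model that is uncertain on one question (here the hypothetical `roberta` scores, all at 0.5) contributes less to that ranking without being down-weighted globally, which is the contrast with a static ensemble that the abstract draws.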
