Effective scientific literature retrieval requires moving beyond surface-level term matching toward structured semantic reasoning. This paper presents a controlled empirical study of multi-stage retrieval for scientific literature, integrating lexical matching, dense semantic modeling, hybrid fusion, and cross-encoder re-ranking within a unified evaluation framework. The study is designed to analyze the interactions, trade-offs, and failure modes of these components in claim-based scientific search. Experiments on the SciFact benchmark demonstrate that dense models capture semantic similarity but remain insufficient when used in isolation. Hybrid fusion broadens the candidate pool but does not consistently outperform the best standalone dense retriever, as RRF-based fusion can dilute strong dense rankings when lexical and semantic signals diverge. Cross-encoder re-ranking proves to be the primary driver of final performance gains, with the best configuration, Hybrid (SciNCL + BM25) + Cross-Encoder, reaching NDCG@10 of 0.523, MAP@10 of 0.479, Recall@10 of 0.642, and MRR@10 of 0.497. Ablation analysis shows that lexical pseudo-relevance feedback (RM3) introduces query drift in claim-focused retrieval, and that passage-level max pooling weakens effectiveness by fragmenting document-level evidence. Cross-domain evaluation on SciFact, PubMedQA, and SciDocs demonstrates that the relative ranking of retrieval paradigms remains stable across datasets with varying difficulty levels, while also revealing that the RRF dilution effect intensifies on harder retrieval tasks. These findings suggest that effective scientific retrieval benefits from integrated multi-stage pipelines, and that understanding component-level interactions is essential for designing robust retrieval systems.
Al-Joofi et al. studied this question.
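The dilution effect attributed to RRF-based fusion above can be illustrated with a minimal sketch of reciprocal rank fusion. It is not the authors' implementation: the function name `rrf_fuse`, the smoothing constant `k=60` (a commonly used default), and the toy document ids and rankings are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain it.
    k=60 is an assumed default, not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical toy rankings: the dense retriever places d1 first,
# while the lexical retriever places the same document last.
dense = ["d1", "d2", "d3", "d4"]
bm25 = ["d3", "d2", "d4", "d1"]

fused = rrf_fuse([dense, bm25])
print(fused)
```

In this toy case the document ranked first by the dense retriever falls to third place after fusion, because RRF rewards documents that both lists rank moderately well over a document that only one list ranks highly. This is one way a strong dense ranking can be diluted when lexical and semantic signals diverge.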