Single-seed evaluation, the dominant reporting practice in small-dataset molecular learning, can substantially inflate performance estimates yet remains largely unexamined. We present the first systematic reproducibility analysis for RNA-ligand binding site prediction by integrating two large pretrained RNA language models (RNA-FM and RiNALMo) across multiple fusion architectures and replicated training runs on the TR60/TE18 benchmark. Our analysis reveals a pronounced Peak-SOTA Paradox: a favorable initialization in the Reverse Cross-Attention model reached an MCC of 0.353, surpassing the reported state-of-the-art (0.327), whereas multi-seed replication yielded only 0.266 ± 0.020, a 32.8% overestimation. Across architectures, mean accuracy remained tightly clustered, yet reproducibility varied substantially. Simple concatenation-based fusion strategies exhibited markedly higher stability than attention-based models, indicating that architectural entanglement rather than parameter count governs variance under data scarcity. Collectively, these findings establish reproducibility as a primary evaluation criterion for small-sample molecular prediction and motivate a dual-reporting standard in which mean ± SD serves as the principal metric and peak scores as supplementary evidence. This variance-aware perspective highlights that single-seed evaluations can misrepresent expected performance by 20–30% in limited-sample regimes.
Guan et al. studied this question.
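The dual-reporting standard described in the abstract can be illustrated with a minimal sketch. The function name `dual_report`, the seed labels, and the MCC values below are hypothetical placeholders chosen for illustration; they are not results from the paper, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the abstract does not state which estimator was used.

```python
import statistics


def dual_report(mcc_by_seed):
    """Summarize per-seed MCC scores as mean, SD, and peak.

    mcc_by_seed: dict mapping a seed label to the test MCC of that run.
    """
    scores = list(mcc_by_seed.values())
    mean = statistics.mean(scores)
    # Sample standard deviation (n - 1 denominator); an assumption here.
    sd = statistics.stdev(scores)
    # The best single run: what a single-seed report might have shown.
    peak_seed = max(mcc_by_seed, key=mcc_by_seed.get)
    peak = mcc_by_seed[peak_seed]
    return {
        "mean": mean,
        "sd": sd,
        "peak": peak,
        "peak_seed": peak_seed,
        # Relative gap between the peak run and the multi-seed mean.
        "overestimation_pct": 100.0 * (peak - mean) / mean,
    }


# Hypothetical MCC values for five seeds (illustrative only).
runs = {0: 0.25, 1: 0.27, 2: 0.24, 3: 0.30, 4: 0.26}
report = dual_report(runs)

# Principal metric: mean and SD; supplementary evidence: peak score.
print(f"MCC = {report['mean']:.3f} ± {report['sd']:.3f} "
      f"(peak {report['peak']:.3f}, seed {report['peak_seed']})")
print(f"Peak exceeds mean by {report['overestimation_pct']:.1f}%")
```

Reporting the peak alongside the mean and SD, rather than in place of them, keeps the best-case run visible while making clear how far it sits from expected performance.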