Background. Large language model (LLM) assistants are increasingly used in computational biology workflows for hypothesis generation, but hallucination remains a documented concern. Reported hallucination rates vary substantially by context, from low rates in carefully bounded clinical summarisation (1.47% in Asgari et al. 2025) to higher rates in open-ended prompting, where prompt sensitivity and model variability both contribute significantly. Existing mitigation strategies operate at the model level (prompt engineering, retrieval-augmented generation, human review) and depend critically on human reviewer discipline. The asymmetry between rapid AI generation and slow human verification leaves AI-collaborative biology vulnerable to systematic interpretation errors.

Proposal. We propose Adversarial Verification Biology (AVB), a four-layer architecture that extends AI red-teaming methodology and Popperian falsificationism to AI-collaborative scientific inference. The architecture comprises: (A) a Generator that produces biological hypotheses; (B) a Skeptic Engine with seven attack modules that actively attempt to falsify each hypothesis; (C) a Weighted Survivorship Score that incorporates an Orthogonality Depth measure to discount correlated verifications; and (D) a Failure Ontology that provides cumulative epistemic memory across sessions.

Empirical pilot (Test 1). We motivate AVB through a single-session case study (the ENKI longevity-aging genomic framework) in which nine distinct AI interpretation errors were identified during a 10-hour collaborative analysis. The same nine errors were then submitted to an independent Skeptic agent based on a different LLM (Kimi, Moonshot AI) with no prior knowledge of the original session.
The independent Skeptic correctly classified all nine errors as Invalid (9/9 recall; binomial p = 0.0020 under a chance null of H0 = 0.5). The errors span seven distinct categories: structural, annotation, pipeline, statistical-logical, literal output, test-design, and meta-statistical. The result exceeds the pre-registered threshold (>=7/9, classed as promising). Important methodological caveat: the test prompts disclosed key data points that enabled the Skeptic's reasoning, which limits generalisation to fully adversarial settings.

Empirical validation plan (forthcoming). The AVB Benchmark v1.0, comprising 100 biological claims (50 validated, 50 retracted or non-reproducible) with formal FPR, FNR, and calibration metrics, remains the principal empirical commitment. Test 1 provides preliminary evidence that motivates construction of Benchmark v1.0.

Position. AI-collaborative computational biology should transition from cooperative AI assistance to adversarial AI verification. The pilot result supports the architectural commitment but does not constitute full validation. This paper applies AVB recursively to itself: v0.1 was challenged by Skeptic review (producing v0.2); v0.2's central architectural claim was tested empirically (producing v0.2.1); and v0.2.2 incorporates further corrections after a third Skeptic pass identified hallucinated bibliographic references and policy non-compliance issues. Each iteration strengthens the framework while remaining transparent about what is still unverified.

Companion case study: ENKI longevity-aging framework, Zenodo DOI 10.5281/zenodo.18613292.

Keywords: AI-human collaboration, adversarial verification, computational biology, hallucination mitigation, red-teaming, falsificationism, failure ontology, orthogonality, empirical pilot, reproducibility.
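The binomial figure quoted for Test 1 can be checked with a short calculation. The sketch below is illustrative only and assumes a one-sided upper-tail test in which each of the nine classifications is an independent trial with success probability 0.5 under the null; the function name `binomial_upper_tail` is hypothetical and not taken from the paper.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """One-sided P(X >= k) for X ~ Binomial(n, p).

    Assumption: trials are independent with equal success probability p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Observed pilot outcome: 9 of 9 errors classified as Invalid.
p_all_nine = binomial_upper_tail(9, 9)   # 0.5**9 = 1/512

# Chance probability of meeting the >=7/9 threshold under the same null.
p_threshold = binomial_upper_tail(7, 9)  # (36 + 9 + 1)/512 = 46/512

print(f"P(X >= 9 | n=9, p=0.5) = {p_all_nine:.4f}")   # 0.0020
print(f"P(X >= 7 | n=9, p=0.5) = {p_threshold:.4f}")  # 0.0898
```

The first value matches the reported p = 0.0020. The second is simply the same tail computation applied to the threshold and is not a figure stated in the abstract.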
benito arnaldo silvestri (Wed,) studied this question.