May 22, 2026 | Open Access

Adversarial Verification Biology: A four-layer architecture with cumulative epistemic memory for AI hypothesis robustness in computational biology

Key Points

The study aims to improve the reliability of AI-generated biological hypotheses by mitigating LLM hallucinations.
It proposes a four-layer architecture for adversarial verification of hypotheses in AI workflows.
A single-session case study catalogued AI interpretation errors, which were then resubmitted to an independent Skeptic agent.
Nine distinct AI errors were catalogued, and the independent Skeptic agent flagged all nine as Invalid (9/9 recall, binomial p = 0.0020).
The errors spanned seven categories, including structural and statistical-logical issues.
Empirical validation is planned on a benchmark of 100 biological claims to assess verification reliability.
The pilot supports, but does not fully validate, a transition from cooperative AI assistance to adversarial AI verification in computational biology.

Abstract

Background. Large language model (LLM) AI assistants are increasingly used in computational biology workflows for hypothesis generation, but hallucination rates remain a documented concern. Reported hallucination patterns vary substantially by context: from low rates in carefully bounded clinical summarisation (1.47% in Asgari et al. 2025) to higher rates in open-ended prompting (with prompt sensitivity and model variability both contributing significantly). Existing mitigation strategies operate at the model level (prompt engineering, retrieval-augmented generation, human review) and depend critically on human reviewer discipline. The asymmetry between rapid AI generation and slow human verification leaves AI-collaborative biology vulnerable to systematic interpretation errors.

Proposal. We propose Adversarial Verification Biology (AVB), a four-layer architecture extending AI red-teaming methodology and Popperian falsificationism to AI-collaborative scientific inference. The architecture comprises: (A) a Generator producing biological hypotheses; (B) a Skeptic Engine with seven attack modules actively attempting to falsify each hypothesis; (C) a Weighted Survivorship Score incorporating an Orthogonality Depth measure to discount correlated verifications; (D) a Failure Ontology providing cumulative epistemic memory across sessions.

Empirical pilot (Test 1). We motivate AVB through a single-session case study (ENKI longevity-aging genomic framework) in which nine distinct AI interpretation errors were identified during a 10-hour collaborative analysis. The same nine errors were then submitted to an independent Skeptic agent based on a different LLM (Kimi, Moonshot AI), with no prior knowledge of the original session. The independent Skeptic identified all nine errors correctly as Invalid (9/9 recall, binomial p = 0.0020 under chance H0 = 0.5), spanning seven distinct error categories (structural, annotation, pipeline, statistical-logical, literal output, test-design, meta-statistical). The result exceeds the pre-registered threshold (≥7/9 = promising). Important methodological caveat: the test prompts disclosed key data points that enabled Skeptic reasoning, limiting generalisation to fully adversarial settings.

Empirical validation plan (forthcoming). The AVB Benchmark v1.0 (100 biological claims: 50 validated, 50 retracted/non-reproducible, with formal FPR/FNR/calibration metrics) remains the principal empirical commitment. Test 1 provides preliminary evidence motivating Benchmark v1.0 construction.

Position. AI-collaborative computational biology should transition from cooperative AI assistance to adversarial AI verification. The pilot result supports the architectural commitment but does not constitute full validation. This paper applies AVB recursively to itself: v0.1 was challenged by Skeptic review (producing v0.2); v0.2's central architectural claim was tested empirically (producing v0.2.1); v0.2.2 incorporates further corrections after a third Skeptic pass identified hallucinated bibliographic references and policy non-compliance issues. Each iteration strengthens the framework while maintaining transparency about what remains unverified.

Companion case study: ENKI longevity-aging framework, Zenodo DOI 10.5281/zenodo.18613292.

Keywords: AI-human collaboration, adversarial verification, computational biology, hallucination mitigation, red-teaming, falsificationism, failure ontology, orthogonality, empirical pilot, reproducibility.
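
The significance figure reported for Test 1 can be checked by hand. The sketch below assumes the reported p-value is a one-sided exact binomial tail probability with each of the nine verdicts treated as an independent trial at chance level 0.5; the abstract does not state the sidedness, so that is an assumption, and the helper name `binomial_upper_tail` is purely illustrative rather than anything from the paper.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p0: float = 0.5) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p0) (one-sided exact test)."""
    return sum(
        comb(n, i) * p0**i * (1 - p0) ** (n - i)
        for i in range(k, n + 1)
    )


# Observed pilot outcome: 9 of 9 errors flagged as Invalid.
p_observed = binomial_upper_tail(9, 9)

# Tail probability of merely reaching the pre-registered ">= 7/9" threshold
# under the same chance-level null (assumption: same one-sided test).
p_threshold = binomial_upper_tail(7, 9)

print(f"P(X >= 9 | n=9, p0=0.5) = {p_observed:.4f}")
print(f"P(X >= 7 | n=9, p0=0.5) = {p_threshold:.4f}")
```

Under these assumptions the 9/9 outcome has probability 1/512, which rounds to the reported 0.0020, whereas the threshold of at least 7/9 alone would be reached by chance with probability 46/512. A two-sided test would double the first value, so the match with 0.0020 is consistent with a one-sided reading.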
