Bioconductor hosts more than 2,200 packages for statistical genomics, yet constructing correct, executable analysis workflows from them remains labor-intensive because no structured, step-level knowledge resource exists at ecosystem scale. We present BioMate-KB, a curated knowledge base of 15,641 workflow steps extracted from ~2,241 Bioconductor 3.20 packages. Workflows are first structure-validated against NAMESPACE-exported function APIs and then validated by real execution: each workflow is run end-to-end in a dependency-complete environment on synthesized realistic inputs, with an assertion that its declared outputs are produced. Steps are annotated with EDAM ontology terms, linked to BioContainers images, and enriched with software DOIs and vignette-source provenance; a dual-agent LLM review confirms 90.0% step correctness (κ = 0.96) across seven domains. The distinguishing contribution is a principled separation between structurally well-formed and actually runnable workflows. A static parse gate proves that a workflow is well-formed but not that it runs, and real execution reveals that fewer than half of attempted head workflows complete on the first attempt, so labels such as "indexed" or "dry-run-validated" substantially overstate runnability. We therefore reframe executability as an empirical, reproducible property recorded as a first-class database field. In total, 732 head workflows are real-execution-validated and enriched with per-step visualization and QC metadata, and BioMate routes natural-language queries only to this validated set. The public top-100 package skill bundle is freely available under CC-BY-4.0.
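The two-stage validation described above can be illustrated with a minimal Python sketch. It is not the BioMate-KB implementation: the record layout, the `Status` values, the function names (`structure_gate`, `execution_gate`, `routable`), the package `pkgA`, and the stand-in runner are all hypothetical, chosen only to show how executability could be stored as an explicit field and how query routing could be restricted to execution-validated workflows.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):
    # Hypothetical status values; the real database field may differ.
    INDEXED = "indexed"
    STRUCTURE_VALIDATED = "structure_validated"
    EXECUTION_VALIDATED = "execution_validated"
    EXECUTION_FAILED = "execution_failed"


@dataclass
class Step:
    package: str
    function: str


@dataclass
class Workflow:
    workflow_id: str
    steps: list[Step]
    declared_outputs: set[str]
    status: Status = Status.INDEXED


def structure_gate(wf: Workflow, exports: dict[str, set[str]]) -> bool:
    """Static check: every step must call a function its package exports."""
    ok = all(s.function in exports.get(s.package, set()) for s in wf.steps)
    if ok:
        wf.status = Status.STRUCTURE_VALIDATED
    return ok


def execution_gate(wf: Workflow, runner: Callable[[Workflow], set[str]]) -> bool:
    """Dynamic check: run the workflow and require all declared outputs."""
    if wf.status is not Status.STRUCTURE_VALIDATED:
        return False
    try:
        produced = runner(wf)
    except Exception:
        produced = None  # a crash counts as a failed execution
    ok = produced is not None and wf.declared_outputs <= produced
    wf.status = Status.EXECUTION_VALIDATED if ok else Status.EXECUTION_FAILED
    return ok


def routable(workflows: list[Workflow]) -> list[Workflow]:
    """Only execution-validated workflows are eligible for query routing."""
    return [w for w in workflows if w.status is Status.EXECUTION_VALIDATED]


# Toy data: the package, functions, and output names are made up.
exports = {"pkgA": {"load_counts", "fit_model"}}
good = Workflow("wf-good", [Step("pkgA", "load_counts"), Step("pkgA", "fit_model")], {"results.csv"})
flaky = Workflow("wf-flaky", [Step("pkgA", "load_counts")], {"plot.pdf"})
bad = Workflow("wf-bad", [Step("pkgA", "not_exported")], {"out.txt"})


def fake_runner(wf: Workflow) -> set[str]:
    # Stand-in for a real end-to-end run in a dependency-complete environment.
    return {"results.csv"} if wf.workflow_id == "wf-good" else set()


for wf in (good, flaky, bad):
    if structure_gate(wf, exports):
        execution_gate(wf, fake_runner)

print([w.workflow_id for w in routable([good, flaky, bad])])  # prints ['wf-good']
```

In this toy run, `wf-flaky` passes the static gate but fails execution because its declared output is never produced, which is the gap between well-formed and runnable that the abstract emphasizes; `wf-bad` never leaves the indexed state.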
Yaoyun Zhang (Tue,) studied this question.