What question did this study set out to answer?

The aim is to create a validated, structured knowledge base for constructing executable analysis workflows in Bioconductor.

June 11, 2026 · Open Access

BioMate-KB: A Real-Execution-Validated Workflow Knowledge Base for Bioconductor

Key Points

BioMate-KB contains 15,641 workflow steps extracted from about 2,241 Bioconductor 3.20 packages; workflows are first structure-validated against NAMESPACE-exported function APIs and then validated by real execution.
A dual-agent LLM review confirms 90.0% step correctness (κ = 0.96) across seven domains.
Steps are annotated with EDAM ontology terms, linked to BioContainers images, and enriched with software DOIs and vignette-source provenance.
732 head workflows are real-execution-validated and enriched with per-step visualization and QC metadata.
Fewer than half of attempted head workflows complete on the first attempt, so labels such as "indexed" or "dry-run-validated" substantially overstate runnability.
The public top-100 package skill bundle is freely available under CC-BY-4.0.
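
The distinction between a workflow that passes structure checks and one that actually runs can be illustrated with a minimal sketch. Everything here is hypothetical and is not the BioMate-KB implementation: the `Step` and `Workflow` classes, the `executability` field and its three status values, and the toy Python functions standing in for exported package functions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Step:
    function: str  # name of the function this step calls
    output: str    # name of the output this step declares


@dataclass
class Workflow:
    name: str
    steps: List[Step]
    executability: str = "unknown"  # hypothetical first-class status field


def structure_valid(wf: Workflow, exported: Set[str]) -> bool:
    """Static gate: every step must call a function in the exported set."""
    return all(step.function in exported for step in wf.steps)


def runs_end_to_end(wf: Workflow,
                    impls: Dict[str, Callable[[dict], object]],
                    inputs: dict) -> bool:
    """Execution gate: run each step in order and check declared outputs exist."""
    env = dict(inputs)
    try:
        for step in wf.steps:
            env[step.output] = impls[step.function](env)
    except Exception:
        return False
    return all(env.get(step.output) is not None for step in wf.steps)


def validate(wf: Workflow, exported: Set[str],
             impls: Dict[str, Callable[[dict], object]], inputs: dict) -> str:
    """Record the strongest validation level the workflow reaches."""
    if not structure_valid(wf, exported):
        wf.executability = "structure_failed"
    elif runs_end_to_end(wf, impls, inputs):
        wf.executability = "execution_validated"
    else:
        wf.executability = "structure_only"
    return wf.executability


# Toy stand-ins for exported functions (assumptions, not real Bioconductor APIs).
impls = {
    "normalize": lambda env: [x / max(env["counts"]) for x in env["counts"]],
    "fit": lambda env: sum(env["normalized"]) / len(env["normalized"]),
    "plot": lambda env: env["missing_table"],  # fails at run time with KeyError
}
exported = set(impls)
inputs = {"counts": [1, 2, 4]}  # synthesized input

good = Workflow("good", [Step("normalize", "normalized"), Step("fit", "model")])
broken = Workflow("broken", [Step("normalize", "normalized"), Step("plot", "figure")])
unexported = Workflow("unexported", [Step("not_exported", "result")])

statuses = {wf.name: validate(wf, exported, impls, inputs)
            for wf in (good, broken, unexported)}
```

In this sketch the `broken` workflow passes the static gate because it only calls known functions, yet it fails when run, which is the gap that a structure-only check cannot detect.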

Abstract

Bioconductor hosts more than 2,200 packages for statistical genomics, yet constructing correct, executable analysis workflows from them remains labor-intensive because no structured, step-level knowledge resource exists at ecosystem scale. We present BioMate-KB, a curated knowledge base of 15,641 workflow steps extracted from ~2,241 Bioconductor 3.20 packages, in which workflows are first structure-validated against NAMESPACE-exported function APIs and then validated by real execution — running each workflow end-to-end in dependency-complete environments on synthesized realistic inputs and asserting that declared outputs are produced. Steps are annotated with EDAM ontology, linked to BioContainers images, and enriched with software DOIs and vignette-source provenance; a dual-agent LLM review confirms 90.0% step correctness (κ = 0.96) across seven domains. The distinguishing contribution is a principled separation between structurally well-formed and actually runnable workflows: a static parse gate proves a workflow is well-formed but not that it runs, and real execution reveals that fewer than half of attempted head workflows complete on first attempt — so "indexed" or "dry-run-validated" substantially overstates runnability. We reframe executability as an empirical, reproducible property recorded as a first-class database field. 732 head workflows are real-execution-validated and enriched with per-step visualization and QC metadata, and BioMate routes natural-language queries only to this validated set. The public top-100 package skill bundle is freely available under CC-BY-4.0.
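
For readers unfamiliar with agreement statistics, the sketch below shows how a κ value between two reviewers can be computed. It assumes the reported κ is Cohen's kappa between the two reviewing agents, which the abstract does not state, and the verdict labels are invented toy data, not results from the study.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy verdicts from two hypothetical reviewers on ten steps.
agent_a = ["correct"] * 8 + ["incorrect"] * 2
agent_b = ["correct"] * 7 + ["incorrect"] * 3
kappa = cohens_kappa(agent_a, agent_b)
```

Kappa discounts agreement that would occur by chance, so two reviewers who agree on 9 of 10 items can still score well below 0.9 when most labels fall in one class.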
