What question did this study set out to answer?

The study aims to explore and identify the diversity of DNA-based papillomaviruses using extensive sequencing data.

April 22, 2026 | Open Access

Petabase-scale Papillomavirus Discovery

Key Points

The study aims to explore and identify the diversity of DNA-based papillomaviruses using extensive sequencing data.
Built on Logan's sequence assembly and compression of 27 million sequencing libraries from the Sequence Read Archive (SRA).
Conducted alignment-based searches to identify known and novel papillomaviruses.
Integrated virus phylogeny, sample geography, and ecological metadata to study hotspots of papillomavirus (PV) diversity.
Re-identified 65% of the 992 known PVs in the NCBI Virus database within a ~10-hour computational search.
Expanded diversity by 34%, discovering 383 novel PV types across 105 host species.
Identified significant hotspots of PV novelty in East Africa and South America.

Abstract

Freely available nucleic acid sequencing databases have accumulated into a vast archive of genetic diversity, in excess of 50 petabase-pairs from tens of millions of experiments. Together, these data create a digital survey of our planet’s ecosystems. However, the richness of biological information contained within these repositories remains largely unexplored, in large part owing to the technical challenges of analyzing petabytes of data. Recently, Logan completed the sequence assembly and compression of 27 million sequencing libraries from the Sequence Read Archive (SRA). Here, we systematically search the SRA to reveal the global diversity of the DNA-based papillomaviruses (PVs). Using alignment-based searches against the Logan assemblage, we independently re-identified 65% of the 992 PVs recorded within the NCBI Virus database, a body of work representing over five decades of PV characterization, within a ~10-hour computational search. In addition, we expand the diversity of PVs by 34%, identifying 383 novel PV types spanning 105 associated host species, including taxa with no previously characterized PVs, such as rhinoceros, voles, and grey foxes. Through integration of virus phylogeny, sample geography, and ecological metadata, we show that novel PV discovery is not directly proportional to sampling effort: there are significant hotspots of PV novelty in East Africa and South America, and undersampled biomes can yield disproportionately more PV biodiversity. These results establish that repositories such as the SRA contain vast, unrealized biological information that is now accessible at scale through advanced computational infrastructure, transforming passive data archives into a living library for virology and biology at large.
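
To give a sense of the scale behind the abstract's headline figures, the following is a minimal back-of-the-envelope sketch in Python. It uses only numbers quoted in the abstract. The function and constant names are illustrative, and two points are assumptions rather than statements from the study: that the ~10-hour search covered all 27 million Logan-assembled libraries, and that 65% is a rounded figure, so the derived count is approximate.

```python
# Figures quoted in the abstract.
KNOWN_PVS = 992               # PVs recorded in the NCBI Virus database
REIDENTIFIED_FRACTION = 0.65  # share re-identified (assumed to be rounded)
LIBRARIES = 27_000_000        # sequencing libraries assembled by Logan
SEARCH_HOURS = 10             # approximate search time ("~10-hour")


def reidentified_count(known: int, fraction: float) -> int:
    """Approximate number of known PVs recovered by the search."""
    return round(known * fraction)


def libraries_per_second(libraries: int, hours: float) -> float:
    """Average search throughput, assuming every library was searched."""
    return libraries / (hours * 3600)


if __name__ == "__main__":
    print(f"Re-identified PVs (approx.): {reidentified_count(KNOWN_PVS, REIDENTIFIED_FRACTION)}")
    print(f"Libraries searched per second (approx.): {libraries_per_second(LIBRARIES, SEARCH_HOURS):.0f}")
```

Under these assumptions, roughly 645 of the 992 known PVs were recovered, at an average rate of about 750 libraries per second.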
