Freely available nucleic acid sequencing databases have accumulated into a vast archive of genetic diversity, in excess of 50 petabase-pairs from tens of millions of experiments. Together, these data create a digital survey of our planet’s ecosystems. However, the richness of biological information contained within these repositories remains largely unexplored, in large part owing to the technical challenges of analyzing petabytes of data. Recently, Logan completed the sequence assembly and compression of 27 million sequencing libraries from the Sequence Read Archive (SRA). Here, we systematically search the SRA to reveal the global diversity of the DNA-based papillomaviruses (PVs). Using alignment-based searches against the Logan assemblage, we independently re-identified 65% of the 992 PVs recorded within the NCBI Virus database, a body of work representing over five decades of PV characterization, within a ~10-hour computational search. In addition, we expand the diversity of PVs by 34%, identifying 383 novel PV types spanning 105 associated host species, including taxa with no previously characterized PVs, such as rhinoceros, voles, and grey foxes. Through integration of virus phylogeny, sample geography, and ecological metadata, we show that novel PV discovery is not directly proportional to sampling effort, that there are significant hotspots of PV novelty in East Africa and South America, and that undersampled biomes can yield disproportionately more PV biodiversity. These results establish that repositories such as the SRA contain vast, unrealized biological information that is now accessible at scale through advanced computational infrastructure, transforming passive data archives into a living library for virology and for biology at large.
Shen et al. (Mon,) studied this question.