Abstract
Over the past decade, central sequence repositories have expanded significantly in size. This vast accumulation of data holds value and enables further studies, provided that the data entries are well annotated. However, the submitter-provided metadata of sequencing records can be of heterogeneous quality, presenting significant challenges for re-use. Here, we test to what extent large language models (LLMs) can be used, without fine-tuning, to cost-effectively automate the re-annotation of sequencing records against a simplified classification scheme of broad ecological environments relevant to microbiome studies. This effort directly contributes to improving the FAIRness (Findability, Accessibility, Interoperability, and Reusability) of microbiome sequencing metadata, thereby enhancing their “AI readiness” for downstream computational analyses. We focused on sequencing samples taken from the environment, for which metadata are important. We employed OpenAI Generative Pre-trained Transformer (GPT) models and assessed scalability, time- and cost-effectiveness, and performance against a diverse, hand-curated benchmark of 1,000 examples that span a wide range of complexity in metadata interpretation. Annotation performance markedly exceeded that of a manually curated, non-ML, keyword-based baseline. Changing models (or model parameters) has only minor effects on performance, but prompts need to be carefully designed to match the task. Furthermore, when we compared proprietary OpenAI models with open-weight alternatives (e.g., Qwen, meta-Llama, and microsoft-phi-4), we found comparable accuracy for both biome and sub-biome classification, indicating that open-weight architectures can match the performance of proprietary models for large-scale ecological metadata re-annotation.
After validating the pipeline on the 1,000 hand-curated samples, we applied the optimized version to 2 million environmental sequencing records, providing coarse-grained yet standardized sample origin annotations covering the globe. Our work demonstrates the effective use of LLMs to simplify and standardize annotation from complex biological metadata.
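To make the comparison concrete, the following is a minimal sketch of what a non-ML keyword-based baseline and its scoring against a hand-curated benchmark might look like. It is not the authors' implementation: the biome labels, the keyword lists, and the function names `classify_by_keywords` and `accuracy` are all hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical biome labels and keyword lists (assumptions for illustration;
# the classification scheme and vocabulary used in the study are not given here).
KEYWORDS = {
    "water": ("seawater", "marine", "lake", "river", "ocean"),
    "soil": ("soil", "rhizosphere", "sediment", "permafrost"),
    "animal": ("gut", "feces", "stool", "skin", "oral"),
    "plant": ("leaf", "root", "phyllosphere", "seed"),
}


def classify_by_keywords(metadata: str) -> str:
    """Return the biome whose keywords occur most often in the metadata text."""
    text = metadata.lower()
    hits = Counter()
    for biome, words in KEYWORDS.items():
        # Plain substring counting: simple, but blind to context.
        hits[biome] = sum(text.count(word) for word in words)
    biome, count = hits.most_common(1)[0]
    return biome if count > 0 else "unknown"


def accuracy(predictions, gold):
    """Fraction of predictions that equal the hand-curated labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold labels must have equal length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


if __name__ == "__main__":
    # Toy records standing in for submitter-provided metadata.
    records = [
        "surface seawater collected from the North Atlantic",
        "human stool sample, healthy adult",
        "agricultural soil, maize field",
        "coral reef",
    ]
    gold = ["water", "animal", "soil", "water"]
    predictions = [classify_by_keywords(r) for r in records]
    print(predictions)
    print(accuracy(predictions, gold))
```

In this toy setup the record "coral reef" is labeled "animal", because "coral" contains the substring "oral". Context-blind errors of this kind are one plausible reason a keyword baseline could fall behind an LLM on free-text metadata, although the sketch itself says nothing about the performance gap reported in the study.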
Gaio et al. (Thu,) studied this question.