July 31, 2025 · Open Access

Can large language models reliably extract human disease genes from full-text scientific literature?

Key Points

MAIN FINDING: A zero-shot prompted large language model extracted gene-disease-phenotype information from full-text literature with 88.8% overall accuracy on a congenital heart disease (CHD) gene benchmark.
KEY EVIDENCE: GPT-4o achieved 100% accuracy in gene name extraction, 78.3% in the disease field, and 76.7% in the phenotype field.
APPROACH: The study benchmarked three zero-shot prompted LLMs (GPT-4, DeepSeek and Claude), without task-specific fine-tuning, against entries in the curated CHDgene database.
SIGNIFICANCE: Commercially available LLMs offer a scalable way to keep human disease genetic databases up to date, easing researchers' manual curation workload.

Abstract

Manual extraction of high-fidelity gene-disease-phenotype information from human genetics literature is a labor-intensive task that requires trained human genetics researchers to read through many primary research papers. This presents a major challenge for maintaining up-to-date human disease genetic databases. Recent exploration of large language models (LLMs) opens new directions for automating this manual process. However, most approaches depend on pre-training, fine-tuning, or specialized generative artificial intelligence (GenAI) tools, and there is a lack of empirical evidence on whether commercially available LLMs can be used directly to reliably extract gene-disease-phenotype information for human genetic diseases. Herein, we benchmark three zero-shot prompted LLMs, namely GPT-4, DeepSeek and Claude, without task-specific fine-tuning, on extracting human genetic information directly from the full text of scientific papers. Using known congenital heart disease (CHD) genes from the open-access CHDgene database (https://chdgene.victorchang.edu.au/) as the benchmark data set, GPT-4o achieved 88.8% overall extraction accuracy across 23 gene entries containing over 57 references, with 100% accuracy in the gene name field and 78.3% and 76.7% in the disease and phenotype fields, respectively. This work introduces a lightweight, easy-to-deploy, yet robust LLM-based agent named GeneAgent, analyzes sources of disagreement, and highlights the feasibility of integrating powerful LLMs into genetic evidence synthesis workflows.
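
The abstract reports accuracy per field (gene name, disease, phenotype) alongside an overall figure, but does not say how matches were scored. As a rough illustration only, the sketch below assumes exact-match scoring after light normalization and an overall score pooled across all fields. The function names, the record layout, and the two toy records are hypothetical and are not taken from the study or from CHDgene.

```python
# Illustrative sketch: field-level exact-match accuracy for extracted records.
# The scoring rule, field names, and toy data are assumptions, not the study's method.

FIELDS = ("gene", "disease", "phenotype")


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(value.lower().split())


def field_accuracy(extracted: list[dict], curated: list[dict]) -> dict:
    """Return exact-match accuracy per field plus a pooled 'overall' score."""
    if not curated or len(extracted) != len(curated):
        raise ValueError("extracted and curated must be non-empty and the same length")

    correct = {field: 0 for field in FIELDS}
    for pred, gold in zip(extracted, curated):
        for field in FIELDS:
            # A missing field in the model output counts as a miss.
            if normalize(pred.get(field, "")) == normalize(gold[field]):
                correct[field] += 1

    n = len(curated)
    scores = {field: correct[field] / n for field in FIELDS}
    # Assumed definition: all field-level decisions pooled together.
    scores["overall"] = sum(correct.values()) / (n * len(FIELDS))
    return scores


# Toy records invented for illustration.
curated = [
    {"gene": "GATA4", "disease": "congenital heart disease", "phenotype": "atrial septal defect"},
    {"gene": "NKX2-5", "disease": "congenital heart disease", "phenotype": "tetralogy of Fallot"},
]
extracted = [
    {"gene": "GATA4", "disease": "Congenital  heart disease", "phenotype": "atrial septal defect"},
    {"gene": "NKX2-5", "disease": "congenital heart disease", "phenotype": "ventricular septal defect"},
]

scores = field_accuracy(extracted, curated)
print(scores)
```

Exact matching is strict for free-text fields such as disease and phenotype, where synonyms are common, so a real evaluation might rely on expert review or ontology-based matching instead; the overall figure reported in the abstract may also be defined differently from the pooled score assumed here.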
