June 3, 2024Open Access

Fine-tuning large language models for rare disease concept normalization

Key Points

Key points are not available for this paper at this time.

Abstract

Abstract Objective We aim to develop a novel method for rare disease concept normalization by fine-tuning Llama 2, an open-source large language model (LLM), using a domain-specific corpus sourced from the Human Phenotype Ontology (HPO). Methods We developed an in-house template-based script to generate two corpora for fine-tuning. The first (NAME) contains standardized HPO names, sourced from the HPO vocabularies, along with their corresponding identifiers. The second (NAME+SYN) includes HPO names and half of the concept’s synonyms as well as identifiers. Subsequently, we fine-tuned Llama 2 (Llama2-7B) for each sentence set and conducted an evaluation using a range of sentence prompts and various phenotype terms. Results When the phenotype terms for normalization were included in the fine-tuning corpora, both models demonstrated nearly perfect performance, averaging over 99% accuracy. In comparison, ChatGPT-3.5 has only ∼20% accuracy in identifying HPO IDs for phenotype terms. When single-character typos were introduced in the phenotype terms, the accuracy of NAME and NAME+SYN is 10.2% and 36.1%, respectively, but increases to 61.8% (NAME+SYN) with additional typo-specific fine-tuning. For terms sourced from HPO vocabularies as unseen synonyms, the NAME model achieved 11.2% accuracy, while the NAME+SYN model achieved 92.7% accuracy. Conclusion Our fine-tuned models demonstrate ability to normalize phenotype terms unseen in the fine-tuning corpus, including misspellings, synonyms, terms from other ontologies, and laymen’s terms. Our approach provides a solution for the use of LLMs to identify named medical entities from clinical narratives, while successfully normalizing them to standard concepts in a controlled vocabulary.

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Discussion

Authors

Andy Wang

Peddie School

Cong Liu

Oregon Health & Science University

Jingye Yang

Hebei University of Engineering

Journals

Journal of the American Medical Informatics Association

Actions

Institutions

University of Pennsylvania

Columbia University

Peddie School

References and Citations

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Fine-tuning large language models for rare disease concept normalization

Key Points

Abstract

Citation Network

Connected Papers

Discussion

Authors

Journals

Actions

Institutions

References and Citations

Citation Network

Connected Papers

Discussion

Cite this study

Also consider

Also consider