February 14, 2024 · Open Access

An Entity Extraction Pipeline for Medical Text Records Using Large Language Models: Analytical Study

Abstract

Background: The study of disease progression relies on clinical data, including text data, and extracting valuable features from text data has been a research hot spot. With the rise of large language models (LLMs), semantic-based extraction pipelines are gaining acceptance in clinical research. However, the security and feature hallucination issues of LLMs require further attention.

Objective: This study aimed to introduce a novel modular LLM pipeline that semantically extracts features from textual patient admission records.

Methods: The pipeline was designed as a systematic succession of concept extraction, aggregation, question generation, corpus extraction, and question-and-answer scale extraction, and was tested with 2 low-parameter LLMs: Qwen-14B-Chat (QWEN) and Baichuan2-13B-Chat (BAICHUAN). A data set of 25,709 pregnancy cases from the People’s Hospital of Guangxi Zhuang Autonomous Region, China, was used for evaluation, with annotations provided by a local expert. The pipeline was evaluated on accuracy, precision, null ratio, and time consumption. Additionally, its performance was evaluated with a quantized version of Qwen-14B-Chat on a consumer-grade GPU.

Results: The pipeline demonstrated a high level of precision in feature extraction, as evidenced by the accuracy and precision results of Qwen-14B-Chat (95.52% and 92.93%, respectively) and Baichuan2-13B-Chat (95.86% and 90.08%, respectively). Furthermore, the pipeline exhibited low null ratios and variable time consumption. The INT4-quantized version of QWEN delivered enhanced performance, with 97.28% accuracy and a 0% null ratio.

Conclusions: The pipeline exhibited consistent performance across different LLMs and efficiently extracted clinical features from textual data. It also showed reliable performance on consumer-grade hardware. This approach offers a viable and effective solution for mining clinical research data from textual records.
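The five stages named in the Methods can be pictured as a chain of small functions around a single model call. The following Python sketch is illustrative only and is not the authors' implementation: the `LLM` callable, the prompts, the keyword-based corpus filter, and the `null_ratio` definition (share of feature cells left unanswered) are all assumptions made for this example, since the abstract does not specify them.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical interface: any function that maps a prompt to a text reply.
LLM = Callable[[str], str]


def extract_concepts(llm: LLM, record: str) -> List[str]:
    """Stage 1 (concept extraction): ask the model for concepts in one record."""
    reply = llm(f"List the clinical concepts in this record, one per line:\n{record}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def aggregate(concepts_per_record: List[List[str]]) -> List[str]:
    """Stage 2 (aggregation): merge concepts, dropping case-insensitive duplicates."""
    seen: Dict[str, str] = {}
    for concepts in concepts_per_record:
        for concept in concepts:
            seen.setdefault(concept.lower(), concept)
    return list(seen.values())


def generate_question(llm: LLM, concept: str) -> str:
    """Stage 3 (question generation): one question per aggregated concept."""
    return llm(f"Write one question that asks for the value of: {concept}").strip()


def extract_corpus(record: str, concept: str) -> str:
    """Stage 4 (corpus extraction): keep sentences mentioning the concept.

    A plain keyword filter stands in here for whatever the paper actually uses.
    """
    sentences = [s.strip() for s in record.split(".") if s.strip()]
    return ". ".join(s for s in sentences if concept.lower() in s.lower())


def answer(llm: LLM, question: str, corpus: str) -> Optional[str]:
    """Stage 5 (question-and-answer extraction): None marks a null result."""
    if not corpus:
        return None
    reply = llm(f"Context: {corpus}\nQuestion: {question}\nAnswer briefly, or say NULL:").strip()
    return None if not reply or reply.upper() == "NULL" else reply


def run_pipeline(llm: LLM, records: List[str]) -> List[Dict[str, Optional[str]]]:
    """Run all five stages and return one feature row per record."""
    concepts = aggregate([extract_concepts(llm, r) for r in records])
    questions = {c: generate_question(llm, c) for c in concepts}
    return [
        {c: answer(llm, questions[c], extract_corpus(r, c)) for c in concepts}
        for r in records
    ]


def null_ratio(rows: List[Dict[str, Optional[str]]]) -> float:
    """Assumed definition: fraction of feature cells with no extracted value."""
    cells = [value for row in rows for value in row.values()]
    return sum(value is None for value in cells) / len(cells) if cells else 0.0
```

Keeping each stage behind the same `LLM` callable is one way to make the model swappable, which mirrors the abstract's comparison of QWEN, BAICHUAN, and a quantized QWEN on the same pipeline.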

Authors

Lei Wang

Yinyao Ma

Wenshuai Bi

Journals

Journal of Medical Internet Research

Institutions

BGI Group (China)

Center for Life Sciences

The People's Hospital of Guangxi Zhuang Autonomous Region
