September 10, 2025 | Open Access

Using Large Language Models to Extract Structured Data from Health Coaching Dialogues: A Comparative Study of Code Generation Versus Direct Information Extraction

Key Points

Direct LLM extraction accuracy ranged from 90% (Llama 2 70B) to 100% (ChatGPT 3.5 Turbo).
LLM-generated pattern-matching functions were far faster, averaging 10 milliseconds per item versus 900 ms (ChatGPT 3.5 Turbo) to 5 s (Llama 2 70B) for direct extraction, and reached 99% accuracy on real data.
Both approaches were developed on a real data set of annotated human coaching dialogues, and accuracy was measured on real and synthetically generated examples.
The findings point toward future research on hybrid systems that reserve the LLM for statements that pattern matching cannot handle.

Abstract

Background: Virtual coaching can help people adopt new healthful behaviors by encouraging them to set specific goals and helping them review their progress. One challenge in creating such systems is analyzing clients’ statements about their activities. Limiting people to selecting among predefined answers detracts from the naturalness of conversations and from user engagement. Large Language Models (LLMs) offer the promise of covering a wide range of expressions. However, an LLM used for simple entity extraction does not necessarily perform better than functions coded in a programming language, and it incurs higher long-term costs.

Methods: This study uses a real data set of annotated human coaching dialogues to develop LLM-based models for two scenarios: one that generates pattern-matching functions and one that performs direct extraction. We use models of different sizes and complexity, including Meta-Llama, Gemma, and ChatGPT, and measure their speed and accuracy.

Results: LLM-generated pattern-matching functions took an average of 10 milliseconds (ms) per item, compared with 900 ms (ChatGPT 3.5 Turbo) to 5 s (Llama 2 70B) for direct LLM extraction. The accuracy of pattern matching was 99% on real data, while LLM accuracy ranged from 90% (Llama 2 70B) to 100% (ChatGPT 3.5 Turbo) on both real examples and synthetically generated examples created for fine-tuning.

Conclusions: These findings suggest promising directions for future research that combines both methods (reserving the LLM for cases that cannot be matched directly) or that uses LLMs to generate synthetic training data with more expressive variety, which can be used to improve the coverage of either generated code or fine-tuned models.
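The combined approach mentioned in the conclusions (pattern matching first, with the LLM reserved for unmatched statements) might look roughly like the sketch below. This is an illustration under stated assumptions, not the study's implementation: the regular expression, the unit list, the function names `extract_with_pattern` and `extract`, and the `llm_fallback` callable are all hypothetical, and the abstract does not specify the actual patterns or prompts used.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical pattern: a number (optionally with thousands separators)
# followed by an activity unit. The paper's generated patterns are not
# given in the abstract; this one is only for illustration.
_QUANTITY = re.compile(
    r"(?P<value>\d[\d,]*(?:\.\d+)?)\s*(?P<unit>steps|minutes|mins|miles|km)\b",
    re.IGNORECASE,
)


def extract_with_pattern(text: str) -> Optional[dict]:
    """Fast path: return a value/unit pair, or None if nothing matches."""
    match = _QUANTITY.search(text)
    if match is None:
        return None
    value = float(match.group("value").replace(",", ""))
    return {"value": value, "unit": match.group("unit").lower()}


def extract(
    text: str, llm_fallback: Callable[[str], Optional[dict]]
) -> Tuple[Optional[dict], str]:
    """Try the pattern first; call the (slower) LLM only when it fails.

    `llm_fallback` is an assumed callable wrapping whatever model is used.
    The second element of the result records which route produced it.
    """
    result = extract_with_pattern(text)
    if result is not None:
        return result, "pattern"
    return llm_fallback(text), "llm"
```

With a design like this, a statement such as "I walked 8,000 steps today" would be handled by the pattern alone, while a statement with no explicit number ("about half an hour of yoga") would fall through to the LLM, so the slower per-item latency is paid only for the cases that need it.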
