What question did this study set out to answer?

This research aims to develop a transcription system that converts unstructured medical conversations into structured formats, improving LLM training.

June 10, 2026 · Open Access

From raw audio to structure: an agent-based pipeline that boosts medical LLM performance

Key Points

Developed an agent-based transcription framework with Planner, Memory, and Executor modules.
Applied the system to 7197 minutes of Chinese clinical recordings and 240 minutes of English-language dialogues.
Performed controlled comparisons against a cascaded deep-learning pipeline, a sequential non-agent execution, and an end-to-end large-context model, and conducted architectural ablation studies.
Achieved high reconstruction accuracy: 94.7% denoising, 96.9% content correction, 88.6% speaker identification, 92.7% segmentation.
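Fine-tuning on agent-generated structured conversation transcriptions (SCT) improved overall quality scores from 3.1 to 3.7 (P < 0.001).
On HealthBench, SCT fine-tuning also scored higher than both fine-tuning on raw unstructured conversation transcriptions (RUCT) and the non-fine-tuned baseline.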

Abstract

Large language models (LLMs) are increasingly applied in clinical communication, yet their reliability depends on high-quality conversational corpora. Real-world doctor–patient recordings are frequently degraded by noise, transcription errors, speaker overlap, and fragmented dialogue structure, limiting their usability for downstream model training. Here, we present an agent-based transcription framework that autonomously converts raw unstructured conversation transcriptions (RUCT) into structured conversation transcriptions (SCT) suitable for LLM fine-tuning. The system integrates three coordinated modules—Planner, Memory, and Executor—to orchestrate noise removal, content correction, speaker identification, and dialogue segmentation within a self-correcting workflow. Applied to 7197 minutes of Chinese clinical recordings across eight departments, with an additional 240 minutes of English-language dialogues used as a limited portability check, the agent achieved high reconstruction accuracy (94.7% denoising, 96.9% content correction, 88.6% speaker identification, 92.7% segmentation) and operated 3.6× faster than manual processing. In controlled comparisons against a cascaded deep-learning pipeline, a sequential non-agent execution, and an end-to-end large-context model, the agent achieved consistently higher performance across all four processing tasks. Architectural ablation further revealed marked degradation when Planner or Memory modules were removed (e.g., up to 47.6% reduction in speaker identification), supporting the contribution of coordinated task decomposition and cross-step state retention. To assess downstream impact, we fine-tuned an independent open-weight model (Qwen3-32B) on agent-generated SCT versus RUCT derived from an identical training set. 
Agent-generated SCT fine-tuning significantly improved overall quality scores (3.1 to 3.7; P < 0.001; Fleiss’ κ = 0.82) in blinded expert evaluation across six clinically grounded dimensions, and also yielded higher scores on an external medical dialogue benchmark (HealthBench) than both RUCT fine-tuning and the non-fine-tuned baseline. These findings indicate that agent-structured clinical corpora enhance LLM fine-tuning performance and provide a scalable framework for reliable medical conversational AI development.
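
To make the Planner, Memory, and Executor division of labor concrete, here is a minimal Python sketch of how such a loop could be wired together. It is not the authors' implementation: the class layouts, the `[noise]` marker, the `FIXES` correction table, and the question-word speaker heuristic are all hypothetical stand-ins for the model-driven steps the abstract describes, and the self-correcting behavior is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Cross-step state: which steps have run and simple per-step statistics."""
    done: list = field(default_factory=list)
    stats: dict = field(default_factory=dict)


class Planner:
    """Picks the next unfinished task among the four named in the abstract."""
    STEPS = ("denoise", "correct", "identify_speakers", "segment")

    def next_step(self, memory):
        return next((s for s in self.STEPS if s not in memory.done), None)


class Executor:
    """Toy rule-based executors (assumptions, not the paper's methods)."""
    FIXES = {"feaver": "fever"}                     # hypothetical ASR error table
    QUESTION_WORDS = ("how", "any", "what", "do")   # toy cue for doctor turns

    def denoise(self, data, memory):
        kept = [u for u in data if u != "[noise]"]
        memory.stats["removed"] = len(data) - len(kept)
        return kept

    def correct(self, data, memory):
        return [" ".join(self.FIXES.get(w, w) for w in u.split()) for u in data]

    def identify_speakers(self, data, memory):
        labeled = []
        for u in data:
            words = u.split()
            is_doctor = bool(words) and words[0] in self.QUESTION_WORDS
            labeled.append(("doctor" if is_doctor else "patient", u))
        return labeled

    def segment(self, data, memory):
        # Start a new segment at every doctor utterance.
        segments = []
        for role, u in data:
            if role == "doctor" or not segments:
                segments.append([])
            segments[-1].append((role, u))
        return segments


def run_agent(raw):
    """Run planner-selected steps until none remain; return output and memory."""
    planner, memory, executor = Planner(), Memory(), Executor()
    data = list(raw)
    while (step := planner.next_step(memory)) is not None:
        data = getattr(executor, step)(data, memory)
        memory.done.append(step)
    return data, memory
```

Calling `run_agent` on a short list of raw utterance strings returns the segmented, speaker-labeled turns together with the `Memory` object, which records the order in which the steps were executed. In the sketch, the Planner reads only from Memory, so step ordering and state retention stay separate from task execution, mirroring the roles the ablation results attribute to those modules.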
