What question did this study set out to answer?

This review aims to evaluate the role of large language models in respiratory care workflows and their safety implications.

May 30, 2026 · Open Access

Large Language Models in Respiratory Care: Safety and Clinical Integration

Key Points

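The authors conducted a structured narrative review across three domains: bedside clinical decision support and triage, pulmonary function test (PFT) interpretation, and chest X-ray (CXR) reporting.
The review included quantitative studies comparing LLMs or specialised AI systems with clinician performance, extracting accuracy, agreement, and safety-critical failure modes.
Performance was assessed against expert diagnosis at the bedside, by reliability of PFT interpretation, and by error rates in CXR reporting.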
LLMs performed well in straightforward bedside diagnoses but struggled with complex management tasks.
Guideline-anchored LLMs showed high agreement in PFTs, while generalist models displayed only moderate reliability.
Chest-specific multimodal systems outperformed generalist LLMs in CXR reporting but discrepancies remained without radiologist input.

Abstract

Background: Large language models (LLMs) are rapidly entering respiratory medicine workflows, but their clinical role remains unclear. A central concern is whether they function as autonomous decision makers or as tools that depend on clinician input.

Methods: We conducted a structured narrative review across three domains aligned with respiratory practice: bedside clinical decision support and triage, pulmonary function test (PFT)/spirometry interpretation, and chest X-ray (CXR) reporting. We included quantitative studies comparing LLMs or specialised AI systems with clinician performance, and extracted accuracy, agreement, and safety-critical failure modes.

Results: At the bedside, LLMs approached expert diagnosis in straightforward cases, but performance fell in highly complex or multi-step management tasks. The dominant error was omission of required actions, resulting in unsafe under-triage or incomplete plans. In PFTs, guideline-anchored and spirogram-trained LLMs achieved high agreement and reduced inter-observer variability on structured inputs. In contrast, generalist LLMs showed only moderate reliability and inconsistent handling of borderline or mixed patterns. In CXR reporting, chest-specific multimodal systems outperformed generalist LLMs, with fewer hallucinations and better clinical acceptability, yet clinically relevant discrepancies persisted without radiologist review.

Conclusions: Across domains, AI does not replace respiratory expertise. When inputs are complete, structured, and guideline-concordant, outputs stabilise and add value. When inputs are ambiguous or incomplete, errors expand, often through dangerous omissions. Safe deployment therefore requires specialised training, strict input structure, and continuous clinician oversight.
