Background: Large language models (LLMs) are rapidly entering respiratory medicine workflows, but their clinical role remains unclear. A central concern is whether they function as autonomous decision makers or as tools that depend on clinician input.

Methods: We conducted a structured narrative review across three domains aligned with respiratory practice: bedside clinical decision support and triage, pulmonary function test (PFT)/spirometry interpretation, and chest X-ray (CXR) reporting. We included quantitative studies comparing LLMs or specialised AI systems with clinician performance, and extracted accuracy, agreement, and safety-critical failure modes.

Results: At the bedside, LLMs approached expert diagnosis in straightforward cases, but performance fell in highly complex or multi-step management tasks. The dominant error was omission of required actions, resulting in unsafe under-triage or incomplete plans. In PFTs, guideline-anchored and spirogram-trained LLMs achieved high agreement and reduced inter-observer variability on structured inputs. In contrast, generalist LLMs showed only moderate reliability and inconsistent handling of borderline or mixed patterns. In CXR reporting, chest-specific multimodal systems outperformed generalist LLMs, with fewer hallucinations and better clinical acceptability, yet clinically relevant discrepancies persisted without radiologist review.

Conclusions: Across domains, AI does not replace respiratory expertise. When inputs are complete, structured, and guideline-concordant, outputs stabilise and add value. When inputs are ambiguous or incomplete, errors multiply, often through dangerous omissions. Safe deployment therefore requires specialised training, strict input structure, and continuous clinician oversight.
Porcella et al. (Thu,) studied this question.