BACKGROUND: Large language models (LLMs) are promising tools for patient education, but fixed knowledge cutoffs and hallucination risk limit their clinical utility. Current retrieval-augmented generation (RAG) approaches do not distinguish stable clinical knowledge from evolving recommendations.
METHODS: We developed and evaluated bRAGgen, a temporally anchored RAG framework for metabolic and bariatric surgery (MBS) patient education. It combines five modules to enforce clinical protocols: a semantic knowledge cache, multi-source evidence retrieval with graph-based fusion, uncertainty-aware generation, clinical constraint reranking, and Temporal Fisher Anchoring with Mechanism Selectivity (TFAMS) for adaptive inference. The framework was evaluated on 105 expert-curated free-response questions. A multinational panel of seven specialists (5 surgeons, 2 dietitians) from five countries rated responses on a 5-point Likert scale for factuality, clinical relevance, and comprehensiveness. LLM-as-Judge evaluation using ChatGPT-4o provided a complementary automated assessment.
RESULTS: bRAGgen significantly improved response quality across all five base language models tested (p < 0.001), with large effect sizes for higher-capacity models (Cohen's d = 0.96-1.01) and moderate effects for smaller models (Cohen's d = 0.38-0.56). Inter-rater reliability was good (Krippendorff's α = 0.72). The largest gains occurred in safety-critical categories, including Risks and Complications (+1.84 points) and Mental and Emotional Health (+1.84 points), suggesting the framework is most impactful where nuanced clinical judgment is essential. The LLM-as-Judge ratings showed high concordance with expert ratings (Spearman's ρ = 0.94).
CONCLUSIONS: This proof-of-concept study suggests that a multi-module RAG framework with temporal stability anchoring can improve expert-rated LLM response quality for bariatric surgery domain knowledge, though prospective validation in patient-facing settings is needed before clinical implementation.
Atri et al. (Wed,) studied this question.