Introduction: Automated documentation tools are being rapidly adopted in healthcare and clinical workflows. Among these are AI-enabled ambient scribing products, which transcribe conversations between patients and healthcare providers, then produce clinical records using automatic speech recognition (ASR) and generative AI such as large language models (LLMs). While research suggests these technologies can reduce clinical burden, safe and responsible deployment requires that these tools determine what captured information is appropriate to record and under which circumstances. This is a contextual privacy challenge that is distinct from PII leakage or data memorization, and it remains largely untested.

Methods: We address this gap by operationalizing privacy leakage as the inappropriate inclusion of third-party personal information in LLM-generated clinical notes. We construct a benchmark of transcripts containing private information, paired with gold-standard clinical notes, by enriching patient metadata from the aci-bench corpus and injecting third-party personal information across six relationship types and seven information topics. We evaluate the open-weight LLaMA 3.1 8B and 70B and Mixtral 8×7B and 8×22B models, and the proprietary Claude 3.5 Haiku and Sonnet models, on note generation using prompts with varied privacy and structural requirements.

Results: All examined models leaked third-party information. Privacy instructions helped reduce leakage but proved neither complete nor robust as a solution. Models could generate privacy-infringing notes despite correctly identifying such information as inappropriate to share. Decomposing generation and privacy editing into separate steps could further reduce leakage, but only when privacy was defined with contextual specificity.

Discussion: No single mitigation eliminated leakage entirely, but combining approaches yielded the greatest reductions. The results emphasize the need to build privacy-by-design systems and to develop evaluation strategies that reflect emerging information synthesis and sharing practices.
Chim et al. studied this question.