Abstract

Background
The rapid adoption of ChatGPT in healthcare has generated an extensive literature. However, systematic analysis of this emerging field with artificial intelligence (AI)-powered tools remains challenging because of the volume and diversity of publications and the risk of AI-generated hallucinations that compromise factual accuracy.

Objective
We developed ChatCM-RAG, a deep learning pipeline that integrates BERTopic with transformer-based retrieval-augmented generation (RAG) to analyse ChatGPT applications in the medicine literature.

Methods
We processed 904 peer-reviewed articles (2022–2025) with a multi-stage pipeline: BERTopic for topic modelling, using UMAP dimensionality reduction and HDBSCAN clustering; Facebook AI Similarity Search (FAISS) for semantic retrieval; and transformer models (T5/GPT) for answer generation. The system was evaluated on representative medical queries for retrieval accuracy, generation quality, and system efficiency.

Results
ChatCM-RAG identified four distinct topic clusters: general medical AI applications (46.2%), performance evaluation (13.9%), clinical applications and patient care (12.6%), and chatbot implementations (6.0%), with 21.2% of documents left unclustered. To reduce hallucination and ensure citation authenticity, the generation module is constrained to a curated corpus of 904 PubMed-indexed documents and only permits PMID citations that exist in the retrieval set. In a pilot evaluation on eight representative medical queries, we observed 0 fabricated PMIDs and 0 non-existent citations in the generated answers. The system achieved an average response time of 1.73 s and an answer quality score of 0.81, and it demonstrated topic-aware retrieval with a relevance score of 0.90, with 73% of retrieved documents originating from topically appropriate clusters. The ChatCM-RAG model has been open-sourced at https://huggingface.co/fc28/ChatCM-RAG.
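The citation constraint described above can be illustrated with a minimal sketch. It assumes that generated answers cite sources in the form `PMID: 12345678`; that format, the function name `filter_citations`, and the sample identifiers are hypothetical and are not taken from the released code.

```python
import re

# Assumed citation format "PMID: 12345678" (hypothetical, not from the released code).
PMID_PATTERN = re.compile(r"PMID:\s*(\d+)")


def filter_citations(answer, retrieved_pmids):
    """Split the PMIDs cited in an answer into allowed and rejected lists.

    A PMID is allowed only if it appears in the set of retrieved documents.
    """
    cited = PMID_PATTERN.findall(answer)
    allowed = [p for p in cited if p in retrieved_pmids]
    rejected = [p for p in cited if p not in retrieved_pmids]
    return allowed, rejected


# Illustrative identifiers only; these are not real corpus entries.
retrieved = {"11111111", "22222222"}
answer = (
    "ChatGPT passed the exam (PMID: 11111111), "
    "as confirmed elsewhere (PMID: 99999999)."
)
allowed, rejected = filter_citations(answer, retrieved)
print("allowed:", allowed)
print("rejected:", rejected)
```

Under this assumption, any identifier in `rejected` would be treated as a fabricated citation and could be stripped or used to trigger regeneration of the answer.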
Additional reproducible analyses were performed on the released dataset (n = 904) and code to quantify topic interpretability and retrieval robustness without requiring external LLM calls. Top TF-IDF terms were identified for each cluster. In an 80/20 held-out title-query evaluation, topic-filtered lexical retrieval (TF-IDF within the predicted cluster) improved Cluster-agreement@5 from 0.599 ± 0.309 (global TF-IDF) to 0.862 ± 0.345.

Conclusions
ChatCM-RAG effectively synthesises large-scale medical literature, revealing that ChatGPT applications in medicine are dominated by exploratory studies, with an emerging focus on clinical decision support. The open-source pipeline provides researchers with powerful tools for understanding AI integration in traditional medicine.
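The comparison between global and topic-filtered TF-IDF retrieval reported in the Results can be sketched as follows. This is an illustration under stated assumptions, not the released evaluation code: Cluster-agreement@k is assumed to be the fraction of the top-k retrieved documents that share the query's cluster label, the query's cluster is taken as given rather than predicted, and the four-document corpus, labels, and function names are invented for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return a sparse TF-IDF vector (dict) per tokenised document, plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vecs, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, vecs, idf, k, candidates=None):
    """Rank documents by cosine similarity, optionally restricted to candidate indices."""
    q = {t: tf * idf[t] for t, tf in Counter(query).items() if t in idf}
    pool = range(len(vecs)) if candidates is None else candidates
    return sorted(pool, key=lambda i: cosine(q, vecs[i]), reverse=True)[:k]


def cluster_agreement_at_k(retrieved, labels, query_label):
    """Assumed definition: share of retrieved documents carrying the query's cluster label."""
    if not retrieved:
        return 0.0
    return sum(labels[i] == query_label for i in retrieved) / len(retrieved)


# Toy corpus (hypothetical): cluster 0 = performance evaluation, cluster 1 = patient care.
docs = [
    "chatgpt exam performance accuracy".split(),
    "chatgpt licensing exam score".split(),
    "chatbot patient care triage".split(),
    "patient care chatgpt counselling".split(),
]
labels = [0, 0, 1, 1]
query, query_label, k = "chatgpt exam patient".split(), 0, 2

vecs, idf = tfidf_vectors(docs)
global_hits = retrieve(query, vecs, idf, k)
in_cluster = [i for i, lab in enumerate(labels) if lab == query_label]
filtered_hits = retrieve(query, vecs, idf, k, candidates=in_cluster)

global_score = cluster_agreement_at_k(global_hits, labels, query_label)
filtered_score = cluster_agreement_at_k(filtered_hits, labels, query_label)
print("global:", global_score, "filtered:", filtered_score)
```

On this toy corpus the global search admits one off-topic document into the top two, while restricting the search to the query's cluster removes it. The reported figures come from the 80/20 held-out evaluation on the full dataset, which this sketch does not reproduce.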
Zhang et al. (Mon,) studied this question.