Objective: To evaluate whether large language models (LLMs) can autonomously synthesize existing literature and accurately extract prognostic variables for neonatal intraventricular hemorrhage (IVH) and its outcomes, and to assess their capability for clinical feature ranking and risk stratification.
Study Design: This pilot study combined a systematic literature review with retrieval-augmented generation (RAG). GPT-4 (OpenAI) and Claude Sonnet 4.0 (Anthropic) were prompted to identify peer-reviewed studies that used machine learning or deep learning to predict IVH outcomes in preterm neonates. The models were prompted to extract data following TRIPOD AI guidelines, capturing study design, population characteristics, predictor variables, and outcome measures. Semi-automated RAG extraction was paired with manual validation to mitigate hallucination risk.
Results: The LLMs initially identified 39 studies; after hallucinated references were excluded, 28 met some or all of the validation criteria. From these, 14 distinct prognostic predictors were extracted across four outcome domains: mortality, progression, complications, and resolution. Universal high-impact predictors included gestational age (13 mentions, 41%), birth weight (8 mentions, 25%), and APGAR scores (11 mentions, 34%). Variables were categorized into three clinical tiers based on frequency, outcome breadth, and modifiability. A preliminary risk stratification model demonstrated high-risk neonates (70%, and mortality >50%, while low-risk neonates (>32 weeks, >1500 g, APGAR >5) showed favorable trajectories.
Conclusions: This study demonstrates that LLMs can synthesize medical literature and extract clinically relevant prognostic variables for neonatal IVH outcomes. However, LLM outputs were susceptible to hallucinations and incomplete data synthesis, underscoring the need for rigorous clinical oversight and human validation to ensure reliability. The identified universal predictors provide a foundation for developing AI-assisted clinical decision support tools. Notable research gaps include the complete absence of resolution prediction studies and limited investigation of complication predictors, which point to opportunities for future work in precision neonatology.
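The percentages reported for the three universal predictors in the Results can be reproduced with a short calculation. The sketch below is illustrative only and rests on two assumptions that the abstract does not state: that each percentage is a share of the pooled mentions of these three predictors (13 + 8 + 11 = 32), and that the low-risk profile requires all three thresholds to hold at once. The function and variable names are hypothetical.

```python
# Mention counts as reported in the Results.
mentions = {"gestational age": 13, "birth weight": 8, "APGAR scores": 11}


def mention_shares(counts):
    """Return each predictor's share of the pooled mentions, as a whole percent.

    Assumption: the denominator is the sum of the listed counts (32 here);
    the abstract does not state the denominator explicitly.
    """
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


def is_low_risk(gestational_age_weeks, birth_weight_g, apgar):
    """Check the low-risk profile quoted in the abstract.

    Thresholds (>32 weeks, >1500 g, APGAR >5) come from the text; combining
    them with a logical AND is an assumption of this sketch.
    """
    return gestational_age_weeks > 32 and birth_weight_g > 1500 and apgar > 5


shares = mention_shares(mentions)
print(shares)
print(is_low_risk(34, 1800, 8))
```

Under the pooled-mentions assumption the shares come out to 41%, 25%, and 34%, matching the reported figures. The high-risk thresholds are not recoverable from the text, so no corresponding check is sketched for them.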
Arora et al. (Mon,) studied this question.