The gpt-oss20b and gpt-oss120b large language models demonstrated strong reliability (Cohen's kappa >0.8) in automating oncology clinical trial eligibility screening for explicit criteria.
Do large language models accurately automate clinical trial eligibility screening in oncology?
Advanced LLMs like gpt-oss20b and gpt-oss120b show high concordance for explicit oncology trial eligibility criteria, offering a scalable approach to automate patient-trial matching.
Abstract
Background: Efficient patient-trial matching remains a critical challenge in oncology, complicated by heterogeneous documentation, missing data, and complex eligibility criteria. Large language models (LLMs) offer the potential to automate eligibility screening by interpreting unstructured clinical notes and biomarker data.
Methods: We evaluated six models (llama3.2:3b, llama3.3:70b, medgemma_27b_text_it, deepseek-r1:8b, gpt-oss20b, and gpt-oss120b) for clinical trial eligibility determination across 19 key questions reflecting common eligibility criteria from oncology clinical trials. Data were extracted from the medical records of patients with known trial matches, and the models' binary (yes/no) responses, confidence scores, and reasoning excerpts were analyzed. Concordance between models and interpretability of outputs were assessed.
Results: The gpt-oss20b and gpt-oss120b models showed high agreement on eligibility determinations for well-documented criteria such as measurable disease, ECOG status, age, and tissue availability, with confidence scores commonly above 0.90. Differences emerged for criteria that required inference or where documentation was incomplete; gpt-oss120b showed greater confidence and more nuanced reasoning in ambiguous cases. Both models flagged missing or unclear data, providing reasoning transparency that supports clinical review. Concordance metrics suggested strong reliability (Cohen's kappa > 0.8) for explicit criteria, with potential to significantly reduce manual screening burden. The remaining models generally produced lower-quality responses and could not respond coherently when a structured output format was required.
Conclusions: LLMs can accurately and transparently automate critical components of oncology trial eligibility screening, augmenting manual review processes. Differences in model confidence with uncertain data underscore the need for ongoing refinement and highlight the value of explainable AI in clinical decision support. These findings support integrating LLMs into clinical trial matching workflows to improve trial access and enrollment efficiency.
Impact: Automated, interpretable LLM-based clinical trial matching represents a promising advancement toward precision oncology by scaling patient access to tailored therapies and optimizing trial throughput.
Citation Format: Aakash Desai, Ellen McNeeley, Sanad Alhuski, Maya Khalil, Matthew Might, Rebecca Arend, Andrew Crouse, Mehmet Akce. Evaluation of large language models for automated clinical trial matching in oncology [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 2739.
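For readers unfamiliar with the concordance metric cited in the Results, the following is a minimal sketch of how Cohen's kappa can be computed for two models' yes/no answers. It is an illustration only, not the authors' analysis code: the function name `cohens_kappa`, the variable names, and the ten example answers are all hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters that labelled the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)

    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over labels.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical answers from two models to ten eligibility questions
# (illustrative values only, NOT results from the abstract).
small_model_answers = ["yes"] * 5 + ["no"] * 5
large_model_answers = ["yes"] * 4 + ["no"] * 6

kappa = cohens_kappa(small_model_answers, large_model_answers)
print(round(kappa, 3))
```

In this made-up example the two models agree on 9 of 10 questions (observed agreement 0.9) while chance agreement is 0.5, giving kappa = (0.9 - 0.5) / (1 - 0.5) = 0.8. Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over simple percent agreement when one answer dominates.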