Abstract

Background: Screening for clinical trials is challenging for clinicians due to its time-consuming and repetitive nature. The rise of artificial intelligence (AI) offers an opportunity to improve screening productivity and reproducibility. Pancreatic cancer is characterized by increasing incidence, poor survival outcomes, and an urgent need for improved management strategies.

Objective: This study aimed to assess the performance of AI in evaluating clinical trial inclusion and exclusion criteria, compared to a double-blind human gold standard, using a retrospective cohort.

Methods: In the PANCR-AI (Pancreatic Cancer Retrospective Screening with Artificial Intelligence) pilot study, we retrospectively reviewed cases from our institutional database of patients with advanced pancreatic cancer presented at tumor board meetings between January 2018 and December 2023. Each patient was screened for clinical trials open for inclusion at the time of the multidisciplinary meeting. Manual screening of eligibility criteria for each patient-trial pair was performed by 2 blinded oncologists to determine potential eligibility (gold standard), with a third oncologist resolving discrepancies. Potential eligibility was also assessed using 3 large language models (ie, GPT-4.5, Claude 3.7 Sonnet, and Mistral-7B-Instruct v0.3). Their performance was compared to the human gold standard using standard evaluation metrics (eg, sensitivity, specificity, precision, recall, and F1-score). Correlations between the risk of failure and the number of words and characters in the criteria were analyzed. The time required to complete the screening was recorded for both human and AI assessments. The number of trials open for enrollment at the time of the tumor board meeting was also recorded as a variable for analysis.

Results: Across 341 patient-trial pairs, the AI models demonstrated high sensitivity, ranging from 83.3% to 92.2%. The risk of failure correlated with both the number of words and the number of characters in the criteria. Overall screening time was significantly longer for the human gold standard assessment (44.70 hours) than for the AI models (2.53-3.15 hours). Patients were more likely to have been included in a clinical trial if the number of trials open for enrollment was higher at the time of the tumor board meeting (P=.02).

Conclusions: Our study highlights the promising performance of AI in clinical trial screening. Future work should explore integration with structured clinical data, such as laboratory values or radiological findings, to improve multimodal comprehension. Expanding the evaluation to a broader range of tumor types and multicenter datasets would improve generalizability. Finally, real-time prospective validation and workflow integration with electronic health records will be critical to assess the feasibility and clinical impact of large language model–assisted screening in daily oncology practice. Addressing these challenges will be essential to move from proof of concept to scalable clinical implementation.
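The evaluation metrics named in the Methods (sensitivity, specificity, precision, recall, and F1-score) all follow from a 2x2 comparison of model decisions against the gold standard. The Python sketch below shows one way to compute them for a list of patient-trial pairs. It is an illustration only, not the authors' analysis code: the function name `screening_metrics`, the `ScreeningMetrics` container, and the toy labels are assumptions introduced here, and the toy labels are not study data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScreeningMetrics:
    """Hypothetical container for the metrics listed in the abstract."""
    sensitivity: float  # identical to recall: TP / (TP + FN)
    specificity: float  # TN / (TN + FP)
    precision: float    # TP / (TP + FP)
    f1: float           # 2*TP / (2*TP + FP + FN)


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN rather than raising when a class is absent from the sample.
    return numerator / denominator if denominator else float("nan")


def screening_metrics(gold: Sequence[bool], predicted: Sequence[bool]) -> ScreeningMetrics:
    """Compare model eligibility calls with gold-standard calls.

    Each position is one patient-trial pair; True means "potentially eligible".
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    tn = sum(1 for g, p in pairs if not g and not p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)

    return ScreeningMetrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        precision=_ratio(tp, tp + fp),
        f1=_ratio(2 * tp, 2 * tp + fp + fn),
    )


# Toy labels for eight patient-trial pairs (made up for illustration).
toy_gold = [True, True, True, False, False, False, False, True]
toy_pred = [True, True, False, False, True, False, False, True]
toy_metrics = screening_metrics(toy_gold, toy_pred)
```

Sensitivity and recall are the same quantity, so the sketch reports it once. Returning NaN for an empty denominator is a design choice for this sketch; a real analysis might prefer to raise an error or report the count explicitly.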
Arthur Claessens
Alizée Simon
Agathe Manchart
JMIR Cancer
Claessens et al. studied this question.
www.synapsesocial.com/papers/699e9177f5123be5ed04ef7d — DOI: https://doi.org/10.2196/80268