What question did this study set out to answer?

The aim is to develop a framework that improves the prescreening process for clinical trial eligibility using advanced retrieval techniques.

May 30, 2026

Beyond rule-based matching: A locally deployable retrieval-augmented framework for clinical trial prescreening.

Key Points

Validated a Retrieval-Augmented Generation (RAG) architecture on the TREC Clinical Trials dataset of 375,581 trials.
Implemented a four-stage framework that includes semantic retrieval, keyword retrieval, reciprocal rank fusion, and cross-encoder reranking.
Tested the framework on a pooled oncology cohort of 57 patients.
Achieved a Precision@10 of 0.47, a 67.8% relative improvement over the standard keyword search baseline (BM25) of 0.28.
Precision@3 was recorded at 0.50.
Local deployment preserved patient privacy with zero Protected Health Information (PHI) leakage.
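The reciprocal rank fusion stage named in the key points can be illustrated with a minimal sketch. The abstract does not give the fusion constant or any implementation details, so the value k=60 (a commonly used default), the function name, and the trial IDs below are all assumptions made for illustration only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of trial IDs into a single ranking.

    Each trial scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a value
    reported in the abstract.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, trial_id in enumerate(ranking, start=1):
            scores[trial_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))


# Hypothetical rankings from the two retrievers (IDs are placeholders).
semantic = ["NCT-A", "NCT-B", "NCT-C"]  # dense (semantic) retrieval order
keyword = ["NCT-B", "NCT-D", "NCT-A"]   # BM25 (keyword) retrieval order

fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)
```

Trials that both retrievers rank highly accumulate two contributions and rise to the top, which is the behavior the fusion stage is described as providing. In the described framework the fused list would then be passed to the cross-encoder reranker.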

Abstract

e13678

Background: Enrollment in clinical trials is critically hindered by the administrative burden of manually matching patient histories with unstructured eligibility criteria. As a result, eligible patients are frequently overlooked, and adult cancer patient enrollment has stagnated at only 5–7%. Traditional keyword-based search fails to capture semantic nuance, and standard AI models are limited by their static training data. In this study, we validate a Retrieval-Augmented Generation (RAG) architecture that dynamically retrieves the clinical trial protocols relevant to a patient's case and passes this external text to a locally deployed Cross-Encoder model for analysis. This allows the system to ground its eligibility assessments in a real-time review of trial criteria, so that decisions reflect current protocols rather than the static, often outdated knowledge memorized during model training.

Methods: We used the TREC Clinical Trials dataset (2021-22) of 375,581 trials to validate the RAG architecture on a pooled oncology cohort of 57 patients. We implemented a four-stage framework, with each stage designed to resolve a specific barrier to automated matching:

Semantic Retrieval (S-PubMedBert): Captures conceptual clinical meaning (e.g., linking "Lung Cancer" to "SCLC") independent of exact word overlap.
Keyword Retrieval (BM25): Enforces strict matching for critical molecular identifiers (e.g., "HER2-positive") to prevent false positives.
Reciprocal Rank Fusion: Synthesizes semantic and keyword signals to prioritize trials ranked with high confidence by both methods.
Cross-Encoder Reranking (BGE-Reranker-v2-M3): Performs a final assessment of top candidates to verify complex eligibility logic (e.g., exclusion criteria), akin to human chart review.

Results: The framework achieved a retrieval Precision@10 of 0.47, representing a 67.8% relative improvement over a standard keyword search baseline (BM25) of 0.28, and a Precision@3 of 0.50. While cloud-based LLMs may achieve marginally higher precision, they are deployable only in non-HIPAA environments; our system achieves comparable utility with zero Protected Health Information (PHI) leakage.

Conclusions: We conclude that a locally hosted, lightweight Retrieval-Augmented architecture could serve as a superior alternative to standard keyword-based trial matching, balancing specialist-level precision with data privacy. This framework effectively eliminates the manual screening bottleneck, establishing a scalable foundation for automated, real-time patient recruitment.

Metric          RAG     BM25*   Relative Improvement
Precision@3*    0.50    0.33    +51.6%
Precision@10*   0.47    0.28    +67.8%
NDCG@10*        0.67    0.31    +116%

* Precision@k – Proportion of relevant trials identified within the top k results. NDCG@k – Normalized Discounted Cumulative Gain at k. BM25 – Best Matching 25 (standard keyword search baseline).
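The metrics reported in the abstract can be made concrete with a short sketch. The abstract does not specify which NDCG formulation or relevance grading was used, so the linear-gain form, the function names, and the example labels below are assumptions. The ideal ranking here is also built only from the labels of the retrieved list, which is a simplification of the usual definition over all judged trials.

```python
import math


def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant (labels are 0 or 1)."""
    return sum(relevance[:k]) / k


def dcg_at_k(gains, k):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the best possible ordering of `gains`.

    Simplification: the ideal ordering uses only the retrieved labels.
    """
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100


# Hypothetical relevance labels for one patient's top-10 trials.
labels = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
print(precision_at_k(labels, 10))

# Recomputing the Precision@10 gain from the rounded table values.
print(round(relative_improvement(0.47, 0.28), 2))
```

Recomputing from the rounded table values gives roughly 67.9% for Precision@10, 51.5% for Precision@3, and 116.1% for NDCG@10, each within rounding of the reported figures, which were presumably derived from unrounded scores.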