What question did this study set out to answer?

This research aims to create a standardized platform for assessing automated entity extraction in radiology. It focuses on improving accuracy and comparison across institutions.

April 16, 2026 | Open Access

RAVEN: An Open-Source Assessment Framework for Automated Entity Extraction in Structured Radiology Reporting

Key Points

Developed an interactive, web-based, FHIR-compatible assessment platform.
Evaluated 119 chest X-ray reports using a structured 45-field template.
Utilized two models (Mistral-Small-3.2-24B and Llama-3.3-70B) for automated extraction.
Measured metrics including acceptance rates, error taxonomy, and Matthews Correlation Coefficient.
Conducted statistical analyses using paired t-tests and Wilcoxon signed-rank tests.
Overall acceptance rate for extractions was 97.5%, decreasing to 76.0% for non-empty fields.
Mistral-Small-3.2-24B significantly outperformed Llama-3.3-70B (linear mixed model: mean difference 0.098, 95% CI 0.079 to 0.117, p<0.001).
Content errors comprised 84.8% of rejections, primarily missing information (55.0%) and unsupported inferences (17.3%).
Median extraction time was 104.8 seconds (Mistral-Small-3.2-24B); median rating time was 111.5 seconds per report.
Template coverage averaged 93.9% across evaluated fields.
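The gap between the 97.5% overall acceptance rate and the 76.0% rate for non-empty fields is easier to follow with a small calculation. The sketch below is illustrative only: the `RatedItem` record, its field names, and the toy data are assumptions and do not reflect the platform's actual schema or the study's data. It computes the acceptance rate once over all rated items and once over only those items where the model returned a value.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class RatedItem:
    """One extracted template field with its reviewer verdict (hypothetical layout)."""
    field: str
    value: Optional[str]  # None or "" means the model left the field empty
    accepted: bool


def acceptance_rate(items: Iterable[RatedItem]) -> Optional[float]:
    """Fraction of accepted items, or None when there is nothing to rate."""
    items = list(items)
    if not items:
        return None
    return sum(item.accepted for item in items) / len(items)


def non_empty(items: Iterable[RatedItem]) -> List[RatedItem]:
    """Keep only items where the model actually extracted a value."""
    return [item for item in items if item.value not in (None, "")]


# Toy data for illustration, not taken from the study.
items = [
    RatedItem("pleural_effusion", None, True),      # correctly left empty
    RatedItem("pneumothorax", None, True),          # correctly left empty
    RatedItem("cardiomegaly", "present", True),     # extracted and accepted
    RatedItem("consolidation", "present", False),   # extracted and rejected
]

overall = acceptance_rate(items)
filled_only = acceptance_rate(non_empty(items))
print(f"overall: {overall:.2f}, non-empty only: {filled_only:.2f}")
```

With a median field applicability of 19.3%, most template fields are empty for a given report, so correctly empty fields can dominate the overall rate. Restricting the denominator to non-empty fields removes that effect, which is presumably why the second figure is lower.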

Abstract

To establish an open-source, FHIR-compatible platform that standardizes the assessment of automated entity extraction for structured radiological reporting.

A modular web-based platform featuring an interactive assessment interface for automated entity extraction from free-text radiological reports was developed, including an 11-category error taxonomy. A total of 119 chest X-ray reports were processed through a 45-field structured template (39 single-choice, 6 free-text keys) using zero-shot prompting with Mistral-Small-3.2-24B and Llama-3.3-70B. Two medical researchers evaluated 5,355 extracted items, assigning acceptance ratings with error classifications and corrections through consensus adjudication. Processing and rating times, field applicability, acceptance rates, error class distributions, text coverage, and the Matthews Correlation Coefficient (MCC) for 31 binary-classifiable keys were measured. The two models were compared using paired t-tests, Wilcoxon signed-rank tests, and a paired-difference linear mixed model (LMM) accounting for patient-level clustering.

All extractions produced valid schema-conformant outputs. Median extraction time was 104.8 s (Mistral-Small-3.2-24B). Median rating time was 111.5 s per report. Field applicability rates varied across fields (median 19.3%). Overall acceptance rate was 97.5%, decreasing to 76.0% for non-empty fields. Content errors dominated (84.8% of rejected extractions), primarily missing information (55.0%; 3.8% of all extractions) and unsupported inferences (17.3%). Template coverage averaged 93.9%. MCC was 0.799. Mistral-Small-3.2-24B significantly outperformed Llama-3.3-70B (LMM: mean difference=0.098, 95% CI 0.079 to 0.117, p<0.001; paired t-test: p<0.001).

Our open-source, FHIR-compatible platform provides infrastructure for retrospective validation and systematic evaluation of automated entity extraction methods in radiological reports, enabling reproducible cross-institutional comparison of extraction.
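For readers unfamiliar with the Matthews Correlation Coefficient reported for the binary-classifiable keys, the following is a minimal sketch of the standard formula computed from confusion-matrix counts. The function names and the toy labels are assumptions for illustration; the abstract does not describe how the platform implements the metric.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(predicted: Sequence[bool],
                     reference: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Count (tp, tn, fp, fn) for paired binary labels."""
    tp = tn = fp = fn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1
        elif not p and not r:
            tn += 1
        elif p and not r:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy labels for one binary key across six reports (not study data).
predicted = [True, True, False, False, False, True]
reference = [True, False, False, False, True, True]

score = mcc(*confusion_counts(predicted, reference))
print(f"MCC = {score:.3f}")
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance than plain accuracy. That property matters for a template in which most findings are absent in most reports.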

Processing latencies on research-grade hardware (median 104.8 s for Mistral-Small-3.2-24B) reflect infrastructure constraints rather than fundamental limitations; clinical deployment requires further optimization beyond this pilot study.

• Open-source assessment platform for standardized evaluation of automated entity extraction from radiological reports
• 11-category error taxonomy enables systematic failure analysis of NLP extractions
• Smaller model (Mistral-24B) outperformed larger model (Llama-70B) by 22.6% in a feasibility study on German chest X-ray reports with zero-shot prompting
• Modular architecture enables reproducible cross-institutional benchmarking
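The paired model comparison mentioned in the abstract can be illustrated with a short sketch of the paired t statistic, computed on per-report score differences between two models. This is a simplified stand-in under stated assumptions: the function name and the toy scores are hypothetical, it uses only the Python standard library, and it does not model patient-level clustering as the study's linear mixed model does.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(scores_a: Sequence[float],
                       scores_b: Sequence[float]) -> Tuple[float, float, int]:
    """Return (mean difference, t statistic, degrees of freedom) for paired scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd / math.sqrt(n))
    return mean_diff, t, n - 1


# Toy per-report acceptance rates for two models (not study data).
model_a = [0.9, 0.8, 1.0, 0.7]
model_b = [0.8, 0.6, 0.9, 0.7]

mean_diff, t, df = paired_t_statistic(model_a, model_b)
print(f"mean difference = {mean_diff:.3f}, t = {t:.3f}, df = {df}")
```

Turning the statistic into a p-value requires the t distribution, which the standard library does not provide; a package such as SciPy (`scipy.stats.ttest_rel`, `scipy.stats.wilcoxon`) would typically be used for that step.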
