To establish an open-source, FHIR-compatible platform that standardizes the assessment of automated entity extraction for structured radiological reporting. A modular web-based platform was developed, featuring an interactive assessment interface for automated entity extraction from free-text radiological reports and an 11-category error taxonomy. A total of 119 chest X-ray reports were processed through a 45-field structured template (39 single-choice and 6 free-text keys) using zero-shot prompting with Mistral-Small-3.2-24B and Llama-3.3-70B. Two medical researchers evaluated 5,355 extracted items, assigning acceptance ratings with error classifications and corrections through consensus adjudication. Processing and rating times, field applicability, acceptance rates, error class distributions, text coverage, and the Matthews Correlation Coefficient (MCC) for 31 binary-classifiable keys were measured. The two models were compared using paired t-tests, Wilcoxon signed-rank tests, and a paired-difference linear mixed model (LMM) accounting for patient-level clustering. All extractions produced valid, schema-conformant outputs. Median extraction time was 104.8 s (Mistral-Small-3.2-24B). Median rating time was 111.5 s per report. Field applicability rates varied across fields (median 19.3%). The overall acceptance rate was 97.5%, decreasing to 76.0% for non-empty fields. Content errors dominated (84.8% of rejected extractions), primarily missing information (55.0%; 3.8% of all extractions) and unsupported inferences (17.3%). Template coverage averaged 93.9%. MCC was 0.799. Mistral-Small-3.2-24B significantly outperformed Llama-3.3-70B (LMM: mean difference = 0.098, 95% CI 0.079 to 0.117, p < 0.001; paired t-test: p < 0.001). Our open-source, FHIR-compatible platform provides infrastructure for retrospective validation and systematic evaluation of automated entity extraction methods in radiological reports, enabling reproducible cross-institutional comparison of extraction performance.
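For readers unfamiliar with the MCC reported for the binary-classifiable keys, the following is a minimal sketch of how the metric can be computed from confusion counts. The function name `mcc`, the zero-denominator convention, and the example counts are illustrative assumptions and are not taken from the platform's code or from the study data.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews Correlation Coefficient from binary confusion counts.

    Hypothetical helper for illustration; not the platform's implementation.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0.0 when a marginal total is zero,
    # since the coefficient is undefined in that case.
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Made-up counts for one binary key across 119 reports (not study data).
example = mcc(tp=40, fp=5, fn=8, tn=66)
print(f"MCC for the example key: {example:.3f}")
```

MCC ranges from -1 (total disagreement) through 0 (chance-level agreement) to 1 (perfect agreement) and uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance than plain accuracy when many fields are rarely applicable.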
Processing latencies on research-grade hardware (median 104.8 s for Mistral-Small-3.2-24B) reflect infrastructure constraints rather than fundamental limitations; clinical deployment requires further optimization beyond this pilot study.

• Open-source assessment platform for standardized evaluation of automated entity extraction from radiological reports
• An 11-category error taxonomy enables systematic failure analysis of NLP extractions
• The smaller model (Mistral-24B) outperformed the larger model (Llama-70B) by 22.6% in a feasibility study on German chest X-ray reports with zero-shot prompting
• Modular architecture enables reproducible cross-institutional benchmarking
Römer et al. studied this question.