CIDER extracted structured clinical data from Hungarian pathology records with 77.1% to 99.4% accuracy across key variables including sex, T stage, N stage, and primary tumor organ.
Does the CIDER LLM system accurately extract structured clinical data from unstructured Hungarian-language pathology records compared to manual extraction?
A locally deployed, open-source LLM system can achieve near-expert-level accuracy in structured data extraction from complex, non-English medical texts.
Abstract

The analysis of unstructured medical records is a crucial challenge in clinical research and healthcare. Large Language Models (LLMs) offer a transformative opportunity to extract structured information from narrative text; however, their use in medical environments is limited by security, ethical, and reproducibility issues. Here we present CIDER (ClinIcal Data ExtractoR), a locally deployed, open-source LLM-based system designed for the secure analysis of medical documentation. CIDER operates through an automated pipeline integrating vLLM-based inference, predefined data schemas, and prompt-engineered extraction rules to convert unstructured clinical text into structured variables. The system processes batch uploads, parses reports using a fine-tuned model, and generates standardized output tables for direct analytical use. We evaluated CIDER's ability to extract structured clinical data from real-world Hungarian-language pathology and histology records. Using the Qwen3-VL-32B-FP8 model as the backbone, we analyzed 2046 pathological records and validated the model's outputs across six key clinical variables: sex, T stage, N stage, primary tumor organ, year of surgery, and tumor size. The extracted data were compared with manually mined data. Where manual data were available, extraction accuracy was very high for sex (99.4%, 1971/1982 identical), T stage (95.34%, 879/922), N stage (92.19%, 437/474), year of surgery (97.94%, 1998/2040), and primary tumor organ (95.52%, 1771/1854). Accuracy was lower for largest tumor size, at 77.05% (1333/1730 identical). Notably, CIDER also retrieved clinically relevant information in cases where manual annotations were missing, identifying additional instances for sex (n=64), T stage (n=780), N stage (n=213), tumor size (n=291), year of surgery (n=6), and primary tumor organ (n=15). In summary, CIDER demonstrated strong performance across the evaluated parameters.
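The comparison against manually mined data described above can be illustrated with a minimal Python sketch. It is not CIDER's actual evaluation code: the function name `compare_variable`, the `normalize` helper, and the exact-match rule after trimming and lowercasing are assumptions made for illustration only. The sketch reports accuracy only over records where a manual value exists, and separately counts records where the extractor supplied a value that the manual annotation lacked.

```python
from typing import Optional, Sequence


def normalize(value: str) -> str:
    """Hypothetical normalization: trim whitespace and ignore case."""
    return value.strip().lower()


def compare_variable(
    manual: Sequence[Optional[str]],
    extracted: Sequence[Optional[str]],
) -> dict:
    """Compare one variable (e.g. T stage) across all records.

    None marks a missing value. Accuracy is computed only over records
    that have a manual value; records with no manual value but an
    extracted one are counted as additional instances.
    """
    if len(manual) != len(extracted):
        raise ValueError("manual and extracted must have the same length")

    compared = identical = additional = 0
    for m, e in zip(manual, extracted):
        if m is not None:
            compared += 1
            if e is not None and normalize(m) == normalize(e):
                identical += 1
        elif e is not None:
            additional += 1

    accuracy = identical / compared if compared else None
    return {
        "compared": compared,
        "identical": identical,
        "additional": additional,
        "accuracy": accuracy,
    }


# Toy data, invented for illustration (not taken from the study).
manual_t = ["T2", None, "T1", "T3"]
extracted_t = ["t2", "T1", "T1", None]
result = compare_variable(manual_t, extracted_t)
```

With the toy data, three records have a manual value, two of them match, and one record gains a value that was missing from the manual annotation. The reported percentages follow the same ratio; for example, `round(100 * 1333 / 1730, 2)` reproduces the 77.05% given for tumor size.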
These results show that a locally deployed, open-source LLM system can achieve near-expert-level accuracy in structured data extraction from complex, non-English medical texts. By operating entirely within institutional infrastructure, CIDER ensures full data sovereignty and provides a scalable solution for automated medical record interpretation, supporting research, registry development, and clinical decision-making in multilingual healthcare environments. The CIDER platform is publicly accessible at https://llm.gyorffylab.com/cider.

Citation Format: Mate Posta, Aida Figler, Zsofia Dobolyi, Balazs Gyorffy. From chaos to columns: High-accuracy clinical data extraction with CIDER [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 2738.