What question did this study set out to answer?

The aim is to evaluate how well large language models can extract structured data from corrupted OCR text.

March 16, 2026 | Open Access

LORE: LLM OCR Robustness Evaluation

Key Points

Developed a benchmark called LORE for evaluating OCR data extraction capabilities.
Generated 1,200 synthetic samples across three document types: retail receipts, insurance policies, and hospital records.
Implemented four difficulty tiers, ranging from light character noise to adversarial semantic traps.
Benchmarked four models spanning 3B to 70B parameters: Llama 3.2 (3B), Phi 3.5 (3.8B), Qwen 2.5 (7B), and Llama 3.3 (70B).
Field-presence F1 scores (0.82–0.95) masked significant extraction quality differences.
Exact match rates varied from 0.42 to 0.66 across models.
Correction gain ranged from −0.88 to +0.49, so some models made the OCR text worse rather than better.
Sub-7B models consistently degraded OCR quality, while only the 70B-scale model showed reliable correction capability.
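
The gap between field-presence F1 and exact match can be illustrated with a small sketch. The record below is invented for illustration, and the function names are hypothetical. The source does not define correction gain, so `correction_gain` here is only one plausible reading (similarity of the model output to the ground truth minus similarity of the raw OCR text to the ground truth), not necessarily the formula LORE uses.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional

Record = Dict[str, Optional[str]]


def _filled(record: Record) -> set:
    """Names of fields that carry a non-empty value."""
    return {k for k, v in record.items() if v not in (None, "")}


def field_presence_f1(pred: Record, gold: Record) -> float:
    """F1 over field names only; the values themselves are never compared."""
    pred_fields, gold_fields = _filled(pred), _filled(gold)
    tp = len(pred_fields & gold_fields)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_fields)
    recall = tp / len(gold_fields)
    return 2 * precision * recall / (precision + recall)


def exact_match_rate(pred: Record, gold: Record) -> float:
    """Share of gold fields whose predicted value equals the gold value."""
    gold_fields = _filled(gold)
    if not gold_fields:
        return 0.0
    hits = sum(1 for k in gold_fields if pred.get(k) == gold[k])
    return hits / len(gold_fields)


def _similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def correction_gain(ocr_value: str, pred_value: str, gold_value: str) -> float:
    """ASSUMED definition: positive if the model moved the value closer
    to the ground truth than the raw OCR text was, negative otherwise."""
    return _similarity(pred_value, gold_value) - _similarity(ocr_value, gold_value)


# Invented example: every field is present, but two values keep OCR errors.
gold = {"merchant": "Sharma Stores", "total": "1250.00", "date": "2025-01-14"}
pred = {"merchant": "Sharma Stores", "total": "l250.OO", "date": "2025-0l-14"}

f1 = field_presence_f1(pred, gold)
em = exact_match_rate(pred, gold)
print(f"field-presence F1 = {f1:.2f}, exact match = {em:.2f}")

# A model that repairs the value versus one that corrupts it further.
print(correction_gain("l250.OO", "1250.00", "1250.00") > 0)
print(correction_gain("1250.O0", "l25O.OO", "1250.00") < 0)
```

In this toy case a prediction can fill every expected field and still get most values wrong, which is the kind of divergence the key points describe.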

Abstract

We present LORE (LLM OCR Robustness Evaluation), a benchmark for measuring how well large language models extract and normalize structured data from corrupted OCR text. Real OCR pipelines produce systematic noise (character substitutions, merged lines, truncated values, and formatting corruption), and the ability of LLMs to correct this noise into clean structured records is underexplored. LORE provides 1,200 synthetic samples across three Indian-context document domains (retail receipts, insurance policies, and hospital visit records), four difficulty tiers ranging from light character noise to adversarial semantic traps, and five evaluation dimensions. We benchmark four models spanning 3B to 70B parameters: Llama 3.2 (3B), Phi 3.5 (3.8B), Qwen 2.5 (7B), and Llama 3.3 (70B). Our central finding is that standard field-presence F1 masks substantial differences in extraction quality: models scoring 0.82–0.95 on F1 diverge dramatically on exact match rate (0.42–0.66) and correction gain (−0.88 to +0.49). Sub-7B models consistently degrade OCR quality rather than correcting it, while only models at 70B scale demonstrate reliable correction capability. LORE is publicly available at https://github.com/ashwin549/lore-benchmark
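
Three of the noise types named in the abstract (character substitutions, truncated values, and merged lines) can be sketched as simple text transforms. This is an illustrative sketch only, not LORE's generator: the confusion table, the function names, the `rate` parameter, and the sample receipt are all assumptions made for this example, and adversarial semantic traps are not modeled.

```python
import random

# Hypothetical confusion table; real OCR confusions depend on engine and font.
CONFUSIONS = {"0": "O", "1": "l", "5": "S", "8": "B"}


def substitute_chars(text: str, rate: float, rng: random.Random) -> str:
    """Swap confusable characters for their look-alikes with probability `rate`."""
    return "".join(
        CONFUSIONS[ch] if ch in CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )


def truncate_values(text: str, rate: float, rng: random.Random, keep: int = 3) -> str:
    """On some 'key: value' lines, cut the value down to `keep` characters."""
    out = []
    for line in text.split("\n"):
        key, sep, value = line.partition(":")
        if sep and rng.random() < rate:
            line = key + sep + value[:keep]
        out.append(line)
    return "\n".join(out)


def merge_lines(text: str, rate: float, rng: random.Random) -> str:
    """Drop some newlines so that adjacent lines run together."""
    lines = text.split("\n")
    merged = lines[0]
    for line in lines[1:]:
        merged += ("" if rng.random() < rate else "\n") + line
    return merged


def corrupt(text: str, rate: float, seed: int = 0) -> str:
    """Apply the three transforms in sequence with a seeded generator."""
    rng = random.Random(seed)
    text = substitute_chars(text, rate, rng)
    text = truncate_values(text, rate, rng)
    return merge_lines(text, rate, rng)


# Invented sample record, used only to exercise the transforms.
clean = "Merchant: Sharma Stores\nDate: 2025-01-14\nTotal: 1250.00"
print(corrupt(clean, rate=0.3, seed=7))
```

Raising `rate` gives a rough stand-in for a harder difficulty tier. Seeding the generator keeps each corrupted sample reproducible.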

Cite This Study

Ashwin Shetty studied this question.

synapsesocial.com/papers/69b79ea18166e15b153ac2f4 https://doi.org/10.5281/zenodo.19018392
