We present LORE (LLM OCR Robustness Evaluation), a benchmark for measuring how well large language models extract and normalize structured data from corrupted OCR text. Real OCR pipelines produce systematic noise (character substitutions, merged lines, truncated values, and formatting corruption), and the ability of LLMs to correct this noise into clean structured records is underexplored. LORE provides 1,200 synthetic samples across three Indian-context document domains (retail receipts, insurance policies, and hospital visit records), four difficulty tiers ranging from light character noise to adversarial semantic traps, and five evaluation dimensions. We benchmark four models spanning 3B to 70B parameters: Llama 3.2 (3B), Phi 3.5 (3.8B), Qwen 2.5 (7B), and Llama 3.3 (70B). Our central finding is that standard field-presence F1 masks substantial differences in extraction quality: models scoring 0.82–0.95 on F1 diverge dramatically on exact match rate (0.42–0.66) and correction gain (−0.88 to +0.49). Sub-7B models consistently degrade OCR quality rather than correcting it, while only models at 70B scale demonstrate reliable correction capability. LORE is publicly available at https://github.com/ashwin549/lore-benchmark
Ashwin Shetty studied this question.
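To illustrate why field-presence F1 can mask differences that exact match exposes, here is a minimal Python sketch. The two metric definitions are assumptions made for illustration and may differ from the ones LORE actually uses. The function names and the sample record are hypothetical. Correction gain is not sketched because the abstract does not define it.

```python
def field_presence_f1(pred: dict, gold: dict) -> float:
    """F1 over field names only (assumed definition): a field counts as
    correct whenever its key appears, regardless of its value."""
    pred_keys, gold_keys = set(pred), set(gold)
    true_pos = len(pred_keys & gold_keys)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_keys)
    recall = true_pos / len(gold_keys)
    return 2 * precision * recall / (precision + recall)


def exact_match_rate(pred: dict, gold: dict) -> float:
    """Fraction of gold fields whose predicted value matches exactly
    (assumed definition)."""
    if not gold:
        return 0.0
    matches = sum(1 for key, value in gold.items()
                  if key in pred and pred[key] == value)
    return matches / len(gold)


# Hypothetical receipt record, invented for illustration only.
gold = {"store": "Sharma General Store", "total": "1250.00", "date": "2024-03-15"}
# The prediction keeps every field but leaves an OCR error ("1" for "l") in one value.
pred = {"store": "Sharma Genera1 Store", "total": "1250.00", "date": "2024-03-15"}

print(field_presence_f1(pred, gold))  # 1.0
print(exact_match_rate(pred, gold))   # 2 of 3 fields match exactly
```

In this toy case every field is present, so the presence-based score is perfect even though one value still carries an uncorrected OCR error. Only the value-level metric registers it.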