OCR engines operating at reduced resolution produce degraded text with systematic character-level errors: broken words, misrecognized characters, and corrupted numerical sequences. Current post-processing approaches rely on fuzzy matching algorithms that achieve moderate quality at near-zero latency but fail on domain-specific terminology and structured sequences absent from predefined vocabularies. Prior work established that a Ray-based parallel OCR pipeline achieves a 69.9× speedup on 11,368 banking document pages but produces a Character Error Rate (CER) of 24.78% at 100 DPI, a quality gap that fuzzy matching cannot close. This paper proposes LLM-Guided Sequence Reconstruction (LLM-GSR), a post-processing architecture that replaces dictionary-based correction with a large language model operating as a sequence predictor over degraded OCR output. The key insight is that OCR degradation produces incomplete sequences that a language model can reconstruct through next-token prediction conditioned on domain context, which is precisely the task for which language models are optimized. We formalize the boundary of applicability through a Reconstruction Precondition grounded in information theory, and validate the architecture on the same 11,368-page banking corpus, measuring CER reduction, latency overhead, and throughput under parallel inference.
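Since Character Error Rate is the headline quality metric here, a minimal sketch of how it is commonly computed may help. It assumes the conventional definition (character-level Levenshtein distance divided by the reference length); the text above does not state the exact formula used, and the function names `levenshtein` and `cer` and the sample strings are illustrative only.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `hyp` into `ref`."""
    # prev[j] = distance between the ref prefix processed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # drop a character of ref
                curr[j - 1] + 1,     # drop a character of hyp
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate, assumed here to be edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Illustrative (made-up) strings: two substitutions in a 13-character reference.
print(cer("ACCOUNT 12345", "ACC0UNT 1234S"))
```

Under this definition, two character substitutions in a 13-character reference give a CER of 2/13. A corpus-level figure is usually obtained by summing edit distances and reference lengths over all pages before dividing, rather than averaging per-page rates.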
Alejandro Jaime (Sun,) studied this question.