November 23, 2025 | Open Access

A Two-Stage End-to-End Framework for Robust Scene Text Spotting with Self-Calibrated Detection and Contextual Recognition

Abstract

End-to-end scene text spotting, which detects and recognizes text in natural images, still faces significant challenges, particularly arbitrarily shaped text, complex backgrounds, and computational efficiency requirements. This paper proposes a novel end-to-end OCR framework that combines a strong detection network with advanced recognition models. For text detection, we develop the Text Contrast Self-Calibrated Network (TextCSCN), which employs pixel-wise supervised contrastive learning to extract more discriminative features. TextCSCN addresses long-range dependency modeling and limited receptive fields through self-calibrated convolutions and Global Convolutional Networks (GCNs). We further introduce an efficient Mamba-based bidirectional module for boundary refinement, improving both accuracy and speed. For text recognition, our framework employs a Swin Transformer backbone with Bidirectional Feature Pyramid Networks (BiFPNs) for multi-scale feature extraction. We propose a Pre-Gated Contextual Attention Gate (PCAG) mechanism that fuses visual and linguistic information while reducing noise and uncertainty in multi-modal integration. Experiments on challenging benchmarks, including TotalText and CTW1500, demonstrate the effectiveness of our approach. Our detection module achieves state-of-the-art performance with an F-score of 88.21% on TotalText, and the complete end-to-end system shows comparable improvements in recognition accuracy, establishing new benchmarks for scene text spotting.
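
To make the idea of pixel-wise supervised contrastive learning concrete, here is a minimal sketch in plain Python. It uses the generic supervised contrastive loss, in which each pixel embedding is pulled toward embeddings with the same label (text or background) and pushed away from the rest. This is an assumption for illustration only: the abstract does not specify TextCSCN's actual loss, and the function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pixel_supcon_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss over per-pixel embeddings.

    features: list of embedding vectors, one per sampled pixel (hypothetical input).
    labels:   list of class ids per pixel, e.g. 1 = text, 0 = background.
    NOTE: this is the standard SupCon formulation, not necessarily the
    loss used by the paper.
    """
    z = [l2_normalize(f) for f in features]
    n = len(z)
    total, num_anchors = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Temperature-scaled cosine similarity between anchor i and every other pixel.
        sims = [sum(a * b for a, b in zip(z[i], z[j])) / temperature for j in others]
        # Positions (within `others`) of pixels sharing the anchor's label.
        positives = [k for k, j in enumerate(others) if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(sims)
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        # Average negative log-probability of the positives for this anchor.
        total += -sum(sims[k] - log_denom for k in positives) / len(positives)
        num_anchors += 1
    return total / num_anchors if num_anchors else 0.0


# Toy example: two "text" pixels and two "background" pixels in a 2-D embedding space.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
well_separated = pixel_supcon_loss(embeddings, [1, 1, 0, 0])
poorly_separated = pixel_supcon_loss(embeddings, [1, 0, 1, 0])
print(well_separated < poorly_separated)
```

The loss is small when same-label pixels cluster together and large when they do not, which is the sense in which such a term encourages more discriminative features.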
