With the advancement of deep learning and optical character recognition (OCR) technology, extracting textual content from natural scene images has attracted increasing attention, especially in large-scale image analysis and multimedia content understanding. However, challenges remain in the endto- end processing of real-world images with diverse backgrounds, distortions, and varying font styles. Traditional OCR systems often struggle with character segmentation and low recall in noisy or complex scenes. To address these issues, we propose a complete OCR pipeline composed of three lightweight but effective stages: candidate region extraction using the Maximally Stable Extremal Regions (MSER) algorithm, text region detection using a shallow Convolutional Neural Network (CNN), and text content recognition using a combination of convolution layers, maxpooling layers, batch normalization layers, long short-term memory layers, full connection layers and connectionist temporal classification layers. This design avoids the need for character-level segmentation and allows the model to process arbitrary-length text sequences directly from cropped image regions. By training the detection and recognition models on a custom dataset including over 800,000 synthetic images and real-world scene samples, our approach achieves promising results. The detection module yields a precision of 93.5% and a recall of 88%, while the recognition module attains a characterlevel accuracy of 91.97%. These results demonstrate that the proposed system is capable of handling diverse natural scene text with high reliability and can be effectively applied in practical scenarios such as content indexing or automated annotation.
Building similarity graph...
Analyzing shared references across papers
Loading...
Miao Peng (Wed,) studied this question.
Miao Peng
Building similarity graph...
Analyzing shared references across papers
Loading...