What question did this study set out to answer?

To develop a model for recognizing unconstrained handwritten Japanese character strings without requiring extensive training data or character segmentation.

January 21, 2026

Open Access

Handwritten Character String Recognition Using a String Recognition Transformer

Key Points

To develop a model for recognizing unconstrained handwritten Japanese character strings without requiring extensive training data or character segmentation.
Proposes the String Recognition Transformer (SRT) model
Integrates a convolutional neural network for feature extraction
Utilizes a Transformer encoder-decoder architecture
Employs a sliding window to create overlapping patches
Conducts comparative experiments against other recognition models
Achieved a character error rate (CER) of 0.067
Outperformed a convolutional recurrent neural network, which had a CER of 0.664
Outperformed transformer-based optical character recognition, which had a CER of 0.165
Outperformed handwritten text recognition with Vision Transformer, which had a CER of 0.106
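
The results above are reported as character error rates. The summary does not say how the CER was computed, so the following is only a minimal sketch of the common definition: the total character-level edit distance between recognized and reference strings, divided by the total number of reference characters. The function names and the corpus-level aggregation (rather than a per-string average) are assumptions made for illustration, not details taken from the paper.

```python
from typing import Sequence


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # previous[j] holds the distance between reference[:i-1] and hypothesis[:j].
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(references: Sequence[str],
                         hypotheses: Sequence[str]) -> float:
    """Corpus-level CER (assumed definition): total edits / total reference chars."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    refs = ["東京都", "三重大学"]
    hyps = ["東京部", "三重大学"]
    # One substituted character out of seven reference characters.
    print(character_error_rate(refs, hyps))
```

Under this definition, a CER of 0.067 would correspond to roughly one character-level error per fifteen reference characters.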

Abstract

Improving the accuracy of handwritten character string recognition allows handwritten documents to be converted into digital text. This facilitates camera-based text input, enabling robotic process automation to manage documentation tasks. Although this field has seen significant progress, recognizing handwritten Japanese remains challenging because character segmentation is difficult, the variety of character types is wide, and there are no clear word boundaries. These factors make unconstrained handwritten Japanese string recognition particularly difficult for conventional approaches. Moreover, transformer-based models typically require large amounts of annotated training data. This study proposes and investigates a new String Recognition Transformer (SRT) model capable of recognizing unconstrained handwritten Japanese character strings without relying on explicit character segmentation or a large number of training images. The SRT model integrates a convolutional neural network backbone for robust local feature extraction, a Transformer encoder-decoder architecture, and a sliding window strategy that generates overlapping patches. Comparative experiments show that our method achieved a character error rate (CER) of 0.067, significantly outperforming a convolutional recurrent neural network, transformer-based optical character recognition, and handwritten text recognition with Vision Transformer, which achieved CERs of 0.664, 0.165, and 0.106, respectively. These results confirm the effectiveness and robustness of the approach.
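
The sliding window strategy mentioned in the abstract can be illustrated with a short sketch. The abstract does not give the window width, the stride, or whether the window runs over the raw image or over CNN feature maps, so everything below is an assumption: a horizontal window over a pixel grid stored as a list of rows, with hypothetical helper names `window_starts` and `sliding_patches`. Choosing a stride smaller than the window width is what makes neighbouring patches overlap.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values (assumed grayscale)


def window_starts(total_width: int, window_width: int, stride: int) -> List[int]:
    """Left edges of horizontal windows that cover the full width."""
    if window_width <= 0 or stride <= 0:
        raise ValueError("window_width and stride must be positive")
    if total_width <= window_width:
        return [0]  # a single window already covers the whole image
    starts = list(range(0, total_width - window_width + 1, stride))
    if starts[-1] + window_width < total_width:
        # Add a final window flush with the right edge so no column is lost.
        starts.append(total_width - window_width)
    return starts


def sliding_patches(image: Image, window_width: int, stride: int) -> List[Image]:
    """Cut an image into full-height patches; stride < window_width gives overlap."""
    if not image or not image[0]:
        raise ValueError("image must be non-empty")
    width = len(image[0])
    return [
        [row[start:start + window_width] for row in image]
        for start in window_starts(width, window_width, stride)
    ]


if __name__ == "__main__":
    # A toy 2 x 6 "image"; real string images would be far wider.
    toy = [[0, 1, 2, 3, 4, 5],
           [10, 11, 12, 13, 14, 15]]
    for patch in sliding_patches(toy, window_width=4, stride=2):
        print(patch)
```

In this toy case the two patches share columns 2 and 3. A sequence of overlapping patches like this is the kind of input that lets a model read a string without first cutting it into individual characters.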

Authors

Shunya Rakuka

Kento Morita

Tetsushi Wakabayashi

Journals

Journal of Advanced Computational Intelligence and Intelligent Informatics

Institutions

Mie University
