Research on End-to-End Chinese Text Recognition Method in Complex Scenes
DOI: 10.12677/HJDM.2023.132015
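Authors: Shuai Zihan, Hu Jinrong, Lang Zixin, Luo Yuemei, Li Guigang: School of Computer Science, Chengdu University of Information Technology, Chengdu, Sichuan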
Keywords: End-to-End, Text Recognition, Transformer, Deep Learning
Abstract: End-to-end scene text recognition has attracted considerable attention in recent years because it exploits the inherent synergy between scene text detection and recognition. However, recent state-of-the-art methods usually combine detection and recognition only by sharing a backbone, and they handle scene text poorly because its scale and aspect ratio vary so widely. In this paper, we propose a new end-to-end scene text recognition framework called ES-Transformer. Unlike previous methods that learn scene text holistically, our approach performs scene text recognition from a small number of representative features, which avoids background interference and reduces computational cost. Specifically, we use a basic feature pyramid network for feature extraction and then employ Swin-Transformer to model the relationships between the sampled features, effectively partitioning them into reasonable groups. The method improves recognition accuracy while reducing computational complexity, and it no longer relies on complex post-processing modules. Qualitative and quantitative experiments on Chinese datasets show that ES-Transformer outperforms existing methods.
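The idea of recognizing text from a few representative features, rather than from the whole feature map, can be illustrated with a toy sketch. This is not the authors' ES-Transformer: plain single-head self-attention stands in for Swin-Transformer's windowed attention, a hand-made grid stands in for feature pyramid network output, and every name here (`sample_top_k`, `self_attention`, the score map) is a hypothetical choice made for illustration.

```python
import math


def sample_top_k(scores, features, k):
    """Keep the k grid positions with the highest text-likelihood score.

    scores:   dict (row, col) -> float, an assumed stand-in for a text score map
    features: dict (row, col) -> list[float], an assumed stand-in for FPN features
    """
    ranked = sorted(scores, key=lambda pos: scores[pos], reverse=True)
    kept = ranked[:k]
    return kept, [features[pos] for pos in kept]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention with identity projections.

    A simplification: Swin-Transformer restricts attention to shifted windows
    and learns query/key/value projections; neither is modelled here.
    """
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    weights = []
    for query in vectors:
        logits = [sum(a * b for a, b in zip(query, key)) / scale for key in vectors]
        weights.append(softmax(logits))
    outputs = [
        [sum(w * value[d] for w, value in zip(row, vectors)) for d in range(dim)]
        for row in weights
    ]
    return weights, outputs


# Toy 2x3 grid: two text-like clusters plus two low-score background cells.
scores = {(0, 0): 0.9, (0, 1): 0.8, (0, 2): 0.1,
          (1, 0): 0.05, (1, 1): 0.7, (1, 2): 0.85}
features = {(0, 0): [4.0, 0.0], (0, 1): [3.5, 0.2], (0, 2): [0.1, 0.1],
            (1, 0): [0.0, 0.1], (1, 1): [0.2, 3.8], (1, 2): [0.0, 4.0]}

positions, sampled = sample_top_k(scores, features, k=4)
weights, refined = self_attention(sampled)
```

In this toy run the two background cells are discarded before attention is computed, so attention cost depends on the number of sampled features rather than on the size of the grid. Each kept feature then places most of its attention weight on features from its own cluster, which is the kind of grouping the abstract attributes to the relation-modelling stage.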
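Citation: Shuai Zihan, Hu Jinrong, Lang Zixin, Luo Yuemei, Li Guigang. Research on End-to-End Chinese Text Recognition Method in Complex Scenes [J]. Hans Journal of Data Mining, 2023, 13(2): 154-164. https://doi.org/10.12677/HJDM.2023.132015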
