Natural Scene Text Detection Algorithm Based on Attention Mechanism
Abstract: Mainstream scene text detection algorithms do not make full use of high-level and low-level information during multi-scale feature fusion, which leads to missed text, and they often mislocate the boundaries of long text. To address both problems, this paper proposes a scene text detection algorithm that combines attention-based multi-scale feature fusion with residual coordinate attention. An attentional feature fusion module is embedded in the feature pyramid; by correcting inconsistencies between features at different scales, it extracts more detail and reduces missed detections. After fusion, a residual coordinate attention module captures direction-aware and position-sensitive information along the vertical and horizontal directions and refines boundary information, which improves the detection of long text. Experiments on the public datasets ICDAR 2015 and Total-Text show that the algorithm achieves F-measures of 85.5% and 83.6% at inference speeds of 22.4 FPS and 40 FPS, respectively. Compared with DBNet, inference is slightly slower, but the F-measure improves by 3.2% and 0.8%, respectively.
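The abstract does not specify the layers of the residual coordinate attention module, so the following is only a minimal sketch of the general idea it describes: pool the feature map along each spatial direction, turn the pooled values into gates, and add the gated features back to the input through a residual connection. The function name `residual_directional_gate` is hypothetical, and the learned 1x1 convolutions of a real coordinate attention block are replaced here by identity maps, which is an assumption made to keep the example dependency-free.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def residual_directional_gate(feature_map):
    """Apply direction-wise gating with a residual connection.

    feature_map is a nested list indexed as [channel][row][column].
    Assumption: learned 1x1 convolutions are omitted (identity maps),
    so this is an illustration, not the paper's module.
    """
    refined = []
    for channel in feature_map:
        height, width = len(channel), len(channel[0])
        # Pool along the horizontal direction: one value per row.
        row_mean = [sum(row) / width for row in channel]
        # Pool along the vertical direction: one value per column.
        col_mean = [
            sum(channel[r][c] for r in range(height)) / height
            for c in range(width)
        ]
        # Each gate lies in (0, 1) and depends on position along one axis.
        row_gate = [sigmoid(v) for v in row_mean]
        col_gate = [sigmoid(v) for v in col_mean]
        # Residual connection: input plus the attention-weighted input.
        refined.append([
            [channel[r][c] * (1.0 + row_gate[r] * col_gate[c])
             for c in range(width)]
            for r in range(height)
        ])
    return refined


# Toy single-channel map: an empty row above a row of "text" responses.
feature = [[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]]
refined = residual_directional_gate(feature)
```

In this toy input the zero row stays zero, while every entry of the active row is amplified to a value between 1 and 2, because the product of the two gates always lies strictly between 0 and 1. The output keeps the input's shape, which is what allows such a block to be placed after feature fusion without changing the rest of the network.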
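Cite this article: Wang Xianwei, Hong Zhiyong, Yu Wenhua, Wang Huiwu, Wu Zhuolin. Natural Scene Text Detection Algorithm Based on Attention Mechanism[J]. Computer Science and Application, 2022, 12(11): 2608-2618. https://doi.org/10.12677/CSA.2022.1211265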
