Research on Multimodal Attention-Based Description Generation for Wide-Domain Videos
Abstract: This paper proposes a technique for automatically generating multilingual descriptions of wide-domain videos, built on a deep neural network model with a multimodal attention mechanism. The description generation model consists of end-to-end convolutional neural networks and a bidirectional recurrent neural network; applying the multimodal attention mechanism significantly improves the model's ability to represent video content. A bidirectional recurrent neural network encoder fuses and encodes four kinds of multimodal video features (image, optical flow, C3D, and audio), and an attention-based decoder then decodes the resulting sequence of video features into a multilingual description sequence. The model was evaluated on an open-source video description dataset, and the experimental results demonstrate the effectiveness of the method: the METEOR score improved by 3.31%, the best published result to date. The technique can therefore serve as a useful reference for research in related fields.
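The fusion and attention steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy feature values, the fusion by per-time-step concatenation, and the dot-product scoring are all assumptions chosen for illustration, and the bidirectional recurrent encoder is omitted, so the fused vectors stand in for encoder states.

```python
import math


def fuse_modalities(*streams):
    """Concatenate per-time-step feature vectors from several modalities.

    Assumption: every modality has already been sampled to the same
    number of time steps; the paper's actual fusion scheme may differ.
    """
    lengths = {len(s) for s in streams}
    if len(lengths) != 1:
        raise ValueError("all modality streams need the same number of time steps")
    return [[x for vec in step for x in vec] for step in zip(*streams)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Dot-product attention (an assumed scoring function).

    Returns the attention weights over time steps and the context
    vector, i.e. the weighted sum of the states.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * state[d] for w, state in zip(weights, states))
               for d in range(dim)]
    return weights, context


# Hypothetical toy features: 3 time steps, tiny dimensions per modality.
image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
flow = [[0.5], [0.1], [0.9]]
c3d = [[0.2], [0.4], [0.6]]
audio = [[0.0], [1.0], [0.0]]

fused = fuse_modalities(image, flow, c3d, audio)  # 3 steps x 5 features
query = [1.0, 0.0, 0.0, 0.0, 0.0]  # stand-in for a decoder state
weights, context = attend(query, fused)
print(weights)
print(context)
```

In this toy run the query matches only the first image feature, so the first and third time steps receive equal weight and the second receives less. In a full model the decoder would recompute the weights at every output word.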
Citation: Du, X.T. (2022) Research of Multimodal Attention-Based Description Generation of Videos in Wide Domain. Computer Science and Application, 12(10), 2225-2232. https://doi.org/10.12677/CSA.2022.1210226
