|
[1]
|
Barbu, A., Bridge, A., Burchill, Z., et al. (2012) Video in Sentences Out. 28th Conference on Uncertainty in Artificial Intelligence, Catalina Island, 14-18 August 2012, 274-283.
|
|
[2]
|
Venugopalan, S., Xu, H., Donahue, J., et al. (2015) Translating Videos to Natural Language Using Deep Recurrent Neural Networks. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, May-June 2015, 1494-1504.
|
|
[3]
|
Venugopalan, S., Rohrbach, M., Donahue, J., et al. (2015) Sequence to Sequence - Video to Text. IEEE International Conference on Computer Vision, Santiago, 7-13 December 2015, 4534-4542.
|
|
[4]
|
Tang, P.J. and Wang, H.L. (2022) From Video to Language: A Survey of Video Captioning and Description. Acta Automatica Sinica, 48(2), 375-397. (In Chinese)
|
|
[5]
|
Chen, X. and Zitnick, C.L. (2015) Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, 7-12 June 2015, 2422-2431.
|
|
[6]
|
Rohrbach, M., Qiu, W., Titov, I., et al. (2013) Translating Video Content to Natural Language Descriptions. IEEE International Conference on Computer Vision, Sydney, 1-8 December 2013, 433-440.
|
|
[7]
|
Fu, Y., Ma, Y. and Ye, O. (2021) Video Description Method Fusing Deep Learning and Visual Text. Science Technology and Engineering, 21(14), 5855-5861. (In Chinese)
|
|
[8]
|
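Sun, H.L., Li, Y.G., Ji, X.L., Wang, P.Y. and Wu, X.X. (2020) Video Event Description Based on Deep Neural Networks and Self-Attention. Computer Knowledge and Technology (Academic Edition), 16(33), 187-189. (In Chinese)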
|
|
[9]
|
Xu, H., Venugopalan, S., Ramanishka, V., et al. (2015) A Multi-Scale Multiple Instance Video Description Network. 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 7-13 December 2015, 272-279.
|
|
[10]
|
Pasunuru, R. and Bansal, M. (2017) Multi-Task Video Captioning with Video and Entailment Generation. Meeting of the Association for Computational Linguistics, Vancouver, 30 July-4 August 2017, 1273-1283.
|
|
[11]
|
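Wang, J.J., Zeng, S.Y., Li, W.H., et al. (2021) Attention-Based Video Description Model Using Dilated Convolution. Electronic Measurement Technology, 44(23), 99-104. (In Chinese)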
|
|
[12]
|
Jin, Q., Chen, J., Chen, S., et al. (2016) Describing Videos Using Multi-Modal Fusion. ACM on Multimedia Conference, Amsterdam, 15-19 October 2016, 1087-1091.
|
|
[13]
|
Ramanishka, V., Das, A., Dong, H.P., et al. (2016) Multimodal Video Description. ACM on Multimedia Conference, Amsterdam, 15-19 October 2016, 1092-1096.
|
|
[14]
|
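Cao, L., Wan, W.G. and Hou, L. (2020) Research on a Video Description Generation Algorithm Based on Multiple Features. Electronic Measurement Technology, 43(16), 99-103. (In Chinese)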
|
|
[15]
|
He, K., Zhang, X., Ren, S. and Sun, J. (2016) Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 27-30 June 2016, 770-778.
|
|
[16]
|
Tran, D., Bourdev, L., Fergus, R., et al. (2015) Learning Spatiotemporal Features with 3D Convolutional Networks. 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 7-13 December 2015, 4489-4497.
|
|
[17]
|
Yi, B., Yang, Y., Shen, F., et al. (2016) Bidirectional Long-Short Term Memory for Video Description. ACM on Multimedia Conference, Amsterdam, 15-19 October 2016, 436-440.
|
|
[18]
|
Peris, Á., Bolaños, M., Radeva, P. and Casacuberta, F. (2016) Video Description Using Bidirectional Recurrent Neural Networks. International Conference on Artificial Neural Networks, Barcelona, 6-9 September 2016, 3-11.
|
|
[19]
|
Cho, K., Van Merrienboer, B., Bahdanau, D., et al. (2014) On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. Proceedings of SSST-8, 8th Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, 25 October 2014, 103-111.
|