Model of Video Summarization Integrating GRU and Non-Maximum Suppression
Abstract: Existing video summarization models suffer from high computational cost, significant performance loss caused by redundant frames, and unstable results. To address these problems, this paper proposes a video summarization model that integrates GRU and non-maximum suppression. The model captures the feature relationships between video frames. In the frame-level importance scoring module, a Seq2Seq model incorporating GRU and an attention mechanism is proposed, which strengthens the influence of temporal feature relationships between frames, reduces the computational cost of the model, and speeds up convergence during backpropagation. In the summary generation module, a key sequence generation algorithm based on non-maximum suppression is proposed, which effectively removes redundant frames. Comparisons with current mainstream video summarization models on multiple datasets show that the proposed model improves, to varying degrees, on both evaluation metrics, F-score and KFRR, indicating that the generated summaries capture video content better and that the model remains stable under a variety of data conditions.
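
The abstract names non-maximum suppression as the mechanism for removing redundant frames but does not spell out the algorithm. As a rough illustration only, the sketch below applies a generic greedy one-dimensional non-maximum suppression to frame-level importance scores. The function name `temporal_nms` and the parameters `window` and `max_keep` are assumptions made for this example and are not taken from the paper, whose key sequence generation algorithm may differ.

```python
from typing import List, Sequence


def temporal_nms(scores: Sequence[float], window: int, max_keep: int) -> List[int]:
    """Greedy 1-D non-maximum suppression over frame-level importance scores.

    Hypothetical helper for illustration, not the paper's algorithm.
    `window` (suppression radius in frames) and `max_keep` (summary length
    in frames) are assumed parameters.
    """
    if max_keep <= 0:
        return []
    # Visit frames from highest to lowest score; ties go to the earlier frame.
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    suppressed = [False] * len(scores)
    kept: List[int] = []
    for i in order:
        if suppressed[i]:
            continue
        kept.append(i)
        if len(kept) == max_keep:
            break
        # Suppress every frame within `window` positions of the kept frame,
        # so near-duplicate neighbours cannot be selected later.
        lo = max(0, i - window)
        hi = min(len(scores), i + window + 1)
        for j in range(lo, hi):
            suppressed[j] = True
    # Return the selected frame indices in temporal order.
    return sorted(kept)


frame_scores = [0.1, 0.9, 0.8, 0.2, 0.3, 0.7, 0.6, 0.1]
print(temporal_nms(frame_scores, window=1, max_keep=3))  # prints [1, 3, 5]
```

In this toy input, frames 2 and 6 score highly but sit next to the stronger frames 1 and 5, so they are suppressed, and the lower-scoring but temporally distinct frame 3 is selected instead.
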
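Citation: Chen Zhouyuan, Chen Pinghua, Shen Jianfang. Model of Video Summarization Integrating GRU and Non-Maximum Suppression [J]. Computer Science and Application, 2021, 11(3): 604-617. https://doi.org/10.12677/CSA.2021.113062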
