Video Speech Retrieval Model Based on Multimodal Feature Memory
Abstract: Cross-modal retrieval is a major challenge in the multimedia field because different modalities represent data in inconsistent ways. This paper presents a video-speech retrieval model based on a multimodal feature memory bank. The model consists of three modules: a feature extraction module, a multimodal feature mapping and fusion module, and a feature memory bank module. The feature extraction module uses I3D to extract manipulation-action features from video and a Bi-LSTM to extract features from speech. The mapping and fusion module first aligns the features of the two modalities in a common space and then fuses them. The third module introduces two feature memory banks, one for video and one for speech, which are updated continually during both training and testing whenever specific conditions are met. Experiments on our extended version of the MPII Cooking 2 dataset show that the model achieves better video-speech retrieval results.
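Cite this article: Li Jiebo, Chen Junhong, Lin Darun, Yang Zhenguo, Liu Wenyin. Video Speech Retrieval Model Based on Multimodal Feature Memory [J]. Computer Science and Application, 2022, 12(7): 1747-1755. https://doi.org/10.12677/CSA.2022.127176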
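
The abstract names the pipeline stages but gives no implementation details, so the following is only a rough sketch of the general idea: project two modalities into a shared space, keep per-item embeddings in a memory bank, and rank videos against a speech query by cosine similarity. Everything specific here is an assumption for illustration and not the authors' method: the random stand-in features (real ones would come from I3D and a Bi-LSTM), the dimensions 1024, 256 and 128, the `LinearProjector` and `FeatureMemory` classes, and in particular the similarity-threshold, momentum-style update rule, since the abstract only says the memories are updated "according to specific conditions". The fusion step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

class LinearProjector:
    """Hypothetical linear map from a modality-specific space to a shared space."""
    def __init__(self, in_dim, shared_dim):
        self.weight = rng.standard_normal((in_dim, shared_dim)) / np.sqrt(in_dim)

    def __call__(self, features):
        return l2_normalize(features @ self.weight)

class FeatureMemory:
    """Hypothetical memory bank: one unit-norm slot per item, updated conditionally."""
    def __init__(self, embeddings, momentum=0.9, threshold=0.5):
        self.slots = l2_normalize(np.array(embeddings, dtype=float))
        self.momentum = momentum
        self.threshold = threshold

    def update(self, index, embedding):
        """Blend a new embedding into a slot only if it is similar enough (assumed rule)."""
        embedding = l2_normalize(np.asarray(embedding, dtype=float))
        if float(self.slots[index] @ embedding) < self.threshold:
            return False  # condition not met: leave the slot unchanged
        mixed = self.momentum * self.slots[index] + (1 - self.momentum) * embedding
        self.slots[index] = l2_normalize(mixed)
        return True

def rank_videos(speech_embedding, video_memory):
    """Return video indices sorted from most to least similar, plus raw scores."""
    scores = video_memory.slots @ speech_embedding
    return np.argsort(-scores), scores

# Stand-in features; real ones would come from I3D (video) and a Bi-LSTM (speech).
video_features = rng.standard_normal((5, 1024))
speech_features = rng.standard_normal((5, 256))

video_proj = LinearProjector(1024, 128)
speech_proj = LinearProjector(256, 128)

video_memory = FeatureMemory(video_proj(video_features))
speech_embeddings = speech_proj(speech_features)

order, scores = rank_videos(speech_embeddings[0], video_memory)
print(order, scores.round(3))
```

With untrained random projections the ranking is meaningless; in a trained model the projection weights would be learned so that matching video and speech pairs land close together in the shared space.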
