[1] Yang, H. and Meinel, C. (2014) Content Based Lecture Video Retrieval Using Speech and Video Text Information. IEEE Transactions on Learning Technologies, 7, 142-154.
[2] Owens, A., Wu, J.J., McDermott, J.H., Freeman, W.T. and Torralba, A. (2016) Ambient Sound Provides Supervision for Visual Learning. Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, 11-14 October 2016, 801-816.
[3] Gaver, W.W. (1993) What in the World Do We Hear? An Ecological Approach to Auditory Event Perception. Ecological Psychology, 5, 1-29.
[4] McDermott, J.H. and Simoncelli, E.P. (2011) Sound Texture Perception via Statistics of the Auditory Periphery: Evidence from Sound Synthesis. Neuron, 71, 926-940.
[5] Darwin, C. and Prodger, P. (1998) The Expression of the Emotions in Man and Animals. Oxford University Press, Oxford.
[6] Tian, Y.-I., Kanade, T. and Cohn, J.F. (2001) Recognizing Action Units for Facial Expression Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23, 97-115.
[7] Rouditchenko, A., Boggust, A., Harwath, D., Joshi, D., Thomas, S., Audhkhasi, K., Feris, R., Kingsbury, B., Picheny, M., Torralba, A. and Glass, J. (2020) AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. INTERSPEECH 2021, Brno, 30 August-3 September 2021, 1584-1588.
[8] Dong, J.F. (2018) Research on Relevance Computation in Cross-Modal Retrieval. Doctoral Dissertation, Zhejiang University, Hangzhou. (In Chinese)
[9] Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T. and Saenko, K. (2015) Sequence to Sequence - Video to Text. Proceedings of the IEEE International Conference on Computer Vision, Santiago, 11-18 December 2015, 4534-4542.
[10] Tran, D., Bourdev, L., Fergus, R., Torresani, L. and Paluri, M. (2015) Learning Spatiotemporal Features with 3D Convolutional Networks. IEEE International Conference on Computer Vision, Santiago, 7-13 December 2015, 4489-4497.
[11] Hershey, S., Chaudhuri, S., Ellis, D.P.W., et al. (2017) CNN Architectures for Large-Scale Audio Classification. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, 5-9 March 2017, 131-135.
[12] Simonyan, K. and Zisserman, A. (2014) Very Deep Convolutional Networks for Large-Scale Image Recognition.
[13] Hotelling, H. (1992) Relations between Two Sets of Variates. In: Kotz, S. and Johnson, N.L., Eds., Breakthroughs in Statistics, Springer, New York, 162-190.
[14] Li, D., Dimitrova, N., Li, M., et al. (2003) Multimedia Content Processing through Cross-Modal Association. Proceedings of the Eleventh ACM International Conference on Multimedia, Berkeley, 2-8 November 2003, 604-611.
[15] Liu, J., Xu, C. and Lu, H. (2010) Cross-Media Retrieval: State-of-the-Art and Open Issues. International Journal of Multimedia Intelligence and Security, 1, 33-52.
[16] Rasiwasia, N., Costa Pereira, J., Coviello, E., et al. (2010) A New Approach to Cross-Modal Multimedia Retrieval. Proceedings of the 18th ACM International Conference on Multimedia, Firenze, 25-29 October 2010, 251-260.
[17] Feng, F., Wang, X. and Li, R. (2014) Cross-Modal Retrieval with Correspondence Autoencoder. Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, 3-7 November 2014, 7-16.
[18] Wang, K., Yin, Q., Wang, W., et al. (2016) A Comprehensive Survey on Cross-Modal Retrieval.
[19] Hu, R., Xu, H., Rohrbach, M., et al. (2016) Natural Language Object Retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 27-30 June 2016, 4555-4564.
[20] Kamper, H., Shakhnarovich, G. and Livescu, K. (2018) Semantic Speech Retrieval with a Visually Grounded Model of Untranscribed Speech. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, 18-22 June 2018, 2514-2517.
[21] Mithun, N.C., Panda, R., Papalexakis, E.E., et al. (2018) Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval. Proceedings of the 26th ACM International Conference on Multimedia, Seoul, 22-26 October 2018, 1856-1864.
[22] Carreira, J. and Zisserman, A. (2017) Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 21-26 July 2017, 6299-6308.
[23] Lin, T.Y., Maire, M., Belongie, S., et al. (2014) Microsoft COCO: Common Objects in Context. In: European Conference on Computer Vision, Springer, Cham, 740-755.
[24] Thompson, B. (2000) Canonical Correlation Analysis. In: Grimm, L.G. and Yarnold, P.R., Eds., Reading and Understanding MORE Multivariate Statistics, American Psychological Association, Washington DC, 285-316.
[25] Hwang, S.J. and Grauman, K. (2012) Learning the Relative Importance of Objects from Tagged Images for Retrieval and Cross-Modal Search. International Journal of Computer Vision, 100, 134-153.
[26] Andrew, G., Arora, R., Bilmes, J., et al. (2013) Deep Canonical Correlation Analysis. International Conference on Machine Learning, Atlanta, 17-19 June 2013, 1247-1255.
[27] Prétet, L., Richard, G. and Peeters, G. (2020) Learning to Rank Music Tracks Using Triplet Loss. ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, 4-8 May 2020, 511-515.
[28] Pons, J. and Serra, X. (2019) Musicnn: Pre-Trained Convolutional Neural Networks for Music Audio Tagging.
[29] Prétet, L., Richard, G. and Peeters, G. (2021) Cross-Modal Music-Video Recommendation: A Study of Design Choices. 2021 International Joint Conference on Neural Networks (IJCNN) IEEE, Shenzhen, 18-22 July 2021, 1-9.