[1]
|
Alaba, S.Y. and Ball, J.E. (2024) Transformer-Based Optimized Multimodal Fusion for 3D Object Detection in Autonomous Driving. IEEE Access, 12, 50165-50176. [Google Scholar] [CrossRef]
|
[2]
|
Bolt, R.A. (1980) “Put-That-There”: Voice and Gesture at the Graphics Interface. Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques, Seattle, 14-18 July 1980, 262-270. [Google Scholar] [CrossRef]
|
[3]
|
Nilsson, N.J. (1984) Shakey the Robot. https://www.semanticscholar.org/paper/Shakey-the-Robot-Nilsson/476ba2a1c5204d46e420506afacb4b0da6abb868
|
[4]
|
Nefian, A.V., Liang, L., Pi, X., Liu, X. and Murphy, K. (2002) Dynamic Bayesian Networks for Audio-Visual Speech Recognition. EURASIP Journal on Advances in Signal Processing, 2002, 1-15. [Google Scholar] [CrossRef]
|
[5]
|
Bengio, S. (2003) Multimodal Authentication Using Asynchronous HMMs. In: Lecture Notes in Computer Science, Springer, 770-777. [Google Scholar] [CrossRef]
|
[6]
|
Reiter, S., Schuller, B. and Rigoll, G. (2007) Hidden Conditional Random Fields for Meeting Segmentation. 2007 IEEE International Conference on Multimedia and Expo, Beijing, 2-5 July 2007, 639-642. [Google Scholar] [CrossRef]
|
[7]
|
Baltrusaitis, T., Banda, N. and Robinson, P. (2013) Dimensional Affect Recognition Using Continuous Conditional Random Fields. 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Shanghai, 22-26 April 2013, 1-8. [Google Scholar] [CrossRef]
|
[8]
|
Blum, A. and Mitchell, T. (1998) Combining Labeled and Unlabeled Data with Co-Training. Proceedings of the Eleventh Annual Conference on Computational Learning Theory, Madison, 24-26 July 1998, 92-100. [Google Scholar] [CrossRef]
|
[9]
|
Guillaumin, M., Verbeek, J. and Schmid, C. (2010) Multimodal Semi-Supervised Learning for Image Classification. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, 13-18 June 2010, 902-909.
|
[10]
|
Poh, N., Kittler, J. and Rattani, A. (2014) Handling Session Mismatch by Fusion-Based Co-Training: An Empirical Study Using Face and Speech Multimodal Biometrics. 2014 IEEE Symposium on Computational Intelligence in Biometrics and Identity Management (CIBIM), Orlando, 9-12 December 2014, 81-86. [Google Scholar] [CrossRef]
|
[11]
|
Chakravarty, P., Zegers, J., Tuytelaars, T. and Van Hamme, H. (2016) Active Speaker Detection with Audio-Visual Co-Training. Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, 12-16 November 2016, 312-316. [Google Scholar] [CrossRef]
|
[12]
|
Sikka, K., Dykstra, K., Sathyanarayana, S., Littlewort, G. and Bartlett, M. (2013) Multiple Kernel Learning for Emotion Recognition in the Wild. Proceedings of the 15th ACM on International Conference on Multimodal Interaction, Sydney, 9-13 December 2013, 517-524. [Google Scholar] [CrossRef]
|
[13]
|
Liu, F., Zhou, L., Shen, C., et al. (2013) Multiple Kernel Learning in the Primal for Multimodal Alzheimer’s Disease Classification. IEEE Journal of Biomedical and Health Informatics, 18, 984-990. [Google Scholar] [CrossRef] [PubMed]
|
[14]
|
Shaik, T., Tao, X., Li, L., Xie, H. and Velásquez, J.D. (2024) A Survey of Multimodal Information Fusion for Smart Healthcare: Mapping the Journey from Data to Wisdom. Information Fusion, 102, Article ID: 102040. [Google Scholar] [CrossRef]
|
[15]
|
Sargin, M.E., Yemez, Y., Erzin, E. and Tekalp, A.M. (2007) Audiovisual Synchronization and Fusion Using Canonical Correlation Analysis. IEEE Transactions on Multimedia, 9, 1396-1403. [Google Scholar] [CrossRef]
|
[16]
|
Correa, N.M., Li, Y.O., Adali, T., et al. (2008) Canonical Correlation Analysis for Feature-Based Fusion of Biomedical Imaging Modalities and Its Application to Detection of Associative Networks in Schizophrenia. IEEE Journal of Selected Topics in Signal Processing, 2, 998-1007. [Google Scholar] [CrossRef] [PubMed]
|
[17]
|
Gao, L., Zhang, R., Qi, L., Chen, E. and Guan, L. (2018) The Labeled Multiple Canonical Correlation Analysis for Information Fusion. IEEE Transactions on Multimedia, 21, 375-387. [Google Scholar] [CrossRef]
|
[18]
|
Ramachandram, D. and Taylor, G.W. (2017) Deep Multimodal Learning: A Survey on Recent Advances and Trends. IEEE Signal Processing Magazine, 34, 96-108. [Google Scholar] [CrossRef]
|
[19]
|
Ngiam, J., Khosla, A., Kim, M., et al. (2011) Multimodal Deep Learning. 2011 International Conference on Machine Learning, Bellevue, 28 June-2 July 2011, 689-696.
|
[20]
|
Nguyen, T.L., Kavuri, S. and Lee, M. (2019) A Multimodal Convolutional Neuro-Fuzzy Network for Emotion Understanding of Movie Clips. Neural Networks, 118, 208-219. [Google Scholar] [CrossRef] [PubMed]
|
[21]
|
Joze, H.R.V., Shaban, A., Iuzzolino, M.L., et al. (2020) MMTM: Multimodal Transfer Module for CNN Fusion. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, 13-19 June 2020, 13289-13299.
|
[22]
|
Kuttala, R., Subramanian, R. and Oruganti, V.R.M. (2023) Multimodal Hierarchical CNN Feature Fusion for Stress Detection. IEEE Access, 11, 6867-6878. [Google Scholar] [CrossRef]
|
[23]
|
Rastgoo, M.N., Nakisa, B., Maire, F., Rakotonirainy, A. and Chandran, V. (2019) Automatic Driver Stress Level Classification Using Multimodal Deep Learning. Expert Systems with Applications, 138, Article ID: 112793. [Google Scholar] [CrossRef]
|
[24]
|
Geethanjali, R. and Valarmathi, A. (2024) A Novel Hybrid Deep Learning IChOA-CNN-LSTM Model for Modality-Enriched and Multilingual Emotion Recognition in Social Media. Scientific Reports, 14, Article No. 2270. [Google Scholar] [CrossRef] [PubMed]
|
[25]
|
Hosseini, S.S., Yamaghani, M.R. and Poorzaker Arabani, S. (2024) Multimodal Modelling of Human Emotion Using Sound, Image and Text Fusion. Signal, Image and Video Processing, 18, 71-79. [Google Scholar] [CrossRef]
|
[26]
|
Sano, A., Chen, W., Lopez-Martinez, D., Taylor, S. and Picard, R.W. (2018) Multimodal Ambulatory Sleep Detection Using LSTM Recurrent Neural Networks. IEEE Journal of Biomedical and Health Informatics, 23, 1607-1617. [Google Scholar] [CrossRef] [PubMed]
|
[27]
|
Narayanan, A., Siravuru, A. and Dariush, B. (2019) Temporal Multimodal Fusion for Driver Behavior Prediction Tasks Using Gated Recurrent Fusion Units. https://openreview.net/forum?id=9PkIjGpDul
|
[28]
|
Amiriparian, S., Christ, L., Kathan, A., et al. (2024) The Muse 2024 Multimodal Sentiment Analysis Challenge: Social Perception and Humor Recognition. Proceedings of the 5th on Multimodal Sentiment Analysis Challenge and Workshop: Social Perception and Humor, Melbourne, 28 October 2024, 1-9.
|
[29]
|
Qin, Z., Luo, Q., Zang, Z. and Fu, H. (2025) Multimodal GRU with Directed Pairwise Cross-Modal Attention for Sentiment Analysis. Scientific Reports, 15, Article No. 10112. [Google Scholar] [CrossRef] [PubMed]
|
[30]
|
Praveen, R.G., Granger, E. and Cardinal, P. (2023) Recursive Joint Attention for Audio-Visual Fusion in Regression Based Emotion Recognition. 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, 4-10 June 2023, 1-5.
|
[31]
|
Chen, T., Hong, R., Guo, Y., Hao, S. and Hu, B. (2022) MS²-GNN: Exploring GNN-Based Multimodal Fusion Network for Depression Detection. IEEE Transactions on Cybernetics, 53, 7749-7759. [Google Scholar] [CrossRef] [PubMed]
|
[32]
|
Zhao, F., Zhang, C. and Geng, B. (2024) Deep Multimodal Data Fusion. ACM Computing Surveys, 56, 1-36. [Google Scholar] [CrossRef]
|
[33]
|
Hu, J., Liu, Y., Zhao, J. and Jin, Q. (2021) MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, August 2021, 5666-5675. [Google Scholar] [CrossRef]
|
[34]
|
Ma, X., Ning, F., Xu, X., Shan, J., Li, H., Tian, X., et al. (2024) Survival Prediction for Non-Small Cell Lung Cancer Based on Multimodal Fusion and Deep Learning. IEEE Access, 12, 123236-123249. [Google Scholar] [CrossRef]
|
[35]
|
Ding, C., Sun, S. and Zhao, J. (2023) MST-GAT: A Multimodal Spatial-Temporal Graph Attention Network for Time Series Anomaly Detection. Information Fusion, 89, 527-536. [Google Scholar] [CrossRef]
|
[36]
|
Liang, S., Zhu, A., Zhang, J. and Shao, J. (2023) Hyper-Node Relational Graph Attention Network for Multi-Modal Knowledge Graph Completion. ACM Transactions on Multimedia Computing, Communications, and Applications, 19, 1-21. [Google Scholar] [CrossRef]
|
[37]
|
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017) Attention Is All You Need. Advances in Neural Information Processing Systems, 30, 5998-6008.
|
[38]
|
Sun, H., Liu, J., Chai, S., Qiu, Z., Lin, L., Huang, X., et al. (2021) Multi-Modal Adaptive Fusion Transformer Network for the Estimation of Depression Level. Sensors, 21, Article 4764. [Google Scholar] [CrossRef] [PubMed]
|
[39]
|
Tsai, Y.H., Bai, S., Liang, P.P., Kolter, J.Z., Morency, L. and Salakhutdinov, R. (2019) Multimodal Transformer for Unaligned Multimodal Language Sequences. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, July 2019, 6558-6569. [Google Scholar] [CrossRef] [PubMed]
|
[40]
|
Roy, S.K., Deria, A., Hong, D., Rasti, B., Plaza, A. and Chanussot, J. (2023) Multimodal Fusion Transformer for Remote Sensing Image Classification. IEEE Transactions on Geoscience and Remote Sensing, 61, 1-20. [Google Scholar] [CrossRef]
|
[41]
|
Tian, Y., Wang, Z., Chen, D., et al. (2024) TriCAFFNet: A Tri-Cross-Attention Transformer with a Multi-Feature Fusion Network for Facial Expression Recognition. Sensors, 24, Article 5391.
|
[42]
|
Zhao, B., Gong, M. and Li, X. (2022) Hierarchical Multimodal Transformer to Summarize Videos. Neurocomputing, 468, 360-369. [Google Scholar] [CrossRef]
|
[43]
|
Sun, C., Myers, A., Vondrick, C., Murphy, K. and Schmid, C. (2019) VideoBERT: A Joint Model for Video and Language Representation Learning. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, 27 October-2 November 2019, 7463-7472. [Google Scholar] [CrossRef]
|
[44]
|
Lei, J., Li, L., Zhou, L., Gan, Z., Berg, T.L., Bansal, M., et al. (2021) Less Is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, 20-25 June 2021, 7331-7341. [Google Scholar] [CrossRef]
|
[45]
|
Team, G., Anil, R., Borgeaud, S., et al. (2023) Gemini: A Family of Highly Capable Multimodal Models. arXiv: 2312.11805.
|
[46]
|
Team, C. (2024) Chameleon: Mixed-Modal Early-Fusion Foundation Models. arXiv: 2405.09818.
|
[47]
|
Xu, P., Zhu, X. and Clifton, D.A. (2023) Multimodal Learning with Transformers: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 12113-12132. [Google Scholar] [CrossRef] [PubMed]
|
[48]
|
Zhuge, M., Gao, D., Fan, D., Jin, L., Chen, B., Zhou, H., et al. (2021) Kaleido-BERT: Vision-Language Pre-Training on Fashion Domain. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, 20-25 June 2021, 12642-12652. [Google Scholar] [CrossRef]
|