Clustering-Based Optimization and Cross-Modal Collaborative Learning for Multimodal Sentiment Analysis
DOI: 10.12677/mos.2026.155071
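Authors: Xu Jinlin (徐金麟), Wei Yun (魏赟): School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai
Keywords: Multimodal Emotion Recognition, Cross-Modal Collaborative Learning, Cross-Attention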
Abstract: Multimodal emotion recognition aims to understand emotion more accurately by fusing information from multiple modalities such as text, video, and audio. Existing methods tend to introduce redundant features when capturing cross-modal complementary information, and they struggle to establish effective cross-modal emotional correlations. To address both problems, we propose a hierarchical collaborative learning framework that combines feature clustering optimization with a cross-modal collaborative learning mechanism. A clustering algorithm first groups the multimodal features and, together with an attention-based weight allocation mechanism, reduces redundancy and emphasizes salient features. A cross-modal collaborative learning module then uses cross-attention to perform text-guided initial learning followed by mutual guidance learning between the audio and video modalities, which strengthens the multimodal representation. Experiments on the public MOSI and MOSEI datasets show that the proposed method achieves competitive or leading performance on multiple evaluation metrics, supporting its effectiveness for multimodal emotion recognition.
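The two stages named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the clustering algorithm, the attention scoring, or the direction of guidance, so everything here is an assumption. The sketch uses plain k-means, scores each centroid against the mean feature vector, applies cross-attention without learned projections, and lets text supply the keys and values in the text-guided step. The names `kmeans`, `cluster_pool`, `cross_attention`, and `collaborative_fusion` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kmeans(x, k, iters=10, seed=0):
    # Plain Lloyd's k-means (an assumed choice of clustering algorithm).
    # x: (T, d) sequence of feature vectors, with k <= T.
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def cluster_pool(x, k):
    # Compress T frames into k centroids, then weight each centroid by an
    # attention score against the mean feature vector (assumed scoring rule).
    centroids = kmeans(x, k)
    query = x.mean(axis=0)
    weights = softmax(centroids @ query / np.sqrt(x.shape[1]))
    return centroids * weights[:, None], weights

def cross_attention(query_seq, context_seq):
    # Scaled dot-product attention: queries from one modality, keys and
    # values from another. Learned projections are omitted for brevity.
    d = query_seq.shape[1]
    attn = softmax(query_seq @ context_seq.T / np.sqrt(d), axis=-1)
    return attn @ context_seq

def collaborative_fusion(text, audio, video, k=4):
    # Stage 1: clustering-based redundancy reduction for audio and video.
    a_c, _ = cluster_pool(audio, k)
    v_c, _ = cluster_pool(video, k)
    # Stage 2a: text-guided learning (text provides keys and values).
    a_t = a_c + cross_attention(a_c, text)
    v_t = v_c + cross_attention(v_c, text)
    # Stage 2b: mutual guidance between audio and video.
    a_v = a_t + cross_attention(a_t, v_t)
    v_a = v_t + cross_attention(v_t, a_t)
    # Mean-pool each stream and concatenate into one vector of size 3 * d.
    return np.concatenate([text.mean(0), a_v.mean(0), v_a.mean(0)])
```

Calling `collaborative_fusion(text, audio, video)` on three arrays of shapes `(T_text, d)`, `(T_audio, d)`, and `(T_video, d)` returns a single vector of length `3 * d` that a regression or classification head could consume. The residual additions and the mean pooling are illustrative choices, not details taken from the paper.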
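Cite this article: Xu Jinlin, Wei Yun. Clustering-Based Optimization and Cross-Modal Collaborative Learning for Multimodal Sentiment Analysis[J]. Modeling and Simulation (建模与仿真), 2026, 15(5): 60-71. https://doi.org/10.12677/mos.2026.155071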
