Multimodal Sentiment Analysis Based on Hierarchical Distance-Aware Contrastive Learning
DOI: 10.12677/csa.2025.155134. Supported by research project funding.
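Authors: Lü Xinyang* (吕欣阳), Jin Yuanyuan (金媛媛), Han Xu (韩旭), Yang Ming (杨明): School of Information and Control Engineering, Shenyang Urban Construction University, Shenyang, Liaoning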
Keywords: Multimodal Sentiment Analysis; Cross-Modal Attention Mechanism; Contrastive Learning
Abstract: Multimodal sentiment analysis (MSA) uses visual, textual, and audio data to improve the accuracy of sentiment analysis. Although multimodal information provides richer context, effectively handling the interaction and fusion of heterogeneous modalities remains an important challenge. To address this, this paper proposes an MSA method based on hierarchical distance-aware contrastive learning (HDACL). Specifically, HDACL introduces a cross-modal attention mechanism so that the different modalities interact fully. In addition, we design a contrastive learning strategy guided by differences in sentiment-intensity distance, which further strengthens the consistent alignment of the multimodal data. Validated on the CMU-MOSI multimodal sentiment analysis dataset, HDACL improves the Acc-2 and Acc-7 metrics by 0.7% and 0.8%, respectively.
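The two ingredients named in the abstract can be illustrated with a minimal NumPy sketch. It is not the HDACL implementation, whose exact formulation the abstract does not give. The function names, the temperature, and the weighting scheme are all assumptions made for illustration: cross-modal attention is shown as plain scaled dot-product attention in which one modality supplies the queries and another the keys and values, and the distance-aware loss is an InfoNCE-style objective in which each negative pair is weighted by the normalized gap between the two samples' sentiment-intensity labels (assumed to lie in [-3, 3], as in CMU-MOSI).

```python
import numpy as np


def cross_modal_attention(query_seq, context_seq):
    """Scaled dot-product attention across modalities (illustrative only).

    query_seq:   (Tq, d) features of the attending modality, e.g. text.
    context_seq: (Tk, d) features of the attended modality, e.g. audio.
    Returns a (Tq, d) array of context features re-weighted per query step.
    """
    d = query_seq.shape[-1]
    scores = query_seq @ context_seq.T / np.sqrt(d)       # (Tq, Tk)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ context_seq


def distance_aware_contrastive_loss(z_a, z_b, labels, temperature=0.1, max_dist=6.0):
    """Hypothetical distance-weighted InfoNCE loss between two modalities.

    z_a, z_b: (N, d) embeddings of the same N samples in two modalities.
    labels:   (N,) sentiment intensities; max_dist is the assumed label range.
    The positive for sample i is its own pair (i, i). Each negative (i, j)
    is weighted by |y_i - y_j| / max_dist, so samples with similar
    sentiment intensity are pushed apart less than dissimilar ones.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    sim = z_a @ z_b.T / temperature                        # (N, N)
    exp_sim = np.exp(sim - sim.max(axis=1, keepdims=True))
    dist = np.abs(labels[:, None] - labels[None, :]) / max_dist
    pos = np.diag(exp_sim)
    neg = (dist * exp_sim).sum(axis=1)  # diagonal weight is 0, so positives drop out
    return float(np.mean(-np.log(pos / (pos + neg))))


rng = np.random.default_rng(0)
text_feats = rng.normal(size=(5, 8))    # 5 text steps, 8-dim features
audio_feats = rng.normal(size=(7, 8))   # 7 audio steps, 8-dim features
fused = cross_modal_attention(text_feats, audio_feats)    # shape (5, 8)

z_text = rng.normal(size=(4, 8))
z_audio = rng.normal(size=(4, 8))
intensities = np.array([-3.0, -1.0, 1.0, 3.0])
loss = distance_aware_contrastive_loss(z_text, z_audio, intensities)
```

Under this weighting, a batch whose samples all share the same intensity contributes zero loss, because every negative weight vanishes. A real system would more likely use learned projections and multi-head attention.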
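Citation: Lü Xinyang, Jin Yuanyuan, Han Xu, Yang Ming. Multimodal Sentiment Analysis Based on Hierarchical Distance-Aware Contrastive Learning [J]. Computer Science and Application, 2025, 15(5): 615-623. https://doi.org/10.12677/csa.2025.155134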
