CMPTA: Exploring the Application of Pre-Trained Large Language Models in Multimodal Sentiment Analysis
DOI: 10.12677/csa.2026.161023 (supported by research project funding)
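Authors: Li Zhihao (李志豪), Zhi Yu (智宇): College of Computer Science and Artificial Intelligence, Research Center for Metaverse and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang; Chen Ang (陈昂): College of Computer Science and Artificial Intelligence, Research Center for Metaverse and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang; Institute of Metaverse and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang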
Keywords: Multimodal Sentiment Analysis; Large Language Models; Pseudo-Token; Parameter-Efficient Fine-Tuning (PEFT); Cross-Modal Adapter
Abstract: Large language models (LLMs) have made remarkable progress in natural language processing, yet adapting them effectively to multimodal sentiment analysis (MSA) remains a significant challenge. The core difficulty lies in bridging the gap between heterogeneous modal features (e.g., visual and acoustic) and the semantic space of pre-trained text LLMs. Existing approaches often rely on complex deep fusion networks or expensive full fine-tuning, and so fail to fully exploit the reasoning and generalization capabilities of LLMs. To address this, we propose a lightweight Cross-Modal Pseudo-Token Adapter (CMPTA). Rather than modifying the LLM's original parameters, CMPTA uses an efficient attention mechanism to transform non-textual modal features into "pseudo-tokens" that the LLM can interpret, and injects them into the text input sequence as soft prompts, thereby achieving deep alignment between multimodal information and textual semantics. We also systematically investigate how the number of pseudo-tokens affects semantic alignment. Experimental results show that CMPTA effectively elicits the multimodal sentiment understanding capability of LLMs and outperforms state-of-the-art baselines, validating the effectiveness and generalization ability of the framework.
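The pseudo-token idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `PseudoTokenAdapter`, the use of learnable query vectors with single-head cross-attention, the random initialization, the toy dimensions, and the choice to prepend the pseudo-tokens to the text embeddings are all assumptions made for illustration. The sketch only shows how a variable-length sequence of non-text features might be compressed into a fixed number of vectors in the LLM's embedding dimension and joined to the text sequence while the LLM itself stays untouched.

```python
import math
import random


def rand_matrix(rows, cols, rng, scale=0.1):
    """Return a rows x cols matrix (list of rows) filled with Gaussian noise."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


class PseudoTokenAdapter:
    """Hypothetical adapter: k query vectors cross-attend over non-text features."""

    def __init__(self, feat_dim, model_dim, num_pseudo_tokens, seed=0):
        rng = random.Random(seed)
        self.model_dim = model_dim
        # In a real system these three matrices would be the only trained parameters.
        self.queries = rand_matrix(num_pseudo_tokens, model_dim, rng)
        self.w_key = rand_matrix(feat_dim, model_dim, rng)
        self.w_value = rand_matrix(feat_dim, model_dim, rng)

    def attention_weights(self, features):
        """Return a (k, T) matrix: one distribution over T frames per query."""
        keys = matmul(features, self.w_key)  # (T, model_dim)
        scale = math.sqrt(self.model_dim)
        return [
            softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
            for query in self.queries
        ]

    def __call__(self, features):
        """Map (T, feat_dim) features to (k, model_dim) pseudo-tokens."""
        values = matmul(features, self.w_value)  # (T, model_dim)
        return matmul(self.attention_weights(features), values)


def inject_soft_prompt(pseudo_tokens, text_embeddings):
    """Assumed injection scheme: prepend pseudo-tokens to the text embeddings."""
    return [list(r) for r in pseudo_tokens] + [list(r) for r in text_embeddings]


# Toy data: 20 audio frames of dimension 8, 5 text tokens of dimension 16.
data_rng = random.Random(1)
audio = rand_matrix(20, 8, data_rng, scale=1.0)
text = rand_matrix(5, 16, data_rng, scale=1.0)

adapter = PseudoTokenAdapter(feat_dim=8, model_dim=16, num_pseudo_tokens=4)
pseudo = adapter(audio)
sequence = inject_soft_prompt(pseudo, text)
print(len(pseudo), len(sequence), len(sequence[0]))
```

In this sketch the number of pseudo-tokens is the single argument `num_pseudo_tokens`, which is the quantity whose effect on semantic alignment the abstract says is studied; the frozen LLM would consume `sequence` in place of the plain text embeddings.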
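Citation: Li, Z.H., Zhi, Y. and Chen, A. (2026) CMPTA: Exploring the Application of Pre-Trained Large Language Models in Multimodal Sentiment Analysis. Computer Science and Application, 16(1), 281-294. https://doi.org/10.12677/csa.2026.161023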
