Multimodal Emotion Recognition Based on Feature Fusion with Cross-Attention Mechanism

doi:10.12677/jisp.2025.142016

Author: Wu Duo (吴铎), School of Electrical and Control Engineering, North China University of Technology, Beijing
Keywords: Emotion Recognition; Convolutional Neural Network; Transformer

Abstract: Emotion is a psychological state that arises from the interaction between humans and their environment, and it influences cognition, social interaction, and well-being. This study uses the IEMOCAP database and focuses on emotional expression in real-life settings. The audio, text, and video data are pre-processed; speech, text, and facial-expression features are extracted; and the features are then time-aligned and position-encoded. The cross-attention mechanism of the Transformer fuses these features to capture changes over the time series and to identify four emotion categories. Simulation results verify the efficiency of the model, which also shows higher recognition accuracy than other IEMOCAP-based models.
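The abstract names two building blocks, position encoding and cross-attention fusion, without giving their form. The sketch below illustrates the general technique only and is not the paper's implementation: it assumes the standard sinusoidal encoding and scaled dot-product attention, omits the learned query/key/value projections and multiple heads a real Transformer layer would have, and uses made-up toy features. All function and variable names are hypothetical.

```python
import math


def positional_encoding(length, dim):
    """Standard sinusoidal position encoding, returned as a list of rows."""
    table = []
    for pos in range(length):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(features):
    """Add position encodings element-wise to a (length x dim) feature list."""
    pe = positional_encoding(len(features), len(features[0]))
    return [[f + p for f, p in zip(frow, prow)]
            for frow, prow in zip(features, pe)]


def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows.

    Returns the attended outputs and the attention weights.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[d] for wj, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights


# Toy, made-up features (assumption): 3 text steps and 4 audio steps, dim 4.
text = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.1, 0.0]]
audio = [[0.2, 0.0, 0.1, 0.3], [0.4, 0.4, 0.2, 0.1],
         [0.0, 0.1, 0.5, 0.2], [0.3, 0.2, 0.2, 0.2]]

text_pe, audio_pe = add_positions(text), add_positions(audio)
# Text queries attend over audio keys/values: one fused row per text step.
fused, attn = cross_attention(text_pe, audio_pe, audio_pe)
print(len(fused), len(fused[0]))  # 3 4
```

Because the queries come from one modality and the keys and values from another, the output keeps the query sequence's length while mixing in information from the other stream. This is why the two sequences do not need the same number of time steps at this stage.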

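Cite this article: Wu, D. (2025) Multimodal Emotion Recognition Based on Feature Fusion with Cross-Attention Mechanism. Journal of Image and Signal Processing, 14(2), 162-172. https://doi.org/10.12677/jisp.2025.142016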

