A Speech Emotion Recognition Model Based on Gated Recurrent Units (GRU) and Multi-Head Attention Mechanism

doi:10.12677/airr.2024.132038

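Authors: Guo Fengchan (郭凤婵), Wu Yiliang* (吴毅良), Luo Xuliang (罗序良), Liu Cuimei (刘翠媚): Jiangmen Power Supply Bureau, Guangdong Power Grid Co., Ltd., Jiangmen, Guangdong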
Keywords: Speech Emotion Recognition (SER); Gated Recurrent Units (GRU); Multi-Head Attention Mechanism; Bi-GRU; Deep Learning

Abstract: This study proposes a speech emotion recognition (SER) model based on gated recurrent units (GRU) and a multi-head attention mechanism. Building on advances in artificial intelligence and affective computing, the model analyzes the emotional information carried in speech signals to identify the speaker's emotional state, such as joy, anger, or sadness. This technology has broad application prospects in affective intelligence, intelligent customer service, and human-computer interaction. The model combines the GRU's ability to process temporal information with the multi-head attention mechanism's ability to focus on important features, yielding an effective and accurate speech emotion recognition model. In experiments, the model achieved an unweighted accuracy of 81.04% on the IEMOCAP dataset and 94.93% on the Emo-DB dataset, a significant improvement over existing models. The model also shows good generalization performance and scalability, providing reliable technical support for intelligent speech interaction, affective computing, and related fields.
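
The abstract reports results as unweighted accuracy but does not define the metric. In the SER literature it is conventionally the mean of per-class recalls, so that rare emotion classes count as much as frequent ones. The sketch below assumes that convention; the function name `unweighted_accuracy` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (assumed definition of unweighted accuracy)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")

    total = defaultdict(int)    # number of utterances per true class
    correct = defaultdict(int)  # correctly recognized utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    # Recall of each class, then a plain (unweighted) average over classes.
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: an imbalanced set with one frequent class.
y_true = ["happy", "happy", "happy", "happy", "sad", "angry"]
y_pred = ["happy", "happy", "happy", "sad", "sad", "happy"]

ua = unweighted_accuracy(y_true, y_pred)
# Overall (weighted) accuracy, shown only for comparison.
wa = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UA = {ua:.4f}, WA = {wa:.4f}")
```

On imbalanced corpora the two figures can diverge noticeably, which is one reason SER papers often report the unweighted variant.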

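Citation: Guo Fengchan, Wu Yiliang, Luo Xuliang, Liu Cuimei. A Speech Emotion Recognition Model Based on Gated Recurrent Units (GRU) and Multi-Head Attention Mechanism [J]. Artificial Intelligence and Robotics Research (人工智能与机器人研究), 2024, 13(2): 363-374. https://doi.org/10.12677/airr.2024.132038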
