Combining Convolution and Attention with Skip Connection in U-Net for Monaural Speech Enhancement
Abstract: Deep learning has brought substantial gains in intelligibility and perceptual quality to monaural speech enhancement. Conventional approaches typically use a U-Net to predict the clean signal from noisy speech, but the skip connections and sequence modeling modules of such models are limited. This study proposes a monaural speech enhancement algorithm that combines convolution and attention in the skip connections of a U-Net. First, the convolution-based skip connection (convolution skip) uses a convolution module with a convolutional gating mechanism to extract the more important local features. Second, the attention-based skip connection (attention skip) combines RoPE position encoding with a graph convolutional network (GCN) to better capture global contextual features. In addition, the Conformer block uses a convolutional gated unit (CGU) and an enhanced multi-head attention mechanism to better model sequence information. The method was evaluated on the VoiceBank-DEMAND dataset, where it achieved improvements on noisy data of 0.6975 in Perceptual Evaluation of Speech Quality (PESQ), 0.0124 in Short-Time Objective Intelligibility (STOI), and 8.4324 in Segmental Signal-to-Noise Ratio (SSNR). The experimental results show that the proposed method outperforms the baseline denoiser method.
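The attention skip described in the abstract relies on rotary position encoding (RoPE). The abstract does not give the paper's implementation, so the following is only a minimal NumPy sketch of the general RoPE idea as it is commonly formulated: each pair of channels is rotated by an angle proportional to the frame index. The function name `rope`, the `base` value of 10000, and the even/odd channel pairing are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def rope(x, base=10000.0):
    """Rotate channel pairs of x by position-dependent angles.

    x has shape (seq_len, dim) with an even dim. This is an illustrative
    sketch of rotary position encoding, not the paper's code.
    """
    seq_len, dim = x.shape
    if dim % 2 != 0:
        raise ValueError("dim must be even so channels can be paired")

    # One rotation frequency per channel pair (assumed geometric schedule).
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)           # (dim/2,)
    # Angle for position p and pair i is p * inv_freq[i].
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=float)
    # Standard 2-D rotation applied to each (even, odd) channel pair.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

Because every position is a pure rotation, vector norms are unchanged, and the dot product between a rotated query at frame m and a rotated key at frame n depends only on the offset between m and n. That relative-position property is the usual motivation for applying such an encoding to queries and keys before attention scores are computed.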
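Cite this article: Tan Yingwei. Combining Convolution and Attention with Skip Connection in U-Net for Monaural Speech Enhancement [J]. Computer Science and Application, 2026, 16(6): 90-102. https://doi.org/10.12677/csa.2026.166211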
