Pseudo-Peak Suppression for MVDR Sound Source Localization Based on U-Net
Abstract: Sound source localization in reverberant indoor environments remains a difficult problem in acoustic signal processing. When the conventional minimum variance distortionless response (MVDR) beamformer computes the spatial power spectrum, multipath reflections often produce numerous pseudo-peaks at non-source locations. These pseudo-peaks can mask the true source or be mistaken for it, which severely limits source identification in complex environments. This paper proposes a post-processing method for the MVDR spatial spectrum based on a U-Net deep learning network, recasting pseudo-peak suppression as an image denoising task. The method takes the reverberant MVDR spatial spectrum as input and uses an improved U-Net to learn the mapping from the observed spectrum to the ideal spectrum. The network incorporates residual structures, a spatial attention mechanism, and a noise suppression module, and a composite loss function is designed that combines global reconstruction, source enhancement, and pseudo-peak suppression terms. Simulation experiments show that the method removes reverberation-induced spurious artifacts while preserving the structure of the true source, which lowers the risk of false detections and improves source discrimination and localization robustness in complex sound fields.
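For readers unfamiliar with the spectrum that the network post-processes, the following is a minimal sketch of the standard MVDR (Capon) spatial power spectrum, P(θ) = 1 / (aᴴ(θ) R⁻¹ a(θ)). It is not the paper's implementation. The uniform linear array, the far-field narrowband model, the half-wavelength spacing, the single simulated source without reverberation, and all function names are assumptions made only for illustration.

```python
import cmath
import math


def steering_vector(theta_deg, num_mics, spacing_over_wavelength=0.5):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    phase = 2.0 * math.pi * spacing_over_wavelength * math.sin(math.radians(theta_deg))
    return [cmath.exp(-1j * phase * m) for m in range(num_mics)]


def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0j] * n
    for row in range(n - 1, -1, -1):
        acc = aug[row][n] - sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = acc / aug[row][row]
    return x


def mvdr_power(cov, steer):
    """MVDR (Capon) power for one scan direction: 1 / (a^H R^{-1} a)."""
    r_inv_a = solve_linear(cov, steer)
    denom = sum(a.conjugate() * v for a, v in zip(steer, r_inv_a))
    return 1.0 / denom.real


def simulated_covariance(source_deg, num_mics, source_power=1.0, noise_power=0.1):
    """Idealized covariance (assumption): R = Ps * a a^H + sigma^2 * I."""
    a = steering_vector(source_deg, num_mics)
    return [
        [source_power * a[i] * a[j].conjugate() + (noise_power if i == j else 0.0)
         for j in range(num_mics)]
        for i in range(num_mics)
    ]


NUM_MICS = 8       # hypothetical array size
TRUE_ANGLE = 20    # hypothetical source direction in degrees

cov = simulated_covariance(TRUE_ANGLE, NUM_MICS)
angles = list(range(-90, 91))
spectrum = [mvdr_power(cov, steering_vector(t, NUM_MICS)) for t in angles]
peak_angle = angles[spectrum.index(max(spectrum))]
print("estimated direction (deg):", peak_angle)
```

In this reverberation-free setting the spectrum has a single peak at the simulated source direction. Coherent reflections would add further terms to the covariance matrix, which is where the pseudo-peaks addressed by the proposed method would come from.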
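Cite this article: Jiang Qinyu, Li Honglian, Xiao Yao, Ren Zhiwen, Wu Xinyi. Pseudo-Peak Suppression for MVDR Sound Source Localization Based on U-Net [J]. Artificial Intelligence and Robotics Research, 2026, 15(2): 581-592. https://doi.org/10.12677/airr.2026.152056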
