Research on an Improved TF-GridNet-Based Method for Music Source Separation
Abstract: Music source separation aims to extract individual source components (such as vocals, drums, and bass) from a mixed audio signal, and has broad application value in music production, intelligent speech processing, and audio editing. TF-GridNet is a recently proposed deep network architecture that combines time and frequency modeling and has demonstrated strong modeling capability. Building on TF-GridNet, this paper proposes three structural improvements: 1) a lightweight channel attention mechanism in the time modeling path, to strengthen the model's response to critical signals; 2) a residual gating mechanism in the GridBlock module, to make feature flow and fusion more flexible; 3) multi-scale reconstruction paths in the decoder, to improve the restoration of high-frequency detail. Experiments on the MUSDB18 dataset show that the improved model outperforms the original TF-GridNet on multiple source types, with better perceptual quality and computational efficiency.
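As a rough illustration of the first two improvements, the sketch below shows one possible form of a lightweight channel attention step and a residual gate, written in NumPy. It is a minimal sketch under stated assumptions, not the paper's implementation: the ECA-style design (global average pooling over time followed by a small 1-D filter across channels), the function names `channel_attention` and `residual_gate`, and the `(channels, time)` feature layout are all hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, kernel):
    """ECA-style channel attention (assumed design, not taken from the paper).

    x:      features of shape (channels, time)
    kernel: odd-length 1-D weights applied across neighbouring channels
    """
    pooled = x.mean(axis=1)                 # one descriptor per channel
    pad = len(kernel) // 2
    padded = np.pad(pooled, pad)            # zero-pad so output keeps C entries
    scores = np.array([
        padded[i:i + len(kernel)] @ kernel  # local cross-channel interaction
        for i in range(len(pooled))
    ])
    weights = sigmoid(scores)               # per-channel weights in (0, 1)
    return x * weights[:, None]             # rescale every channel over time


def residual_gate(x, branch_out, gate_logits):
    """Gated residual connection: the skip path x is always kept, and the
    branch output is scaled by a gate in (0, 1) before being added."""
    return x + sigmoid(gate_logits) * branch_out


# Toy usage with random features; in a real model the kernel and the gate
# logits would be learned parameters or outputs of learned layers.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 16))
attended = channel_attention(features, kernel=np.array([0.2, 0.6, 0.2]))
fused = residual_gate(features, attended, gate_logits=np.zeros_like(features))
print(attended.shape, fused.shape)
```

With the gate logits driven strongly negative, the gate closes and the block reduces to the identity mapping, which is one reason gated residual connections are often considered easy to optimize.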
Citation: Ke Aipeng, Zhang Xuedian. Research on an Improved TF-GridNet-Based Method for Music Source Separation [J]. Modeling and Simulation, 2025, 14(5): 768-778. https://doi.org/10.12677/mos.2025.145432
