Research on Music Source Separation Technology Based on Deep Learning
DOI: 10.12677/CSA.2022.1212283
Author: Bian Yuren, Tiangong University, Tianjin, China
Keywords: Deep Learning, Source Separation, Music
Abstract: Music source separation technology plays an important role in the music industry. With the development of deep learning, music source separation has changed dramatically, shifting from traditional knowledge-based separation to data-driven separation. This paper divides deep-learning-based music source separation into frequency-domain and time-domain approaches, discusses the principles, advantages, and disadvantages of the corresponding deep learning models, reviews the development history of music source separation datasets, and concludes with an outlook on the further development of music source separation technology.
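To make the frequency-domain pipeline mentioned above concrete, here is a minimal NumPy sketch of its three stages: transform the mixture to a spectrogram (STFT), apply a time-frequency mask, and invert back to a waveform reusing the mixture phase. In real systems the mask is predicted by a neural network; here an ideal low-pass binary mask stands in for the network output, and all function names and parameters are illustrative assumptions, not part of any surveyed model.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Naive STFT: slice the signal into overlapping windowed frames, FFT each."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)  # shape: (frames, freq_bins)

def istft(spec, n_fft=512, hop=128, length=None):
    """Overlap-add inverse of the naive STFT above."""
    win = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    n = (len(frames) - 1) * hop + n_fft
    out = np.zeros(n)
    norm = np.zeros(n)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n_fft] += f * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    out /= np.maximum(norm, 1e-8)  # compensate for the window overlap
    return out[:length] if length else out

def separate(mixture, mask):
    """Frequency-domain separation: mask the mixture spectrogram and invert.
    The mixture phase is reused implicitly (the mask scales complex bins)."""
    return istft(stft(mixture) * mask, length=len(mixture))

# Toy demo: a mixture of a 440 Hz and a 4000 Hz sinusoid; an ideal binary
# mask that keeps only bins below 1 kHz should recover the low tone.
sr = 16000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 440 * t)
high = np.sin(2 * np.pi * 4000 * t)
mix = low + high
freqs = np.fft.rfftfreq(512, 1 / sr)
mask = (freqs < 1000).astype(float)  # broadcasts across frames
est = separate(mix, mask)
```

A network-based separator would replace the hand-made `mask` with one predicted from the magnitude spectrogram; the surrounding STFT/iSTFT plumbing stays the same, which is exactly why frequency-domain methods inherit the phase-reconstruction problem that motivates the time-domain approaches discussed in the paper.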
文章引用:卞宇仁. 基于深度学习的音乐源分离技术研究[J]. 计算机科学与应用, 2022, 12(12): 2788-2794. https://doi.org/10.12677/CSA.2022.1212283
