Robustness Improvement for Audio Source Separation Platform
Abstract: Neural-network-based audio source separation methods perform well and are widely applied, yet their robustness to deliberate attacks has been largely overlooked. Building on the audio source separation platform rl_singing_voice-master, this paper proposes a new separation platform structure that introduces a self-attention mechanism and regularizes it with variational dropout. Experiments on the MUSDB18 dataset show that, compared with the original platform, the improved platform is markedly more robust to deliberate attacks using adversarial examples and also achieves better separation performance.
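The abstract names two ingredients, self-attention and variational dropout, without giving their form. The following is only a minimal NumPy sketch of how the two can be combined, not the paper's implementation. It assumes single-head scaled dot-product attention over time frames, and it assumes "variational dropout" in the sense of one dropout mask sampled per sequence and shared across all frames. The function names, dimensions, and drop probability are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over time frames; x has shape (T, d_in)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # shape (T, T), each row sums to 1
    return weights @ v, weights

def variational_dropout_mask(num_features, drop_prob, rng):
    """Assumed variant: one inverted-dropout mask per sequence, shared by all frames."""
    keep = 1.0 - drop_prob
    return rng.binomial(1, keep, size=(1, num_features)) / keep

# Hypothetical sizes: 6 time frames, 8 input features, 4 attention features.
rng = np.random.default_rng(0)
T, d_in, d = 6, 8, 4
x = rng.standard_normal((T, d_in))
w_q, w_k, w_v = (rng.standard_normal((d_in, d)) * 0.1 for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
mask = variational_dropout_mask(d, drop_prob=0.25, rng=rng)
regularized = out * mask  # the same mask is broadcast to every time frame
```

Sharing one mask across frames means a dropped feature is dropped for the whole sequence, which is what distinguishes this variant from standard per-element dropout. The mask would be resampled for each training sequence and omitted at inference time.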
Cite this article: Li Mingyuan. Robustness Improvement for Audio Source Separation Platform [J]. Computer Science and Application, 2022, 12(10): 2268-2274. https://doi.org/10.12677/CSA.2022.1210231
