LSTM-Based End-to-End Voiceprint Recognition Algorithm Implementation
DOI: 10.12677/SEA.2021.104052
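Authors: Wang Fei (王飞)*: School of Information, Zhejiang Sci-Tech University, Hangzhou, Zhejiang; Xu Yingjie (徐颖捷): Qixin College, Zhejiang Sci-Tech University, Hangzhou, Zhejiang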
Keywords: Voiceprint Recognition; End-to-End Loss; LSTM Neural Network
Abstract: In recent years, as neural networks have advanced rapidly in speech recognition, deep learning has been applied to voiceprint recognition with good results. This paper first introduces the basic theory of voiceprint recognition and describes the general methods of speech signal preprocessing and feature recognition. It then presents an end-to-end voiceprint recognition algorithm based on an LSTM neural network and explains the advantages of this algorithm from a theoretical standpoint. The speaker voiceprint recognition model built with this algorithm greatly reduces model training time and achieves fairly good training results.
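Cite this article: Wang Fei, Xu Yingjie. LSTM-Based End-to-End Voiceprint Recognition Algorithm Implementation [J]. Software Engineering and Applications, 2021, 10(4): 467-479. https://doi.org/10.12677/SEA.2021.104052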
