Research on Small-Sample Voiceprint Recognition Based on Deep Learning
DOI: 10.12677/AAM.2020.91004
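Authors: Han Lü (韩侣), Ma Wenlian (马文联)*: Department of Applied Mathematics, School of Science, Changchun University of Science and Technology, Changchun, Jilin; Zhou Linhua (周林华), Zheng Weijie (郑伟杰), Ma Tao (马涛), Li Tianxing (李天星): Provincial Teaching Demonstration Center for Mathematical Experiments (Changchun University of Science and Technology), Changchun, Jilin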
Keywords: Voiceprint Recognition; Deep Belief Networks; Support Vector Machine; Random Forest
Abstract: This paper studies small-sample voiceprint recognition. In the experiments, 39-dimensional features composed of Mel-frequency cepstral coefficients and their dynamic difference coefficients serve as the basic acoustic features. A deep belief network built by stacking three restricted Boltzmann machines then extracts 128-dimensional deep acoustic features from these basic features, and a support vector machine and a random forest perform the final voiceprint recognition. To train the deep belief network, small-sample data consisting of short speech signals from each speaker are used as the training set. The trained deep belief network then serves as a deep acoustic feature extractor and is applied to the speech of speakers outside the training set, which further verifies the extractor's generalization ability. The experimental results show that the proposed voiceprint recognition model achieves high recognition accuracy and that the deep feature extractor generalizes well.
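As an illustration of the 39-dimensional basic feature described in the abstract, the following minimal sketch stacks static cepstral coefficients with their first- and second-order difference (delta) coefficients. It rests on assumptions not stated in the abstract: a 13 + 13 + 13 split, the common regression-based delta formula with a window of two frames on each side, and edge frames repeated as padding. The function names `delta` and `stack_39` are hypothetical, and the input is synthetic rather than real MFCC output.

```python
from typing import List

Frame = List[float]


def delta(frames: List[Frame], width: int = 2) -> List[Frame]:
    """Regression-based difference coefficients (assumed formula):
    d_t = sum_{n=1..width} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..width} n^2).
    Frames beyond either end are replaced by the nearest edge frame."""
    n_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out: List[Frame] = []
    for t in range(n_frames):
        row: Frame = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, n_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def stack_39(static: List[Frame]) -> List[Frame]:
    """Concatenate static, delta, and delta-delta coefficients per frame.
    With 13 static coefficients (an assumption) this gives 39 dimensions."""
    d1 = delta(static)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Synthetic stand-in for 10 frames of 13 cepstral coefficients:
# every coefficient rises by 1 per frame.
static = [[float(t + k) for k in range(13)] for t in range(10)]
features = stack_39(static)
print(len(features), len(features[0]))
```

In a full pipeline of the kind the abstract outlines, each such 39-dimensional frame would be the input to the deep belief network, whose 128-dimensional output is then passed to the classifiers.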
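Cite this article: Han Lü, Zhou Linhua, Ma Wenlian, Zheng Weijie, Ma Tao, Li Tianxing. Research on Small-Sample Voiceprint Recognition Based on Deep Learning [J]. Advances in Applied Mathematics (应用数学进展), 2020, 9(1): 30-37. https://doi.org/10.12677/AAM.2020.91004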
