Speaker Verification Analysis Based on Cosine Softmax Loss
DOI: 10.12677/AAM.2021.1011412. Supported by the National Natural Science Foundation of China.
Authors: Qiwang Xie, Linhua Zhou*: School of Mathematics and Statistics, Changchun University of Science and Technology, Changchun, Jilin
Keywords: Softmax Loss, Speaker Embedding, Speaker Verification, Angular Margin
Abstract: Neural-network-based speaker embeddings show good performance on speaker verification tasks. However, the traditional speaker embedding network is usually trained with the Softmax loss, which produces features with weak inter-class separation. This paper therefore improves the Softmax loss by introducing negative-pair learning (pairs of samples drawn from different speakers), so that in cosine-angle space the embeddings of different speakers are separated by a clear angular margin while embeddings of the same speaker cluster tightly. Speaker verification experiments on the public AISHELL dataset show that the proposed loss achieves a lower equal error rate (EER) and a larger area under the ROC curve than A-Softmax.
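The core idea described above, training on cosine-normalized logits plus a penalty that pushes apart negative (different-speaker) pairs, can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the function name, the scale factor, and the penalty weight `lam` are assumptions for the sketch.

```python
import numpy as np

def cosine_softmax_with_negative_pairs(emb, labels, weights, scale=30.0, lam=0.1):
    """Illustrative cosine-Softmax loss with a negative-pair penalty.

    emb:     (N, D) speaker embeddings
    labels:  (N,)   integer speaker IDs
    weights: (C, D) class weight vectors
    scale:   logit scale after L2 normalization (hypothetical value)
    lam:     weight of the negative-pair penalty (hypothetical value)
    """
    # L2-normalize embeddings and class weights so logits are cosine similarities
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    logits = scale * e @ w.T                      # (N, C) scaled cosines

    # standard cross-entropy over the cosine logits
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()

    # negative-pair term: penalize high cosine similarity between
    # embeddings that belong to different speakers
    cos = e @ e.T                                 # pairwise cosine similarities
    diff = labels[:, None] != labels[None, :]     # mask of negative pairs
    neg_penalty = np.maximum(cos[diff], 0.0).mean() if diff.any() else 0.0

    return ce + lam * neg_penalty
```

Minimizing the first term pulls each embedding toward its own speaker's weight vector, while the second term widens the angular gap between different speakers, which is the inter-class separation the abstract targets.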
Citation: Xie, Q.W. and Zhou, L.H. (2021) Speaker Verification Analysis Based on Cosine Softmax Loss. Advances in Applied Mathematics, 10(11), 3883-3889. https://doi.org/10.12677/AAM.2021.1011412
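The evaluation metric reported above, the equal error rate, is the operating point at which the false-acceptance rate equals the false-rejection rate. A minimal sketch of computing it from verification trial scores (the function name and the simple threshold sweep are my own illustration, not the paper's evaluation code):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate the equal error rate (EER) from verification trial scores.

    scores: similarity scores, higher means "same speaker"
    labels: 1 for target (same-speaker) trials, 0 for impostor trials
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = float("inf"), None
    # sweep decision thresholds over the observed scores
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        far = np.mean(accept[labels == 0])        # false acceptance rate
        frr = np.mean(~accept[labels == 1])       # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2  # EER where FAR meets FRR
    return eer
```

A lower EER means the score distributions of target and impostor trials overlap less, which is what a larger angular margin between speakers is intended to achieve.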

References

[1] Reynolds, D.A., Quatieri, T.F. and Dunn, R.B. (2000) Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing, 10, 19-41.
[2] Trabelsi, I., Ayed, D.B. and Ellouze, N. (2016) Comparison between GMM-SVM Sequence Kernel and GMM: Application to Speech Emotion Recognition. Journal of Engineering Science and Technology, 11, 1221-1233.
[3] Kanagasundaram, A., Vogt, R., Dean, D., et al. (2011) i-Vector Based Speaker Recognition on Short Utterances. INTERSPEECH, Florence, 27-31 August 2011, 2341-2344.
[4] Dehak, N., Kenny, P.J., Dehak, R., et al. (2011) Front-End Factor Analysis for Speaker Verification. IEEE Transactions on Audio, Speech, and Language Processing, 19, 788-798.
[5] Snyder, D., Garcia-Romero, D., Povey, D., et al. (2017) Deep Neural Network Embeddings for Text-Independent Speaker Verification. INTERSPEECH, Stockholm, 20-24 August 2017, 999-1003.
[6] Snyder, D., Garcia-Romero, D., Sell, G., et al. (2018) X-Vectors: Robust DNN Embeddings for Speaker Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, 15-20 April 2018, 5329-5333.
[7] Variani, E., Lei, X., Mcdermott, E., et al. (2014) Deep Neural Networks for Small Footprint Text-Dependent Speaker Verification. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, 4-9 May 2014, 4052-4056.
[8] Heigold, G., Moreno, I., Bengio, S., et al. (2016) End-to-End Text-Dependent Speaker Verification. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, 20-25 March 2016, 5115-5119.
[9] Liu, W., Wen, Y., Yu, Z., et al. (2016) Large-Margin Softmax Loss for Convolutional Neural Networks. The 33rd International Conference on Machine Learning (ICML 2016), New York, 19-24 June 2016, 507-516.
[10] Liu, W., Wen, Y., Yu, Z., et al. (2017) SphereFace: Deep Hypersphere Embedding for Face Recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, 21-26 July 2017, 212-220.
[11] Wang, F., Xiang, X., Cheng, J., et al. (2017) NormFace: L2 Hypersphere Embedding for Face Verification. Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, 23-27 October 2017, 1041-1049.
[12] Wang, H., Wang, Y., Zhou, Z., et al. (2018) CosFace: Large Margin Cosine Loss for Deep Face Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 18-23 June 2018, 5265-5274.
[13] Wang, F., Liu, W., Liu, H., et al. (2018) Additive Margin Softmax for Face Verification. IEEE Signal Processing Letters, 25, 926-930.
[14] Huang, Z., Wang, S. and Yu, K. (2018) Angular Softmax for Short Duration Text-Independent Speaker Verification. INTERSPEECH, Hyderabad, 2-6 September 2018, 3623-3627.
[15] Yu, Y.Q., Fan, L. and Li, W.J. (2019) Ensemble Additive Margin Softmax for Speaker Verification. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, 12-17 May 2019, 6046-6050.
[16] Li, Y., Gao, F., Ou, Z., et al. (2018) Angular Softmax Loss for End-to-End Speaker Verification. 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, 26-29 November 2018, 190-194.
[17] Bredin, H. (2017) TristouNet: Triplet Loss for Speaker Turn Embedding. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, 5-9 March 2017, 5430-5434.
[18] He, K., Zhang, X., Ren, S., et al. (2016) Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 27-30 June 2016, 770-778.
[19] Bu, H., Du, J., Na, X., et al. (2017) AISHELL-1: An Open-Source Mandarin Speech Corpus and a Speech Recognition Baseline. O-COCOSDA, Seoul, 1-3 November 2017, 1-5.