Research on End-to-End Pronunciation Error Detection Based on Transfer Learning

doi:10.12677/CSA.2021.114091

Authors: Gao Wenming (高文明), Wu Yizhi (吴怡之), Wei Xinxiang (魏新享), Donghua University, Shanghai
Keywords: Phoneme; Transfer Learning; Automatic Pronunciation Error Detection; CTC; Long Short-Term Memory Neural Network

Abstract: Automatic pronunciation error detection serves second-language (L2) learners who need pronunciation practice, and the performance of advanced detection systems usually depends on the recognition rate of the acoustic model. With the development of deep learning, end-to-end acoustic models have gradually matured and offer a new approach to pronunciation error detection. This paper first builds an end-to-end acoustic model architecture for pronunciation error detection based on the Connectionist Temporal Classification (CTC) algorithm. Second, because of second-language transfer, L2 pronunciation often carries phoneme characteristics of the learner's native language; the paper therefore uses transfer learning to improve the performance of the native-language-based acoustic model and, in turn, the accuracy of pronunciation error detection. Compared with an acoustic model trained only on native English speech, the acoustic model that transfers native Chinese phoneme features achieves a lower phoneme error rate, and its training time is reduced by 7.3%. In pronunciation error detection, the detection accuracy improves by 2.06%.
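The abstract does not spell out how CTC outputs are decoded or how the phoneme error rate is scored, so the following is only a minimal sketch under stated assumptions: greedy (best-path) CTC decoding, and a phoneme error rate defined as Levenshtein distance divided by the length of the reference sequence. The function names, the blank symbol `<b>`, and the toy phoneme labels are illustrative and are not taken from the paper.

```python
from typing import List, Sequence

BLANK = "<b>"  # assumed symbol for the CTC blank label (illustrative)


def ctc_greedy_collapse(frame_labels: Sequence[str], blank: str = BLANK) -> List[str]:
    """Collapse per-frame best labels: merge consecutive repeats, then drop blanks."""
    out: List[str] = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance; substitution, insertion, and deletion each cost 1."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        row = [i]
        for j, h in enumerate(hyp, start=1):
            row.append(min(
                prev_row[j] + 1,              # deletion of a reference phoneme
                row[j - 1] + 1,               # insertion of an extra phoneme
                prev_row[j - 1] + (r != h),   # match (0) or substitution (1)
            ))
        prev_row = row
    return prev_row[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalised by the number of reference phonemes."""
    if not ref:
        raise ValueError("reference phoneme sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Toy example: canonical phonemes vs. a learner's frame-level model output.
canonical = ["th", "ih", "ng"]
learner_frames = ["s", "s", "<b>", "ih", "<b>", "ng"]
recognised = ctc_greedy_collapse(learner_frames)    # ['s', 'ih', 'ng']
per = phoneme_error_rate(canonical, recognised)     # one substitution out of three
```

In a setup like this, any mismatch between the recognised sequence and the canonical one would be a candidate mispronunciation, which is why the recognition rate of the acoustic model bounds the quality of error detection.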

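Cite this article: Gao Wenming, Wu Yizhi, Wei Xinxiang. Research on End-to-End Pronunciation Error Detection Based on Transfer Learning [J]. Computer Science and Application (计算机科学与应用), 2021, 11(4): 885-891. https://doi.org/10.12677/CSA.2021.114091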

