Burmese Emotional Speech Synthesis Based on Speech Parameter Adaptation
Abstract: Compared with Chinese and English, Burmese speech synthesis technology has developed relatively slowly, and the synthesized speech lacks emotion. Emotional speech synthesis makes machine speech sound less stiff. This paper studies Burmese emotional speech synthesis using a speech parameter adaptation method based on HMM acoustic models. One difficulty in emotional speech synthesis research is that large-scale emotional speech corpora are hard to obtain, so a method for Burmese emotional speech synthesis under low-resource conditions is proposed. First, Burmese phones are automatically segmented with MFA (Montreal Forced Aligner) to train the acoustic model, and a baseline Burmese speech synthesis system is built on the HTS platform from a medium-scale corpus of calm Burmese speech. Then, using a small amount of happy, sad, and angry speech data, a Burmese emotional speech synthesis system is built by speech parameter adaptation, and it is further improved by introducing an average voice model and adjusting the transformation matrix. Experimental results show that the system can synthesize Burmese speech in four emotions (calm, happy, sad, and angry), with an average EMOS score of up to 3.40, which demonstrates the effectiveness of the method.
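The abstract does not state the exact form of the adaptation transform. As a rough illustration only, the sketch below assumes an MLLR-style affine transform of Gaussian mean vectors, a common choice in HMM-based adaptation. It further assumes unit covariances and a hard frame-to-state alignment, under which the maximum-likelihood estimate reduces to ordinary least squares. The function names, array shapes, and synthetic data are hypothetical and are not taken from the paper or from HTS.

```python
import numpy as np


def estimate_mean_transform(means, frames, state_ids):
    """Estimate W (D x (D+1)) so that W @ [1, mu] fits the adaptation frames.

    means:     (S, D) mean vectors of the calm (source) model, one per state
    frames:    (T, D) feature vectors from the small emotional data set
    state_ids: (T,)   index of the state each frame is aligned to

    Assumption: unit covariances and hard alignments, so the estimate is a
    plain least-squares fit rather than the full MLLR solution.
    """
    # Extended mean vector [1, mu] of the state aligned to each frame.
    xi = np.hstack([np.ones((len(state_ids), 1)), means[state_ids]])
    # Solve xi @ W.T ~= frames in the least-squares sense.
    w_t, *_ = np.linalg.lstsq(xi, frames, rcond=None)
    return w_t.T


def adapt_means(means, w):
    """Apply the affine transform to every source mean vector."""
    xi = np.hstack([np.ones((means.shape[0], 1)), means])
    return xi @ w.T


# Synthetic check: generate "emotional" frames from a known affine map.
rng = np.random.default_rng(0)
num_states, dim, num_frames = 8, 3, 200
means = rng.normal(size=(num_states, dim))
a_true = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
b_true = rng.normal(size=dim)
state_ids = np.arange(num_frames) % num_states  # every state is observed
frames = means[state_ids] @ a_true.T + b_true   # noise-free for clarity

w = estimate_mean_transform(means, frames, state_ids)
adapted = adapt_means(means, w)
print(np.allclose(adapted, means @ a_true.T + b_true))
```

Because a single transform is shared across all states, a small amount of emotional speech can shift every mean in the model, including states never observed in the adaptation data. This is the property that makes adaptation attractive in the low-resource setting the abstract describes.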
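Cite this article: Liu Qiyun, Yang Jian, Tan Wanlin. Burmese Emotional Speech Synthesis Based on Speech Parameter Adaptation [J]. Computer Science and Application, 2022, 12(1): 33-45. https://doi.org/10.12677/CSA.2022.121005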
