[1] Sproat, R.W. and Olive, J.P. (1995) Text-to-Speech Synthesis. AT&T Technical Journal, 74, 35-44.
[2] Olive, J.P. (1977) Rule Synthesis of Speech from Dyadic Units. IEEE International Conference on Acoustics, Speech, and Signal Processing, Hartford, 9-11 May 1977, 568-570.
[3] Zen, H., Tokuda, K. and Black, A.W. (2009) Statistical Parametric Speech Synthesis. Speech Communication, 51, 1039-1064.
[4] Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z. and Wu, Y. (2018) Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, 15-20 April 2018, 4779-4783.
[5] Wu, Y.C., Hayashi, T., Tobing, P.L., Kobayashi, K. and Toda, T. (2021) Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-Dependent Dilated Convolution Neural Network. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 1134-1148.
[6] Prenger, R., Valle, R. and Catanzaro, B. (2019) WaveGlow: A Flow-Based Generative Network for Speech Synthesis. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, 12-17 May 2019, 3617-3621.
[7] Kong, J., Kim, J. and Bae, J. (2020) HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. Advances in Neural Information Processing Systems, Vol. 33, 17022-17033.
[8] Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z. and Liu, T.Y. (2020) FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.
[9] Choi, S., Han, S., Kim, D. and Ha, S. (2020) Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding. Proceedings Interspeech 2020, Shanghai, 25-29 October 2020, 2007-2011.
[10] An, X., Soong, F.K. and Xie, L. (2022) Disentangling Style and Speaker Attributes for TTS Style Transfer. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 646-658.
[11] Zhou, Y., Song, C., Li, X., Zhang, L., Wu, Z., Bian, Y. and Meng, H. (2022) Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis. Proceedings Interspeech 2022, Incheon, 18-22 September 2022, 2573-2577.
[12] Miao, Y. and Metze, F. (2015) On Speaker Adaptation of Long Short-Term Memory Recurrent Neural Networks. 16th Annual Conference of the International Speech Communication Association (Interspeech 2015), Dresden, 6-10 September 2015, 1101-1105.
[13] Cooper, E., Lai, C.I., Yasuda, Y., Fang, F., Wang, X., Chen, N. and Yamagishi, J. (2020) Zero-Shot Multi-Speaker Text-to-Speech with State-of-the-Art Neural Speaker Embeddings. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, 4-8 May 2020, 6184-6188.
[14] Li, X., Song, C., Li, J., Wu, Z., Jia, J. and Meng, H. (2021) Towards Multi-Scale Style Control for Expressive Speech Synthesis. Proceedings Interspeech 2021, Brno, 30 August-3 September 2021, 4673-4677.
[15] Hsu, W.N., Zhang, Y., Weiss, R.J., et al. (2019) Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, 12-17 May 2019, 5901-5905.
[16] Fang, W., Chung, Y.A. and Glass, J. (2019) Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models.
[17] Qian, K., Zhang, Y., Chang, S., Xiong, J., Gan, C., Cox, D. and Hasegawa-Johnson, M. (2021) Global Prosody Style Transfer without Text Transcriptions. Proceedings of Machine Learning Research, 139, 8650-8660.
[18] Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W. and Zhou, Y. (2017) Deep Voice 2: Multi-Speaker Neural Text-to-Speech. NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, 4-9 December 2017, 2966-2974.
[19] Xue, L., Pan, S., He, L., Xie, L. and Soong, F.K. (2021) Cycle Consistent Network for End-to-End Style Transfer TTS Training. Neural Networks, 140, 223-236.
[20] Shi, Y., Bu, H., Xu, X., Zhang, S. and Li, M. (2021) AISHELL-3: A Multi-Speaker Mandarin TTS Corpus and the Baselines. Proceedings Interspeech 2021, Brno, 30 August-3 September 2021, 2756-2760.
[21] Pypinyin. https://pypi.org/project/pypinyin
[22] Wan, L., Wang, Q., Papir, A. and Moreno, I.L. (2018) Generalized End-to-End Loss for Speaker Verification. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, 15-20 April 2018, 4879-4883.