BVTS: A High-Quality Speech Synthesis Model Enhanced by BERT
DOI: 10.12677/csa.2025.1511286
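Authors: Pengfei Yin: School of Information Engineering, Beijing Institute of Graphic Communication, Beijing; Xueqing Liu: School of Marxism, Guangdong University of Foreign Studies, Guangzhou, Guangdong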
Keywords: Speech Synthesis, BERT Model, VITS
Abstract: In recent years, Text-to-Speech (TTS) technology has made significant progress in end-to-end modeling and audio quality optimization, and the clarity and fluency of synthesized speech have improved substantially. Synthesized speech still falls short of the quality of real human speech, however, and the main bottlenecks are insufficient prosody modeling and weak adaptation to the semantics of the input text. This paper proposes BVTS (BERT-Integrated-VITS2), a speech synthesis framework enhanced by BERT (Bidirectional Encoder Representations from Transformers). Built on the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) framework, the model introduces a multimodal text encoder that, guided by BERT feature embeddings, fuses phonetic and prosodic features at the feature level. It also adopts a bidirectional invertible flow model and a stochastic duration predictor to achieve fine-grained control over speech rhythm and speaking rate. Experiments on the LJ Speech dataset and a self-built game dataset show that, compared with current mainstream models, BVTS achieves a clearly higher Mean Opinion Score (MOS) overall and a lower Character Error Rate (CER), improving the expressiveness, naturalness, and intelligibility of synthesized speech.
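For readers unfamiliar with the CER metric reported above, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein (edit) distance between a reference transcript and a recognized transcript, divided by the reference length. The function names `edit_distance` and `character_error_rate` are illustrative and not taken from the paper. The abstract does not state which speech recognizer or text normalization the authors used to obtain transcripts, so those steps are left out here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    print(character_error_rate("kitten", "sitting"))
```

In a typical evaluation, the hypothesis would be the output of a speech recognizer run on the synthesized audio, so a lower CER suggests more intelligible speech.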
Citation: Yin, P.F. and Liu, X.Q. (2025) BVTS: A High-Quality Speech Synthesis Model Enhanced by BERT. Computer Science and Application, 15(11), 85-93. https://doi.org/10.12677/csa.2025.1511286
