|
[1]
|
Li, N., Liu, S., Liu, Y., Zhao, S. and Liu, M. (2019) Neural Speech Synthesis with Transformer Network. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 6706-6713. [Google Scholar] [CrossRef]
|
|
[2]
|
Tokuda, K., Zen, H. and Black, A.W. (2002) An HMM-Based Speech Synthesis System Applied to English. IEEE Speech Synthesis Workshop, Santa Monica, 13 September 2002, 227-230.
|
|
[3]
|
Schnell, N., Peeters, G., Lemouton, S., et al. (2000) Synthesizing a Choir in Real-Time Using Pitch Synchronous Overlap Add (PSOLA).
|
|
[4]
|
Wang, Y., Skerry-Ryan, R.J., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., et al. (2017) Tacotron: Towards End-to-End Speech Synthesis. Interspeech 2017, Stockholm, 20-24 August 2017, 4006-4010. [Google Scholar] [CrossRef]
|
|
[5]
|
Elias, I., Zen, H., Shen, J., Zhang, Y., Jia, Y., Skerry-Ryan, R.J., et al. (2021) Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling. Interspeech 2021, Brno, 30 August-3 September 2021, 141-145. [Google Scholar] [CrossRef]
|
|
[6]
|
Kim, J., Kong, J. and Son, J. (2021) Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. International Conference on Machine Learning, PMLR, 18-24 July 2021, 5530-5540.
|
|
[7]
|
Kong, J., Park, J., Kim, B., Kim, J., Kong, D. and Kim, S. (2023) VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design. Interspeech 2023, Dublin, 20-24 August 2023, 4374-4378. [Google Scholar] [CrossRef]
|
|
[8]
|
Ju, Z., Wang, Y., Shen, K., et al. (2024) Naturalspeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.
|
|
[9]
|
Peng, P., Huang, P., Li, S., Mohamed, A. and Harwath, D. (2024) Voicecraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Vol. 1, 12442-12462. [Google Scholar] [CrossRef]
|
|
[10]
|
Wang, K., Zhang, G., Zhou, Z., et al. (2025) A Comprehensive Survey in LLM (-Agent) Full Stack Safety: Data, Training and Deployment.
|
|
[11]
|
Li, J. and Zhang, L. (2023) ZSE-VITS: A Zero-Shot Expressive Voice Cloning Method Based on Vits. Electronics, 12, Article No. 820. [Google Scholar] [CrossRef]
|
|
[12]
|
Yamamoto, R., Song, E. and Kim, J. (2020) Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, 4-8 May 2020, 6199-6203. [Google Scholar] [CrossRef]
|
|
[13]
|
Kim, J., Kim, S., Kong, J., et al. (2020) Glow-tts: A Generative Flow for Text-to-Speech via Monotonic Alignment Search. Annual Conference on Neural Information Processing Systems 2020, 6-12 December 2020, 8067-8077.
|
|
[14]
|
Koroteev, M.V. (2021) BERT: A Review of Applications in Natural Language Processing and Understanding.
|
|
[15]
|
Csikszentmihalyi, M., Abuhamdeh, S. and Nakamura, J. (2014) Flow. In: Csikszentmihalyi, M., Ed., Flow and the Foundations of Positive Psychology: The Collected Works of Mihaly Csikszentmihalyi, Springer, 227-238.
|