[1] Sproat, R.W. and Olive, J.P. (1995) Text-to-Speech Synthesis. AT&T Technical Journal, 74, 35-44.
[2] Olive, J.P. (1977) Rule Synthesis of Speech from Dyadic Units. IEEE International Conference on Acoustics, Speech, and Signal Processing, Hartford, 9-11 May 1977, 568-570.
[3] Zen, H., Tokuda, K. and Black, A.W. (2009) Statistical Parametric Speech Synthesis. Speech Communication, 51, 1039-1064.
[4] Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z. and Wu, Y. (2018) Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, 15-20 April 2018, 4779-4783.
[5] Wu, Y.C., Hayashi, T., Tobing, P.L., Kobayashi, K. and Toda, T. (2021) Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-Dependent Dilated Convolution Neural Network. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 1134-1148.
[6] Prenger, R., Valle, R. and Catanzaro, B. (2019) WaveGlow: A Flow-Based Generative Network for Speech Synthesis. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, 12-17 May 2019, 3617-3621.
[7] Kong, J., Kim, J. and Bae, J. (2020) HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. Advances in Neural Information Processing Systems, Vol. 33, 17022-17033.
[8] Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z. and Liu, T.Y. (2020) FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.
[9] Choi, S., Han, S., Kim, D. and Ha, S. (2020) Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding. Proceedings Interspeech 2020, Shanghai, 25-29 October 2020, 2007-2011.
[10] An, X., Soong, F.K. and Xie, L. (2022) Disentangling Style and Speaker Attributes for TTS Style Transfer. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 646-658.
[11] Zhou, Y., Song, C., Li, X., Zhang, L., Wu, Z., Bian, Y. and Meng, H. (2022) Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis. Proceedings Interspeech 2022, Incheon, 18-22 September 2022, 2573-2577.
[12] Miao, Y. and Metze, F. (2015) On Speaker Adaptation of Long Short-Term Memory Recurrent Neural Networks. 16th Annual Conference of the International Speech Communication Association (Interspeech 2015), Dresden, 6-10 September 2015, 1101-1105.
[13] Cooper, E., Lai, C.I., Yasuda, Y., Fang, F., Wang, X., Chen, N. and Yamagishi, J. (2020) Zero-Shot Multi-Speaker Text-to-Speech with State-of-the-Art Neural Speaker Embeddings. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, 4-8 May 2020, 6184-6188.
[14] Li, X., Song, C., Li, J., Wu, Z., Jia, J. and Meng, H. (2021) Towards Multi-Scale Style Control for Expressive Speech Synthesis. Proceedings Interspeech 2021, Brno, 30 August-3 September 2021, 4673-4677.
[15] Hsu, W.N., Zhang, Y., Weiss, R.J., et al. (2019) Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, 12-17 May 2019, 5901-5905.
[16] Fang, W., Chung, Y.A. and Glass, J. (2019) Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models.
[17] Qian, K., Zhang, Y., Chang, S., Xiong, J., Gan, C., Cox, D. and Hasegawa-Johnson, M. (2021) Global Prosody Style Transfer without Text Transcriptions. Proceedings of Machine Learning Research, 139, 8650-8660.
[18] Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W. and Zhou, Y. (2017) Deep Voice 2: Multi-Speaker Neural Text-to-Speech. NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, 4-9 December 2017, 2966-2974.
[19] Xue, L., Pan, S., He, L., Xie, L. and Soong, F.K. (2021) Cycle Consistent Network for End-to-End Style Transfer TTS Training. Neural Networks, 140, 223-236.
[20] Shi, Y., Bu, H., Xu, X., Zhang, S. and Li, M. (2021) AISHELL-3: A Multi-Speaker Mandarin TTS Corpus and the Baselines. Proceedings Interspeech 2021, Brno, 30 August-3 September 2021, 2756-2760.
[21] Pypinyin. https://pypi.org/project/pypinyin
[22] Wan, L., Wang, Q., Papir, A. and Moreno, I.L. (2018) Generalized End-to-End Loss for Speaker Verification. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, 15-20 April 2018, 4879-4883.