[1] Taylor, S.L., Mahler, M., Theobald, B.J., et al. (2012) Dynamic Units of Visual Speech. Proceedings of the 11th ACM SIGGRAPH/Eurographics Conference on Computer Animation, Lausanne, 29-31 July 2012, 275-284.
[2] Xu, Y., Feng, A.W., Marsella, S. and Shapiro, A. (2013) A Practical and Configurable Lip Sync Method for Games. Proceedings of Motion on Games, Dublin, 6-8 November 2013, 131-140.
[3] Chen, N., Zhang, Y., Zen, H., et al. (2020) WaveGrad: Estimating Gradients for Waveform Generation. arXiv: 2009.00713.
[4] Guo, Y., Chen, K., Liang, S., Liu, Y., Bao, H. and Zhang, J. (2021) AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, 10-17 October 2021, 5764-5774.
[5] Peng, Z., Wu, H., Song, Z., Xu, H., Zhu, X., He, J., et al. (2023) EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, 1-6 October 2023, 20630-20640.
[6] Zhang, C., Ni, S., Fan, Z., Li, H., Zeng, M., Budagavi, M., et al. (2023) 3D Talking Face with Personalized Pose Dynamics. IEEE Transactions on Visualization and Computer Graphics, 29, 1438-1449.
[7] Karras, T., Aila, T., Laine, S., Herva, A. and Lehtinen, J. (2017) Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion. ACM Transactions on Graphics, 36, 1-12.
[8] Cudeiro, D., Bolkart, T., Laidlaw, C., Ranjan, A. and Black, M.J. (2019) Capture, Learning, and Synthesis of 3D Speaking Styles. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, 15-20 June 2019, 10093-10103.
[9] Richard, A., Zollhofer, M., Wen, Y., de la Torre, F. and Sheikh, Y. (2021) MeshTalk: 3D Face Animation from Speech Using Cross-Modality Disentanglement. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, 10-17 October 2021, 1153-1162.
[10] Zhou, Y., Xu, Z., Landreth, C., Kalogerakis, E., Maji, S. and Singh, K. (2018) VisemeNet: Audio-Driven Animator-Centric Speech Animation. ACM Transactions on Graphics, 37, 1-10.
[11] Taylor, S., Kim, T., Yue, Y., Mahler, M., Krahe, J., Rodriguez, A.G., et al. (2017) A Deep Learning Approach for Generalized Speech Animation. ACM Transactions on Graphics, 36, 1-11.
[12] Thambiraja, B., Aliakbarian, S., Cosker, D., et al. (2023) 3DiFACE: Diffusion-Based Speech-Driven 3D Facial Animation and Editing. arXiv: 2312.00870.
[13] Fan, Y., Lin, Z., Saito, J., Wang, W. and Komura, T. (2022) FaceFormer: Speech-Driven 3D Facial Animation with Transformers. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, 18-24 June 2022, 18749-18758.
[14] Peng, Z., Luo, Y., Shi, Y., Xu, H., Zhu, X., Liu, H., et al. (2023) SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces. Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, 29 October-3 November 2023, 5292-5301.
[15] Pham, H.X., Wang, Y. and Pavlovic, V. (2022) Learning Continuous Facial Actions from Speech for Real-Time Animation. IEEE Transactions on Affective Computing, 13, 1567-1580.
[16] Tian, G., Yuan, Y. and Liu, Y. (2019) Audio2Face: Generating Speech/Face Animation from Single Audio with Attention-Based Bidirectional LSTM Networks. 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shanghai, 8-12 July 2019, 366-371.
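[17] Ni, H. (2020) Research on Facial Expression Animation Technology Based on the Dirichlet Free-Form Deformation Algorithm. Master's Thesis, Wuhan University of Technology, Wuhan. (In Chinese)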
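[18] Yang, S., Fan, B., Xie, L., et al. (2017) Speech-Driven Realistic Facial Animation Synthesis Based on BLSTM-RNN. Journal of Tsinghua University (Science and Technology), 57(3), 250-256. (In Chinese)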
[19] Fan, B., Wang, L., Soong, F.K. and Xie, L. (2015) Photo-Real Talking Head with Deep Bidirectional LSTM. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, 19-24 April 2015, 4884-4888.
[20] Xing, J., Xia, M., Zhang, Y., Cun, X., Wang, J. and Wong, T. (2023) CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, 17-24 June 2023, 12780-12790.
[21] Van Den Oord, A., Vinyals, O. and Kavukcuoglu, K. (2017) Neural Discrete Representation Learning. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, 4-9 December 2017, 6309-6318.
[22] Haque, K.I. and Yumak, Z. (2023) FaceXHuBERT: Text-Less Speech-Driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning. Proceedings of the 25th International Conference on Multimodal Interaction, Paris, 9-13 October 2023, 282-291.
[23] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. and Ganguli, S. (2015) Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. arXiv: 1503.03585.
[24] Stan, S., Haque, K.I. and Yumak, Z. (2023) FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion. ACM SIGGRAPH Conference on Motion, Interaction and Games, Rennes, 15-17 November 2023, 1-11.
[25] Chen, J., Liu, Y., Wang, J., et al. (2024) DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-Driven Holistic 3D Expression and Gesture Generation. arXiv: 2401.04747.
[26] Ma, Z., Zhu, X., Qi, G., et al. (2024) DiffSpeaker: Speech-Driven 3D Facial Animation with Diffusion Transformer. arXiv: 2402.05712.
[27] Rasul, K., Seward, C., Schuster, I. and Vollgraf, R. (2021) Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting. arXiv: 2101.12072.
[28] Sun, Z., Lv, T., Ye, S., Lin, M., Sheng, J., Wen, Y., et al. (2024) DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models. ACM Transactions on Graphics, 43, 1-9.
[29] JALI Research. https://jaliresearch.com/
[30] Ho, J., Jain, A. and Abbeel, P. (2020) Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems, 33, 6840-6851.
[31] Song, Y., Sohl-Dickstein, J., Kingma, D.P., et al. (2020) Score-Based Generative Modeling through Stochastic Differential Equations. arXiv: 2011.13456.
[32] Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D. and Bermano, A.H. (2022) Human Motion Diffusion Model. arXiv: 2209.14916.
[33] Bigioi, D., Basak, S., Stypułkowski, M., Zieba, M., Jordan, H., McDonnell, R., et al. (2024) Speech Driven Video Editing via an Audio-Conditioned Diffusion Model. Image and Vision Computing, 142, Article ID: 104911.
[34] Xiao, X., Liang, J., Tong, J. and Wang, H. (2024) Emergency Decision Support Techniques for Nuclear Power Plants: Current State, Challenges, and Future Trends. Energies, 17, Article 2439.
[35] Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., et al. (2024) MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46, 4115-4128.
[36] Baevski, A., Zhou, Y., Mohamed, A., et al. (2020) wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, 6-12 December 2020, 12449-12460.
[37] Lea, C., Vidal, R., Reiter, A. and Hager, G.D. (2016) Temporal Convolutional Networks: A Unified Approach to Action Segmentation. In: Hua, G. and Jégou, H., Eds., Computer Vision—ECCV 2016 Workshops, Springer, 47-54.
[38] Vaswani, A., Shazeer, N., Parmar, N., et al. (2017) Attention Is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, 4-9 December 2017, 6000-6010.
[39] Fanelli, G., Gall, J., Romsdorfer, H., Weise, T. and Van Gool, L. (2010) A 3-D Audio-Visual Corpus of Affective Communication. IEEE Transactions on Multimedia, 12, 591-598.
[40] Kingma, D.P. and Ba, J. (2014) Adam: A Method for Stochastic Optimization. arXiv: 1412.6980.