Self-Diffuser: Research on the Technology of Speech-Driven Facial Expressions
DOI: 10.12677/csa.2024.148181
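Authors: Mengli Zang, Shaobo Wang, Yu Zhi: Metaverse and Artificial Intelligence Research Center, College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang; Ang Chen: Metaverse and Artificial Intelligence Research Center, College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang; Institute of Metaverse and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang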
Keywords: wav2vec 2.0, Transformer, Diffusion Model, Speech-Driven Facial Animation
Abstract: Previous research on speech-driven facial animation has produced fairly realistic and accurate lip movements and facial expressions from audio signals. Traditional methods focus on learning a deterministic mapping from speech to animation. More recent studies explore diversity in speech-driven 3D facial animation, using the generative diversity of diffusion models to capture the complex many-to-many relationship between audio and facial motion. This paper proposes Self-Diffuser, which encodes the audio input with the pre-trained self-supervised speech model wav2vec 2.0 and combines a diffusion-based technique with a Transformer to perform the generation task. The work addresses the limitations of traditional regression models in generating realistic, accurate lip movements that remain intelligible to lip reading, and it examines the trade-off between precise lip synchronization and the creation of facial expressions that are independent of speech. Compared with current state-of-the-art methods, Self-Diffuser produces more accurate lip movements, and in the upper face, which is only loosely correlated with speech, it produces motion closer to real speaking expressions. The diffusion mechanism also substantially increases the diversity of the generated 3D facial animation sequences.
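To make the diffusion mechanism named in the abstract more concrete, the following is a minimal sketch of the forward (noising) step of a standard denoising diffusion probabilistic model, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. It is not the authors' implementation: the linear noise schedule, the number of steps, and the names `linear_beta_schedule`, `cumulative_alphas`, and `add_noise` are all assumptions chosen for illustration, and a frame is represented as a flat list of hypothetical vertex offsets.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (values are assumptions)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) for one flattened motion frame.

    x0 is a list of floats, t is a 0-based step index, and rng is a
    random.Random instance. Returns the noisy frame and the noise used.
    """
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
frame = [0.1, -0.2, 0.05]  # hypothetical vertex offsets for one frame
noisy_frame, eps = add_noise(frame, 999, alpha_bars, random.Random(0))
```

In a model of this kind, a denoising network (here it would be the Transformer, conditioned on the wav2vec 2.0 audio features) is trained to undo this corruption. Sampling from different noise draws is what yields multiple plausible animations for the same audio.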
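Cite this article: Zang, M., Wang, S., Zhi, Y. and Chen, A. (2024) Self-Diffuser: Research on the Technology of Speech-Driven Facial Expressions. Computer Science and Application, 14(8), 236-249. https://doi.org/10.12677/csa.2024.148181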
