Speech Driven 3D Facial Animation Generation Method with Prior Knowledge
Abstract: Speech-driven 3D facial animation is an active research topic in computer vision and graphics, with applications such as game animation, 3D video calls, and 3D avatars for AR/MR. Because facial motion is complex and uncertain, results produced by previous methods suffer from inaccurate lip shapes and poor facial dynamics. Unlike previous one-stage methods, we use a new two-stage approach. In the first stage of training, a variational autoencoder maps the high-dimensional, complex facial motion into a low-dimensional space, learning a facial motion prior. In the second stage, a Transformer takes the input speech signal, queries latent codes from the learned facial prior, and generates the facial motion sequence by regression. This lowers the difficulty of generating facial animation, reduces the ambiguity of the speech-to-motion mapping, and yields vivid talking-face animation for any given audio. Evaluation shows that our method outperforms state-of-the-art methods in lip shape accuracy and facial dynamics.
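The second stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the learned prior is stored as a small table of latent codes that is queried by nearest-neighbour lookup, and the names `nearest_code`, `decode`, `generate_motion`, and `toy_predictor`, along with all numeric values, are hypothetical. The Transformer is replaced by a simple stand-in function.

```python
def nearest_code(query, codebook):
    """Return the index of the codebook entry closest to `query` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(query, codebook[i]))


def decode(code, decoder_weights):
    """Toy linear decoder (assumption): latent code -> motion values."""
    return [sum(w * c for w, c in zip(row, code)) for row in decoder_weights]


def generate_motion(audio_features, codebook, decoder_weights, predict_query):
    """Map each audio frame to a latent code from the prior, then decode it.

    `predict_query` is a hypothetical stand-in for the Transformer; here it
    sees the current audio feature and the previously selected code.
    """
    prev_code = [0.0] * len(codebook[0])
    indices, frames = [], []
    for feat in audio_features:
        query = predict_query(feat, prev_code)
        idx = nearest_code(query, codebook)   # latent code query on the prior
        prev_code = codebook[idx]
        indices.append(idx)
        frames.append(decode(prev_code, decoder_weights))
    return indices, frames


# Toy prior: three 2-D latent codes and a decoder producing 3 motion values.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
decoder_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def toy_predictor(feat, prev_code):
    # Average the audio feature with the previous code (illustrative only).
    return [0.5 * f + 0.5 * p for f, p in zip(feat, prev_code)]


audio_features = [[2.0, 0.0], [0.0, 2.2], [0.1, 0.1]]
indices, frames = generate_motion(
    audio_features, codebook, decoder_weights, toy_predictor
)
print(indices)  # [1, 2, 2]
```

Because every output frame is decoded from a code that already lies in the learned prior, the predictor only has to choose among plausible motions instead of regressing raw vertex positions directly, which is the intuition behind the reduced mapping ambiguity.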
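Cite this article: Lyu Zhenyu, Xia Fangfang, Liu Fangli, Guo Runjia, Guo Zijun. Speech Driven 3D Facial Animation Generation Method with Prior Knowledge [J]. Computer Science and Application, 2023, 13(11): 2072-2079. https://doi.org/10.12677/CSA.2023.1311206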
