HBF Talk: Speech-Driven 3D Facial Animation Synthesis Research
DOI: 10.12677/csa.2024.148174. Supported by research project funding.
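Authors: Wenxiang Wang (王文祥), Shaobo Wang (王少波), Yu Zhi (智宇): Metaverse and Artificial Intelligence Research Center, College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang; Ang Chen (陈昂): Metaverse and Artificial Intelligence Research Center, College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang; Institute of Metaverse and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang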
Keywords: HuBERT, Flash, Transformer, 3D Facial Animation, Lip Movements
Abstract: Speech-driven 3D facial animation has been widely studied in recent years. Although previous work can generate coherent 3D facial animations from speech data, the scarcity of audio-visual data leaves the generated animations lacking in realism and vividness, and the lip movements are not accurate enough. To improve the accuracy and vividness of lip movements, this paper proposes HBF Talk, an end-to-end neural network model. It uses the pre-trained HuBERT (Hidden-Unit BERT) model to extract and encode features from speech data, introduces a Flash module to further encode the extracted speech features into richer contextual representations, and finally decodes them with a biased cross-modal Transformer decoder. Quantitative and qualitative experiments, together with comparisons against existing baseline models, show that HBF Talk performs better and improves the accuracy and vividness of speech-driven lip movements.
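The abstract does not spell out what "biased" means for the cross-modal decoder. As a rough illustration only, the sketch below assumes an alignment-style additive bias of the kind used in earlier Transformer-based speech-to-face decoders: each motion frame may attend only to the audio frames that cover the same time span. The function names (`alignment_bias`, `biased_cross_attention`) and the assumption that the audio frame count is an integer multiple of the motion frame count are hypothetical and are not taken from the paper.

```python
import math


def alignment_bias(num_motion_frames, num_audio_frames):
    """Build an additive attention bias (hypothetical helper).

    Entry [i][j] is 0.0 when audio frame j falls inside the time span of
    motion frame i, and -inf otherwise, so misaligned audio is masked out.
    Assumes the audio frame count is an integer multiple of the motion count.
    """
    if num_motion_frames <= 0 or num_audio_frames % num_motion_frames != 0:
        raise ValueError("audio frames must be a positive multiple of motion frames")
    k = num_audio_frames // num_motion_frames  # audio frames per motion frame
    return [
        [0.0 if i * k <= j < (i + 1) * k else -math.inf
         for j in range(num_audio_frames)]
        for i in range(num_motion_frames)
    ]


def softmax(row):
    """Numerically stable softmax; -inf scores receive weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_cross_attention(queries, keys, values, bias):
    """Single-head scaled dot-product cross-attention with an additive bias.

    queries: motion-side vectors, keys/values: audio-side vectors.
    """
    d = len(keys[0])
    outputs = []
    for q, bias_row in zip(queries, bias):
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) + b
            for k, b in zip(keys, bias_row)
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


if __name__ == "__main__":
    # Two motion frames, four audio frames: each motion frame sees two audio frames.
    bias = alignment_bias(2, 4)
    queries = [[0.0, 0.0], [0.0, 0.0]]  # zero queries give uniform weights
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    values = [[1.0], [3.0], [10.0], [20.0]]
    print(biased_cross_attention(queries, keys, values, bias))  # [[2.0], [15.0]]
```

With zero queries, every unmasked audio frame gets equal weight, so each output is simply the mean of the two aligned audio values. This makes the effect of the bias easy to see in isolation from any learned parameters.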
Citation: Wang, W.X., Wang, S.B., Zhi, Y. and Chen, A. (2024) HBF Talk: Speech-Driven 3D Facial Animation Synthesis Research. Computer Science and Application, 14(8), 168-178. https://doi.org/10.12677/csa.2024.148174
