A Speech-Driven Facial Motion Method Based on Temporal Loss
DOI: 10.12677/CSA.2023.1312251. Supported by research project funding.
Authors: Zhenkai Wang, Chengwei Wang, Yifan Zhang, Haoyuan Li: School of Information Engineering, Hebei GEO University, Shijiazhuang, Hebei
Keywords: Speech-Driven, Cross-Modal Alignment, Facial Animation, Soft-DTW
Abstract: Research on speech-driven 3D facial motion has focused mainly on expanding multi-speaker 3D facial motion datasets and on extracting high-quality audio features. Collecting 3D facial motion data, however, is costly and requires laborious annotation, and the few samples available for a single speaker leave the data too sparse for a model to learn high-quality audio features. To address this problem, the paper draws on time-series tasks and applies the smoothed formulation of Dynamic Time Warping (Soft-DTW), a differentiable variant of DTW, to the cross-modal alignment between speech features and facial mesh vertices. Experiments show that using Soft-DTW as the loss function improves the lip-sync accuracy of the generated facial animation compared with Mean Squared Error (MSE), yielding higher-quality facial animation.
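The contrast between a frame-wise MSE-style error and Soft-DTW can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `soft_min` and `soft_dtw`, the squared Euclidean frame cost, the smoothing value `gamma`, and the one-dimensional toy "mouth opening" curves are all assumptions made for illustration. The sketch computes only the forward value of the loss; training a network with it would additionally need gradients, for example from an automatic differentiation framework.

```python
import numpy as np


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    values = np.asarray(values, dtype=float)
    m = values.min()
    return m - gamma * np.log(np.sum(np.exp(-(values - m) / gamma)))


def soft_dtw(pred, target, gamma=1.0):
    """Soft-DTW value between sequences of shape (n, d) and (m, d)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    n, m = len(pred), len(target)
    # Squared Euclidean cost between every predicted and every target frame.
    cost = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)
    # r[i, j] holds the soft alignment cost of pred[:i] against target[:j].
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma
            )
    return r[n, m]


# Hypothetical toy data: the prediction has the right shape but is one frame early.
target = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0])[:, None]
pred = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0])[:, None]

framewise_error = float(((pred - target) ** 2).sum())  # what an MSE-style loss sees
aligned_error = soft_dtw(pred, target, gamma=0.01)     # tolerant of the time shift

print("frame-wise squared error:", framewise_error)
print("Soft-DTW value:", aligned_error)
```

In this toy case the frame-wise error penalizes the one-frame shift, while Soft-DTW warps the time axis and reports a much smaller discrepancy. Smaller `gamma` values bring the result closer to classic DTW; larger values give a smoother loss surface.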
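Citation: Zhenkai Wang, Chengwei Wang, Yifan Zhang, Haoyuan Li. A Speech-Driven Facial Motion Method Based on Temporal Loss [J]. Computer Science and Application, 2023, 13(12): 2521-2527. https://doi.org/10.12677/CSA.2023.1312251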
