Speech-Driven Facial Generation
DOI: 10.12677/csa.2025.151020
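Author: Haoyuan Li (李昊渊): School of Information Engineering, Hebei GEO University, Shijiazhuang, Hebei; Artificial Intelligence and Machine Learning Laboratory, Hebei GEO University, Shijiazhuang, Hebei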
Keywords: Face Generation, Deep Learning, Wav2vec, Cross-Attention Mechanism, Conditional Convolution
Abstract: Speech-driven face generation aims to produce a talking-face video that preserves the identity of a reference face and matches the content of the input speech. To address the poor identity preservation and weak facial detail of existing methods, this paper proposes LTFG-GAN, a landmark-based model for speech-driven talking-face video generation. First, the model uses an unsupervised pre-trained model fine-tuned for speech recognition as its speech encoder and predicts facial landmarks by combining convolution with attention mechanisms. Second, during face generation, a cross-attention mechanism retrieves information from the original reference face, and conditional convolution together with spatially-adaptive normalization fuses the high-dimensional deformed face features obtained by warping with the original face features. The result is a talking-face video synchronized with the speech. Experimental results show that the method clearly improves the quality of the generated faces.
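The cross-attention step mentioned in the abstract can be illustrated with a minimal sketch of standard scaled dot-product cross-attention. This is not the LTFG-GAN implementation: the function names, the tiny feature vectors, and the choice of deformed-face features as queries with reference-face features as keys and values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention on plain Python lists.

    queries: feature vectors that ask for information
             (assumed here: deformed-face features).
    keys, values: feature vectors that supply information
             (assumed here: reference-face features).
    Returns one output vector per query, each a weighted
    average of the value vectors.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    value_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(value_dim)]
        )
    return outputs


# Hypothetical toy features, two reference positions of dimension 2.
reference_keys = [[1.0, 0.0], [0.0, 1.0]]
reference_values = [[2.0, 0.0], [0.0, 4.0]]

# A zero query attends to both positions equally; a query aligned with
# the first key attends almost entirely to the first value.
uniform_out = cross_attention([[0.0, 0.0]], reference_keys, reference_values)
focused_out = cross_attention([[100.0, 0.0]], reference_keys, reference_values)
```

In this sketch each output is a convex combination of reference features, which is one way a generator could pull identity detail from the reference face into the deformed features.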
Cite this article: Li, H. (2025) Speech-Driven Facial Generation. Computer Science and Application, 15(1), 199-208. https://doi.org/10.12677/csa.2025.151020
