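[1] Nian, F., Wang, W., Wang, Y., et al. (2021) Speech-Driven Talking Face Video Generation Based on Keypoint Representation. Pattern Recognition and Artificial Intelligence, 34(6), 572-580. (In Chinese)
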
[2] Chung, J.S., Jamaludin, A. and Zisserman, A. (2017) You Said That? arXiv: 1705.02966.

[3] Prajwal, K.R., Mukhopadhyay, R., Philip, J., et al. (2019) Towards Automatic Face-to-Face Translation. Proceedings of the 27th ACM International Conference on Multimedia, Nice, 21-25 October 2019, 1428-1436.

[4] Prajwal, K.R., Mukhopadhyay, R., Namboodiri, V.P. and Jawahar, C.V. (2020) A Lip Sync Expert Is All You Need for Speech to Lip Generation in the Wild. Proceedings of the 28th ACM International Conference on Multimedia, Seattle, 12-16 October 2020, 484-492.

[5] Chung, J.S. and Zisserman, A. (2017) Out of Time: Automated Lip Sync in the Wild. In: Chen, C.S., Lu, J. and Ma, K.K., Eds., Computer Vision – ACCV 2016 Workshops, Springer, 251-263.

[6] Cheng, K., Cun, X., Zhang, Y., Xia, M., Yin, F., Zhu, M., et al. (2022) VideoReTalking: Audio-Based Lip Synchronization for Talking Head Video Editing in the Wild. SIGGRAPH Asia 2022 Conference Papers, Daegu, 6-9 December 2022, 1-9.

[7] Suwajanakorn, S., Seitz, S.M. and Kemelmacher-Shlizerman, I. (2017) Synthesizing Obama: Learning Lip Sync from Audio. ACM Transactions on Graphics, 36, 1-13.

[8] Zhang, X. and Weng, L. (2020) Realistic Speech-Driven Talking Video Generation with Personalized Pose. Complexity, 2020, Article ID: 6629634.

[9] Guo, Y., Chen, K., Liang, S., Liu, Y., Bao, H. and Zhang, J. (2021) AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, 10-17 October 2021, 5764-5774.

[10] Zhang, Z., Hu, Z., Deng, W., Fan, C., Lv, T. and Ding, Y. (2023) DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video. Proceedings of the AAAI Conference on Artificial Intelligence, 37, 3543-3551.

[11] Baevski, A., Zhou, Y., Mohamed, A., et al. (2020) wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, 6-12 December 2020, 12449-12460.

[12] Peng, Z., Huang, W., Gu, S., Xie, L., Wang, Y., Jiao, J., et al. (2021) Conformer: Local Features Coupling Global Representations for Visual Recognition. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, 10-17 October 2021, 357-366.

[13] Zhong, W., Fang, C., Cai, Y., Wei, P., Zhao, G., Lin, L., et al. (2023) Identity-Preserving Talking Face Generation with Landmark and Appearance Priors. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, 17-24 June 2023, 9729-9738.

[14] Wang, T., Liu, M., Zhu, J., Tao, A., Kautz, J. and Catanzaro, B. (2018) High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, 18-23 June 2018, 8798-8807.

[15] Li, J., Tu, W. and Xiao, L. (2023) FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion. ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, 4-10 June 2023, 1-5.

[16] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017) Attention Is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, 4-9 December 2017, 6000-6010.

[17] Liu, X., Yin, G., Shao, J., et al. (2019) Learning to Predict Layout-to-Image Conditional Convolutions for Semantic Image Synthesis. Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, 8-14 December 2019, 570-580.

[18] Johnson, J., Alahi, A. and Fei-Fei, L. (2016) Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In: Leibe, B., Matas, J., Sebe, N. and Welling, M., Eds., Computer Vision – ECCV 2016, Springer, 694-711.

[19] Afouras, T., Chung, J.S., Senior, A., Vinyals, O. and Zisserman, A. (2018) Deep Audio-Visual Speech Recognition. arXiv: 1809.02108.

[20] Wang, J., Qian, X., Zhang, M., Tan, R.T. and Li, H. (2023) Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, 17-24 June 2023, 14653-14662.