Image Editing Model Based on Multi-Modal Feature Fusion
Abstract: To make edits more controllable, most current condition-guided image editing methods use text as the guiding condition. Text and images are different modalities, however, so effectively retrieving and fusing information across them remains a major challenge in image editing. To address the lack of deep interaction between image and text information, an image editing model based on multi-modal feature fusion (MFF) is proposed. First, the multi-modal model BLIP is used to adaptively optimize the editing text, steering the generative model toward images that better match the intended semantics. Next, the source text embedding and the U-Net parameters are trained jointly, and a cross-attention mechanism inside the U-Net strengthens visual perception and fuses image and text features. Finally, the fused features are passed through a pre-trained text-to-image diffusion model to produce the edited image described by the text. Quantitative experiments on the COCO dataset show that, relative to the best results among the baseline models, MFF improves MS-SSIM by 12.5% and reduces LPIPS by 26.2%, indicating that the model fuses image and text features more effectively.
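The cross-attention fusion step named in the abstract can be illustrated with a minimal sketch. This is generic scaled dot-product cross-attention, not the authors' implementation: image features act as queries and text-token embeddings act as keys and values. The function names, the tiny feature sizes, and the identity projection matrices are all assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    image_feats: Matrix,  # (n_img, d_img): one row per spatial position
    text_embeds: Matrix,  # (n_txt, d_txt): one row per text token
    w_q: Matrix,          # (d_img, d): query projection (hypothetical weights)
    w_k: Matrix,          # (d_txt, d): key projection (hypothetical weights)
    w_v: Matrix,          # (d_txt, d): value projection (hypothetical weights)
) -> Tuple[Matrix, Matrix]:
    """Return (fused features, attention weights)."""
    q = matmul(image_feats, w_q)
    k = matmul(text_embeds, w_k)
    v = matmul(text_embeds, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))  # (n_img, n_txt)
    weights = [softmax([s * scale for s in row]) for row in scores]
    fused = matmul(weights, v)        # text information gathered per image position
    return fused, weights


# Toy example: 2 image positions, 3 text tokens, identity projections (assumed).
identity = [[1.0, 0.0], [0.0, 1.0]]
image_feats = [[1.0, 0.0], [0.0, 1.0]]
text_embeds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = cross_attention(image_feats, text_embeds, identity, identity, identity)
```

Each row of `weights` is a probability distribution over text tokens, so each image position receives a weighted mix of the text values. In a real U-Net the projections are learned and the operation is applied per attention head at several resolutions.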
Citation: Du Jiajun, Lan Hong. Image Editing Model Based on Multi-Modal Feature Fusion [J]. Computer Science and Application, 2024, 14(6): 164-176. https://doi.org/10.12677/csa.2024.146153
