An Image Inpainting Model Based on Mask Transformer
Abstract: Existing deep learning-based image inpainting networks usually rely on an attention mechanism that fills the region to be repaired with information from intact regions by similarity matching, in order to improve the texture detail of the repaired region. However, the similarity measure used by existing attention mechanisms considers only feature texture and lacks semantic understanding, so it can draw on information from semantically dissimilar regions. To address this problem, this paper proposes an image inpainting network based on a mask transformer. The mask transformer module differs from a basic transformer layer in two main ways: 1) a mask divides the feature map into valid and invalid regions, and a mask attention mechanism is proposed to model the similarity between the regions to be repaired and the intact regions; 2) a weighted fusion of the query set and the similarity matrix is proposed to fill in information for the region to be repaired accurately. Compared with traditional attention mechanisms, the transformer-based method improves the texture quality of the repaired result fairly effectively.
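The mask attention idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the code below assumes plain scaled dot-product similarity, a softmax restricted to valid positions, and a similarity-weighted sum of valid features as a stand-in for the paper's weighted fusion step. The function name `masked_attention_fill`, the flattened list-of-vectors layout, and the convention that 1 marks an intact position are all hypothetical choices made for illustration.

```python
import math


def masked_attention_fill(features, mask):
    """Fill invalid positions with a similarity-weighted mix of valid ones.

    features: list of N feature vectors (lists of floats), one per position
              of a flattened feature map.
    mask:     list of N ints; 1 = intact (valid), 0 = to be repaired (invalid).
              This convention is an assumption, not taken from the paper.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    valid = [j for j, m in enumerate(mask) if m == 1]
    if not valid:
        raise ValueError("mask must contain at least one valid position")

    # Valid positions are copied through unchanged.
    output = [list(f) for f in features]
    for i, m in enumerate(mask):
        if m == 1:
            continue
        # Scaled dot-product similarity between invalid position i
        # and every valid position.
        scores = [
            scale * sum(a * b for a, b in zip(features[i], features[j]))
            for j in valid
        ]
        # Softmax over valid positions only, so invalid positions
        # never contribute to the fill.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Similarity-weighted sum of the valid features.
        output[i] = [
            sum(w * features[j][c] for w, j in zip(weights, valid))
            for c in range(dim)
        ]
    return output


# Toy example: two intact positions and one position to be repaired.
features = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
mask = [1, 1, 0]
filled = masked_attention_fill(features, mask)
```

In this toy case the third position is closer to the first intact feature than to the second, so its filled value leans toward the first. A real network would typically compute queries, keys, and values with learned projections and would start from coarse features in the invalid region; both are omitted here to keep the sketch short.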
Cite this article: Kang Yanting, Wang Zhijie. An Image Inpainting Model Based on Mask Transformer [J]. Computer Science and Application, 2022, 12(1): 83-94. https://doi.org/10.12677/CSA.2022.121010
