Image Retrieval with Deep Orthogonal Fusion of Self-Similarity Descriptor and Salient Features
DOI: 10.12677/mos.2025.142140. Supported by research project funding.
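Authors: Hao Chen, Yun Wei: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai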
Keywords: Image Retrieval, Self-Similarity Structure, Attention Mechanism, Orthogonal Fusion
Abstract: In complex scenes, images have intricate content and rich detail, so features extracted by deep learning networks often fail to represent the key information in the image. This paper proposes an image retrieval model that fuses orthogonal salient features with self-similarity descriptors. A self-similarity structure branch extracts local self-similarity structural features and encodes them into compact self-similarity descriptors that describe the structural information within the image. An attention branch treats the pixels at the same spatial position across all channels of the feature map as a vector, uses norm-based attention to generate vectors containing salient features, and then enhances these salient features through self-attention and cross-attention. Finally, an orthogonal fusion module combines the structural and salient features to produce an effective representation of images in complex scenes. Experiments show that fusing salient and structural features improves the performance of image retrieval based on global representations.
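The abstract does not give formulas for norm-based attention or the orthogonal fusion step, so the following is only a minimal sketch under stated assumptions: attention weights are taken as a softmax over the L2 norms of the per-position channel vectors, and fusion removes from the salient vector its projection onto the structural vector before concatenating the two. The function names, the softmax weighting, the direction of the projection, and the concatenation order are all hypothetical choices for illustration and may differ from the authors' model.

```python
import math


def norm_attention(feature_map):
    """Assumed form of norm-based attention.

    feature_map: list of spatial positions, each a channel vector (list of floats).
    Returns one weight per position: a softmax over the vectors' L2 norms.
    """
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in feature_map]
    peak = max(norms)  # subtract the max for numerical stability
    exps = [math.exp(n - peak) for n in norms]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool(feature_map, weights):
    """Pool the per-position vectors into a single vector using the weights."""
    dim = len(feature_map[0])
    return [sum(w * vec[c] for w, vec in zip(weights, feature_map))
            for c in range(dim)]


def orthogonal_component(salient, structural):
    """Return the part of `salient` that is orthogonal to `structural`.

    Assumption: the projection is taken onto the structural vector.
    """
    dot = sum(a * b for a, b in zip(salient, structural))
    sq_norm = sum(b * b for b in structural)
    if sq_norm == 0.0:
        return list(salient)  # nothing to project onto
    scale = dot / sq_norm
    return [a - scale * b for a, b in zip(salient, structural)]


def fuse(salient, structural):
    """Assumed fusion: concatenate structural vector and orthogonal salient part."""
    return list(structural) + orthogonal_component(salient, structural)


# Toy example: three spatial positions, two channels (values are made up).
feature_map = [[3.0, 4.0], [0.0, 1.0], [1.0, 0.0]]
weights = norm_attention(feature_map)
salient = weighted_pool(feature_map, weights)
structural = [1.0, 0.0]  # stand-in for a self-similarity descriptor
fused = fuse(salient, structural)
```

In this sketch the orthogonal step keeps only the information in the salient vector that the structural vector does not already carry, which is one common motivation for orthogonal fusion; the model described in the abstract additionally applies self-attention and cross-attention, which are omitted here.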
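Cite this article: Chen, H. and Wei, Y. (2025) Image Retrieval with Deep Orthogonal Fusion of Self-Similarity Descriptor and Salient Features. Modeling and Simulation, 14(2), 157-170. https://doi.org/10.12677/mos.2025.142140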
