Deep Hash Retrieval Model Based on Improved Swin Transformer
Abstract: With the rapid development of the Internet and multimedia technology, digital images have become one of the main carriers of information in modern society. People generate and consume vast amounts of image data every day, from photo sharing on social media to image analysis in professional fields, and both the scale and the complexity of this data keep growing. Demand for retrieving it is growing just as fast, especially where specific content must be located and extracted quickly. Real-world image data, however, often follows a long-tail distribution: some categories are highly abundant, while others are extremely scarce. This imbalance poses a major challenge for image retrieval, and for retrieval based on deep hashing in particular, where handling long-tail distributions effectively has become a key research problem. To address it, this paper builds, at the model level, a hash retrieval model based on an improved Swin Transformer and evaluates its performance as a long-tail hash retrieval model in realistic scenarios. Specifically, to compensate for insufficient local feature extraction in long-tail image retrieval, the method uses a two-stream network architecture that fuses the local features of a CNN with the global features of a Transformer, and it defines a multi-objective loss function on the output of the hash layer. Together, these strategies combine the local detail features produced by convolution with the global contextual features produced by self-attention. Experimental results show that the model achieves high-performance hash-based image retrieval and outperforms current mainstream models, obtaining the best or second-best results on each of the datasets evaluated.
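The retrieval pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two streams are fused or how the hash layer is parameterized, so concatenation as the fusion rule, a single linear projection followed by tanh, and all function names (`fuse`, `hash_layer`, `binarize`, `hamming`, `rank`) are assumptions made for illustration. Only sign binarization and ranking by Hamming distance are standard practice in hash-based retrieval.

```python
import math


def fuse(local_feat, global_feat):
    """Fuse a CNN-style local descriptor with a Transformer-style global one.

    Concatenation is an assumed fusion rule; the paper may use another.
    """
    return list(local_feat) + list(global_feat)


def hash_layer(features, weights):
    """Project features to K values in (-1, 1).

    tanh is a common continuous relaxation of sign(); `weights` holds
    one row per hash bit (hypothetical, untrained values in this sketch).
    """
    return [math.tanh(sum(w * x for w, x in zip(row, features)))
            for row in weights]


def binarize(relaxed):
    """Turn relaxed hash-layer outputs into a {-1, +1} code."""
    return [1 if v >= 0 else -1 for v in relaxed]


def hamming(a, b):
    """Number of positions where two codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def rank(query_code, database_codes):
    """Database indices sorted by Hamming distance to the query."""
    return sorted(range(len(database_codes)),
                  key=lambda i: hamming(query_code, database_codes[i]))


# Toy database of three 4-bit codes and one query.
database = [[-1, -1, 1, 1], [1, 1, 1, -1], [1, -1, 1, -1]]
query = binarize([0.9, 0.2, 0.4, -0.7])  # [1, 1, 1, -1]
print(rank(query, database))             # [1, 2, 0]
```

In this toy run the query matches the second database code exactly, differs from the third in one bit and from the first in three, which gives the order shown. A trained model would learn the projection weights under the multi-objective loss that the abstract mentions; the sketch does not model that loss.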
Cite this article: Li Yihao, Wang Zhijie. Deep Hash Retrieval Model Based on Improved Swin Transformer [J]. Computer Science and Application, 2025, 15(6): 206-219. https://doi.org/10.12677/csa.2025.156171
