Fine-Grained Image Classification Model Based on Transformer and Enhanced Local Features
DOI: 10.12677/mos.2024.134426. Supported by the National Natural Science Foundation of China.
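Authors: Li Ye (李烨), Cai Jiaqi (蔡家麒): School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai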
Keywords: Fine-Grained Image Classification, Vision Transformer, Local Feature, Deformable Convolution, Self-Attention Modules
Abstract: Vision Transformer (ViT) has been widely applied to fine-grained visual classification, but it captures local information poorly. To address this, a new Transformer-based fine-grained image classification model with enhanced local features is proposed. First, an attention embedding module uses deformable convolution and an attention module to convert the original image into features that emphasize important information; these features are then embedded into the model, which improves the effective local features of the input. Second, an enhanced self-attention module is added to the original ViT model so that global and local dependencies are processed simultaneously; combining the self-attention mechanism with convolution operations handles local features better. Finally, cross-entropy loss is combined with contrastive loss to optimize for the subtle differences between subcategories, reducing the similarity between classification tokens with different labels and increasing the similarity between those with the same label as far as possible. The proposed algorithm achieves recognition accuracies of 91.8%, 90.1%, and 90.3% on the CUB-200-2011, Stanford Dogs, and NABirds fine-grained image datasets, respectively, surpassing a number of leading fine-grained image classification methods.
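The combined loss described in the abstract can be illustrated with a small sketch. The abstract does not give the formula, so everything specific here is an assumption made for illustration: the margin-based cosine form of the contrastive term, the `margin` value of 0.4, the `weight` that balances the two terms, and all function names. It shows only the stated goal, which is to push classification tokens with the same label toward high similarity and tokens with different labels toward low similarity, on top of an ordinary cross-entropy term.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def contrastive_loss(tokens, labels, margin=0.4):
    """Hypothetical margin-based contrastive term over classification tokens.

    Same-label pairs are penalized by (1 - similarity); different-label
    pairs are penalized only when their similarity exceeds the margin.
    The result is averaged over all N * N ordered pairs in the batch.
    """
    n = len(tokens)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = cosine(tokens[i], tokens[j])
            if labels[i] == labels[j]:
                total += 1.0 - sim
            else:
                total += max(sim - margin, 0.0)
    return total / (n * n)


def combined_loss(batch_logits, tokens, labels, weight=1.0, margin=0.4):
    """Mean cross-entropy plus a weighted contrastive term (assumed weighting)."""
    ce = sum(cross_entropy(l, y) for l, y in zip(batch_logits, labels)) / len(labels)
    return ce + weight * contrastive_loss(tokens, labels, margin)
```

With this form, a batch whose same-label tokens are identical and whose different-label tokens are orthogonal incurs no contrastive penalty, while two identical tokens carrying different labels are penalized by their similarity minus the margin.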
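Cite this article: Li, Y. and Cai, J.Q. (2024) Fine-Grained Image Classification Model Based on Transformer and Enhanced Local Features. Modeling and Simulation, 13(4), 4702-4714. https://doi.org/10.12677/mos.2024.134426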
