Image Classification Based on ViT-X on Small-Scale Datasets
Abstract: In recent years, the Vision Transformer (ViT) has made remarkable progress in computer vision tasks such as image classification. However, ViT often struggles to model local dependencies within images, and when it is trained on small-scale datasets its weak inductive bias becomes a particular limitation. To address this issue, this paper proposes an improved ViT model, ViT-X. The model introduces a cross-covariance attention (XCA) mechanism to strengthen the modeling of multi-scale contextual global dependencies while reducing the number of parameters without sacrificing performance. On this basis, a novel module, Septh-Wise Convolution (SWConv), is proposed to further strengthen local feature extraction. Experimental results show that ViT-X performs well on classic benchmark datasets such as CIFAR10, reaching a recognition accuracy of 95.6%, which is 1.8% higher than that of the original ViT model.
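The abstract does not give the exact XCA formulation used in ViT-X. As a rough illustration only, the sketch below follows cross-covariance attention as it is commonly defined in the XCiT line of work: queries and keys are L2-normalised along the token axis, and attention is computed between feature channels (a d × d map) rather than between tokens (an N × N map). The function name, the fixed scalar `tau`, and the single-head layout without learned projections are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def cross_covariance_attention(q, k, v, tau=1.0):
    """Single-head XCA sketch (hypothetical helper, not the paper's code).

    q, k, v: arrays of shape (N, d) = (tokens, channels).
    Returns the attended output (N, d) and the channel attention map (d, d).
    """
    eps = 1e-12  # guards against division by zero for all-zero channels
    # L2-normalise every channel (column) across the token axis.
    q_hat = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)
    k_hat = k / (np.linalg.norm(k, axis=0, keepdims=True) + eps)

    # Channel-to-channel similarity: (d, N) @ (N, d) -> (d, d).
    logits = (q_hat.T @ k_hat) / tau

    # Row-wise softmax, shifted by the row maximum for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)

    # Mix the channels of v with the (d, d) map, then restore (N, d) layout.
    out = (attn @ v.T).T
    return out, attn


# Toy usage: 16 tokens with 8 channels each.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 16, 8))
out, attn = cross_covariance_attention(q, k, v)
print(out.shape, attn.shape)
```

Because the attention map is d × d, its size does not grow with the number of tokens, which is the usual argument for XCA being cheaper than token-to-token self-attention on longer token sequences.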
Cite this article: Zhong Shihui (钟士辉). Image Classification Based on ViT-X on Small-Scale Datasets [J]. Computer Science and Application, 2025, 15(7): 17-26. https://doi.org/10.12677/csa.2025.157176
