Occluded Image Classification Based on Masked Image Modeling
DOI: 10.12677/csa.2025.1510265. Supported by research project funding.
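Authors: Xiaoman Fan (樊晓曼), Qilu Zhao* (赵其鲁), Qinglong Fu (付庆龙): College of Computer Science and Technology, Qingdao University, Qingdao, Shandong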
Keywords: Occluded Image Classification, Masked Image Modeling, Image Segmentation, Visual Vocabulary
Abstract: Large-area occlusion causes local information loss and semantic confusion, a long-standing challenge in image classification. To improve classification accuracy on occluded images, this paper proposes SMIM-Net, a novel and robust occluded image classification framework based on masked image modeling, which uses a semantic-aware masking strategy to strengthen the model's ability to reason about occluded regions. The framework uses an instance segmentation model to extract semantic regions with precise boundaries as the basic masking units. A random masking strategy then constructs contexts with missing semantics, while a visual dictionary built with an unsupervised clustering algorithm provides high-level semantic supervision, forcing the model to infer the occluded semantic content from the unmasked regions. Experiments on the Pascal and MS-COCO datasets show that SMIM-Net improves average classification accuracy under occlusion by 15.7% and 10.9%, respectively, over the baseline. Under heavy occlusion (60% to 80%), classification accuracy reaches 90.8% with real object-segment occluders (Pascal-o) and 88.7% with real occlusion (MS-COCO), exceeding the best existing methods by 1.5% and 3.8%, respectively. This work offers a new paradigm for occlusion-robust classification.
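The pipeline summarized in the abstract (segment-level masking units, random masking, and a clustered visual dictionary as the supervision target) can be illustrated with a minimal sketch. This is not the authors' implementation. All function names are hypothetical, the segmentation map is a toy grid rather than the output of an instance segmentation model, the features are random stand-ins, and plain k-means is assumed because the abstract only says "unsupervised clustering".

```python
import numpy as np

def mask_segments(seg_map, mask_ratio, rng):
    """Hide whole segments, in random order, until about mask_ratio of pixels are covered."""
    ids, counts = np.unique(seg_map, return_counts=True)
    target = mask_ratio * seg_map.size
    chosen, covered = [], 0
    for i in rng.permutation(len(ids)):
        if covered >= target:
            break
        chosen.append(ids[i])
        covered += counts[i]
    return np.isin(seg_map, chosen)  # True marks a masked pixel

def assign_words(features, centroids):
    """Index of the nearest centroid ("visual word") for each feature vector."""
    dist = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)

def build_dictionary(features, k, rng, iters=20):
    """Plain k-means (an assumption); returns k centroids used as the visual dictionary."""
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign_words(features, centroids)
        for j in range(k):
            members = features[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(0)

# Toy segmentation map: an 8x8 image split into four 4x4 segments with ids 0..3.
seg_map = np.repeat(np.repeat(np.arange(4).reshape(2, 2), 4, axis=0), 4, axis=1)
mask = mask_segments(seg_map, mask_ratio=0.5, rng=rng)

# Stand-in segment features pooled over a training set, clustered into 8 visual words.
train_feats = rng.normal(size=(200, 16))
centroids = build_dictionary(train_feats, k=8, rng=rng)

# Stand-in features for this image's four segments; masked segments get word ids as targets.
seg_feats = rng.normal(size=(4, 16))
masked_ids = np.unique(seg_map[mask])
targets = assign_words(seg_feats[masked_ids], centroids)
```

In a real training loop the model would see only the unmasked pixels and be trained to predict `targets` for the masked segments, which is the sense in which the dictionary supplies high-level semantic supervision.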
Cite this article: Fan, X., Zhao, Q. and Fu, Q. Occluded Image Classification Based on Masked Image Modeling [J]. Computer Science and Application, 2025, 15(10): 251-265. https://doi.org/10.12677/csa.2025.1510265
