Review of CNN-Transformer Hybrid Models in Computer Vision
DOI: 10.12677/MOS.2023.124336. Supported by the National Natural Science Foundation of China.
Authors: Yangyi Dai, Kang He, Qi Hu*: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai; Shanghai Key Laboratory of Modern Optical Systems, University of Shanghai for Science and Technology, Shanghai; Kai Huang: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai
Keywords: Computer Vision, Convolutional Neural Network, Transformer, Hybrid Model, Deep Learning
Abstract: In recent years, research on CNN-Transformer hybrid models in computer vision has become one of the hottest topics in the field. These models combine the respective advantages of Convolutional Neural Networks (CNNs) and Transformers to improve performance on a variety of computer vision tasks. This review first gives a brief overview of CNNs and Transformers and analyzes their strengths and weaknesses. It then classifies and describes the common hybridization approaches by introducing and analyzing high-performing CNN-Transformer hybrid models from domestic and international research in recent years; these approaches aim to leverage the local feature extraction capability of CNNs and the global information modeling capability of Transformers. Finally, it looks ahead to the challenges and development trends that CNN-Transformer hybrid models will face in computer vision and other fields.
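To make the surveyed hybridization pattern concrete, the following is a minimal sketch of the common serial design: a convolutional stem extracts local features, which are then flattened into tokens and processed by a Transformer encoder for global modeling. It assumes PyTorch, and all module names, layer sizes, and hyperparameters are illustrative rather than taken from any specific model discussed in this review.

```python
# Minimal serial CNN-Transformer hybrid sketch (illustrative; PyTorch assumed).
# The CNN stem supplies local inductive bias; the Transformer encoder supplies
# a global receptive field via self-attention over the resulting tokens.
import torch
import torch.nn as nn

class HybridBackbone(nn.Module):
    def __init__(self, num_classes=1000, dim=256, depth=4, heads=8):
        super().__init__()
        # Convolutional stem: strided convolutions capture local structure
        # and downsample the input (224x224 -> 14x14 feature map).
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, dim, kernel_size=3, stride=4, padding=1),
        )
        # Transformer encoder: self-attention relates every token to every
        # other token. Positional embeddings are omitted here for brevity;
        # practical hybrids add them or rely on convolutions for position.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        feat = self.stem(x)                       # (B, dim, H', W') local features
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H'*W', dim) token sequence
        tokens = self.encoder(tokens)             # global interaction
        return self.head(tokens.mean(dim=1))      # mean-pool tokens, classify

model = HybridBackbone(num_classes=10)
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 10])
```

The models surveyed in this review differ mainly in where the two components sit and how information flows between them; the serial stem-plus-encoder form above is only one of the common arrangements the review classifies.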
Article citation: Dai, Y.Y., He, K., Hu, Q. and Huang, K. (2023) Review of CNN-Transformer Hybrid Models in Computer Vision. Modeling and Simulation, 12(4), 3657-3672. https://doi.org/10.12677/MOS.2023.124336
