Hybrid CNN and ViT for Self-Supervised Knowledge Distillation Monocular Depth Estimation Method

doi:10.12677/mos.2024.133260

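Authors: Zheng Qianhui: College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai; Kong Lingjun: Shanghai Publishing and Printing College, Shanghai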
Keywords: Monocular Depth Estimation; Self-Supervised Learning; Knowledge Distillation; Vision Transformer

Abstract: Monocular depth estimation is a challenging task, and existing methods do not efficiently exploit both the long-range correlations and the local information in features. To address this problem, this paper proposes HCVNet, a self-supervised knowledge distillation method for monocular depth estimation that combines a CNN with a ViT (Vision Transformer). HCVNet studies how to combine CNNs and Vision Transformers effectively and designs a hybrid CNN-ViT feature encoder that models local and global context, extracting detailed features that represent the scene more expressively. A channel feature aggregation module captures long-range dependencies and strengthens the perception of scene structure by aggregating highly discriminative features along the channel dimension. Self-supervised knowledge distillation is introduced so that a teacher model with the same architecture supplies additional supervision signals for training the student model, further improving network performance. Experiments on the KITTI and Make3D datasets show that the method outperforms current mainstream methods in depth estimation, generalizes well, and produces depth maps with more complete structure and clearer details.
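
The abstract does not spell out the internals of the channel feature aggregation module, so the following is only a minimal sketch of generic channel-wise attention, not HCVNet's actual design. It assumes a feature map flattened to C channels of N spatial values each; the function names `softmax` and `channel_aggregate`, the dot-product affinity, and the residual connection are all illustrative assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_aggregate(feat):
    """Generic channel attention (an assumption, not the paper's module).

    feat: list of C channels, each a flat list of N spatial values.
    Returns a list of the same shape.
    """
    # Channel affinity: dot product between every pair of channel maps (C x C).
    affinity = [[sum(a * b for a, b in zip(ci, cj)) for cj in feat] for ci in feat]
    # Normalise each row so the weights over channels sum to 1.
    weights = [softmax(row) for row in affinity]
    n = len(feat[0])
    out = []
    for i, ci in enumerate(feat):
        # Re-express channel i as a weighted sum of all channels.
        mixed = [
            sum(weights[i][j] * feat[j][k] for j in range(len(feat)))
            for k in range(n)
        ]
        # Residual connection keeps the original response.
        out.append([x + y for x, y in zip(ci, mixed)])
    return out


# Toy input: 3 channels, 4 spatial positions. Channels 0 and 1 are similar.
feat = [[1.0, 0.0, 1.0, 0.0], [0.9, 0.1, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
out = channel_aggregate(feat)
```

Because every channel attends to every other channel over the whole spatial extent, each output value depends on all positions of the input, which is one way a channel-dimension module can carry long-range information.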

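Cite this article: Zheng Qianhui, Kong Lingjun. Hybrid CNN and ViT for Self-Supervised Knowledge Distillation Monocular Depth Estimation Method[J]. Modeling and Simulation, 2024, 13(3): 2868-2880. https://doi.org/10.12677/mos.2024.133260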

