Feature Point-Driven Vision Transformer for Driving Behavior Analysis
DOI: 10.12677/mos.2025.145387. Supported by the National Natural Science Foundation of China.
Authors: Tinghe Huang, Qingkui Chen*: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai
Keywords: Driver Behavior Analysis, Object Detection, Attention Mechanisms, ViT
Abstract: To address the limitations of the Vision Transformer (ViT) in local feature capture and computational efficiency, this paper proposes a method that fuses object detection with a visual classification network. Exploiting the sparsity of feature points in driving behavior analysis, the ViT is modified with a phased attention computation strategy: the encoder is restructured so that the global visual feature sequences in the last five layers are replaced with feature-point sequences produced by an object detector, making the architecture better suited to feature point-driven input. The standard ViT positional encoding is also replaced with a joint encoding of each feature point's relative distance and angle with respect to the steering wheel. Furthermore, to compensate for the model's weak capture of facial features, multitask learning is introduced: an auxiliary sub-task that judges whether the driver's face is directed toward the road ahead assists the main behavior classification. The resulting Feature Point-Driven Vision Transformer Multitask Learning model (FViT-MTL) achieves 93.85% accuracy on the SFDDD dataset, a 5.71% improvement over other mainstream visual classification networks and a 1.28% gain over current state-of-the-art methods for driving behavior analysis, effectively improving classification accuracy and ensuring reliable judgment of driving behavior.
Citation: Huang, T.H. and Chen, Q.K. (2025) Feature Point-Driven Vision Transformer for Driving Behavior Analysis. Modeling and Simulation, 14(5), 211-222. https://doi.org/10.12677/mos.2025.145387
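
To make the phased attention strategy concrete, below is a minimal PyTorch sketch under stated assumptions: the early encoder layers attend over the full patch sequence, while the last five layers attend only over a short sequence of detector-derived feature-point tokens plus the class token, which is where the savings on sparse feature points would come from. Only the "last five layers on feature-point sequences" split and the auxiliary face-orientation head come from the abstract; the class name `PhasedViT`, the depth of 12, and the CLS-token concatenation are hypothetical design choices, not the authors' exact implementation.

```python
# A minimal sketch of the phased attention strategy, under illustrative
# assumptions; only the last-five-layer split and the face-orientation
# auxiliary head follow the abstract.
import torch
import torch.nn as nn


class PhasedViT(nn.Module):
    def __init__(self, embed_dim=768, depth=12, late_layers=5,
                 num_heads=12, num_classes=10):
        super().__init__()

        def block():
            return nn.TransformerEncoderLayer(
                embed_dim, num_heads, dim_feedforward=4 * embed_dim,
                batch_first=True, norm_first=True)

        self.early = nn.ModuleList([block() for _ in range(depth - late_layers)])
        self.late = nn.ModuleList([block() for _ in range(late_layers)])
        self.cls_head = nn.Linear(embed_dim, num_classes)  # driving-behavior classes
        self.face_head = nn.Linear(embed_dim, 2)           # auxiliary: facing forward?

    def forward(self, patch_tokens, point_tokens):
        # patch_tokens: (B, 1 + P, D) class token + patch embeddings
        # point_tokens: (B, K, D)     embeddings of detected feature points
        x = patch_tokens
        for blk in self.early:                  # phase 1: global attention
            x = blk(x)
        x = torch.cat([x[:, :1], point_tokens], dim=1)  # keep CLS, swap in points
        for blk in self.late:                   # phase 2: sparse feature points
            x = blk(x)
        cls = x[:, 0]
        return self.cls_head(cls), self.face_head(cls)
```

During training, a joint objective such as L = L_behavior + λ·L_face would couple the auxiliary face-orientation task to the main classifier; the weight λ is an assumed hyperparameter, as the abstract does not report the loss design.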
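
The distance-angle joint positional encoding can be sketched in the same spirit. Assuming each detected feature point is given by image coordinates and the steering-wheel centre is available from the detector, one plausible realization embeds (distance, cos θ, sin θ) relative to the wheel and adds the result to the corresponding token. The module name `WheelRelativeEncoding`, the three-feature parameterization, and the two-layer MLP are assumptions, since the abstract does not give the exact formulation.

```python
# A hypothetical realization of the steering-wheel-relative
# distance-angle joint encoding; the parameterization and MLP are
# assumptions, not the paper's reported formulation.
import torch
import torch.nn as nn


class WheelRelativeEncoding(nn.Module):
    """Embed each feature point by its distance and angle to the
    steering-wheel centre; the result is added to the point's token."""

    def __init__(self, embed_dim: int):
        super().__init__()
        # 3 inputs: distance (ideally normalized by image size),
        # cos(angle), sin(angle)
        self.proj = nn.Sequential(
            nn.Linear(3, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, points: torch.Tensor, wheel_center: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 2) keypoint coordinates; wheel_center: (B, 2)
        rel = points - wheel_center.unsqueeze(1)           # offsets from wheel centre
        dist = rel.norm(dim=-1, keepdim=True)              # (B, N, 1)
        angle = torch.atan2(rel[..., 1:2], rel[..., 0:1])  # (B, N, 1)
        feats = torch.cat([dist, angle.cos(), angle.sin()], dim=-1)
        return self.proj(feats)                            # (B, N, D)
```

The output would be added to the detector's point embeddings before the late encoder phase, e.g. `point_tokens = point_features + enc(points, wheel_center)`.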
