RGB-T Tracking Algorithm for Unmanned Aerial Vehicles Based on Lightweight Siamese Network
Abstract: Unmanned aerial vehicles (UAVs) are widely used for object tracking, but tracking performance is hard to guarantee in complex environments such as low light and adverse weather. Fusing visible and thermal infrared (RGB-T) data has therefore become a key means of improving tracking performance. This fusion, however, faces several challenges: efficient interaction between heterogeneous features, viewpoint differences, and limited computational resources. This paper proposes SiamTSA (Siamese Network with Temporal and Spatial Attention), a lightweight tracking algorithm based on a Siamese network. First, an improved MobileNetV3-small serves as the backbone, reducing computational cost to suit UAV platforms. Second, a cross-modal temporal-spatial interaction attention module uses temporal attention to model visual style differences and spatial attention to align viewpoint differences, suppressing redundant noise and strengthening cross-modal consistent feature representations. Third, a dual-modal adaptive penalty selection module analyzes changes in the scale and aspect ratio of the predicted boxes to select the better output box, improving the stability of the tracking box. Experiments on the GTOT, RGBT234, and VTUAV datasets show that SiamTSA outperforms mainstream algorithms in both tracking success rate (VTUAV: 67.5%) and real-time speed (56.3 FPS), balancing accuracy and efficiency. The method offers a lightweight solution for UAV multimodal object tracking in complex scenarios.
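The abstract does not give the formula used by the dual-modal adaptive penalty selection module, so the following is only a minimal sketch of the general idea: score each modality's candidate box, penalize it for abrupt scale and aspect-ratio changes relative to the previous box, and keep the better one. The penalty used here is the familiar SiamRPN-style term exp(-k(r·s - 1)), swapped in as a stand-in. All function names, the tuple layout, and the value k = 0.05 are assumptions for illustration, not the authors' implementation.

```python
import math


def change_ratio(r):
    """Symmetric change measure: always >= 1, and exactly 1 when nothing changed."""
    return max(r, 1.0 / r)


def padded_size(w, h):
    """Context-padded box scale, as commonly used in SiamRPN-style trackers."""
    pad = (w + h) / 2.0
    return math.sqrt((w + pad) * (h + pad))


def penalty(box, prev_box, k=0.05):
    """Penalty in (0, 1]; equals 1 when scale and aspect ratio match prev_box.

    box and prev_box are (width, height) tuples; k is a hypothetical
    penalty strength chosen only for this example.
    """
    w, h = box
    pw, ph = prev_box
    s = change_ratio(padded_size(w, h) / padded_size(pw, ph))  # scale change
    r = change_ratio((w / h) / (pw / ph))                      # aspect-ratio change
    return math.exp(-k * (r * s - 1.0))


def select_box(candidates, prev_box, k=0.05):
    """Pick the candidate with the highest penalized score.

    candidates: list of (modality_name, (width, height), raw_score).
    """
    return max(candidates, key=lambda c: c[2] * penalty(c[1], prev_box, k))


if __name__ == "__main__":
    prev = (50.0, 100.0)
    candidates = [
        ("rgb", (52.0, 98.0), 0.80),      # close to the previous box shape
        ("thermal", (90.0, 60.0), 0.85),  # higher raw score, abrupt shape change
    ]
    print(select_box(candidates, prev)[0])
```

In this toy example the thermal candidate has the higher raw score, but its abrupt change in aspect ratio lowers its penalized score below that of the RGB candidate, so the RGB box is kept.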
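Cite this article: Liu Zheyu, Wei Yun. RGB-T Tracking Algorithm for Unmanned Aerial Vehicles Based on Lightweight Siamese Network[J]. Modeling and Simulation, 2025, 14(6): 99-109. https://doi.org/10.12677/mos.2025.146479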
