Hierarchical Explainability Analysis of Spatiotemporal Information Fusion in BEVFormer
Abstract: BEVFormer fuses multi-camera spatial information with historical BEV temporal information, but the basis for its decisions during this fusion is opaque. This paper builds a hierarchical attribution method for BEVFormer on the relevance-propagation idea of Generic Attention-model Explainability (GAE). Because BEVFormer propagates information through multiple spatiotemporal stages, unlike a standard Transformer, the method establishes three separate attribution paths: from detection queries to current BEV features, from current BEV features to image features, and from current BEV features to historical BEV features. Deformable attention does not directly yield a regular attention matrix, so the method combines sampling-point weights with their gradients and maps the positive contribution of each sampling point onto the regular BEV grid or the image feature space. Experiments on the nuScenes dataset show that the method localizes target-related BEV regions, key camera views, and local historical BEV regions. Faithfulness experiments and single-branch full-masking experiments further indicate that image features mainly support category and geometric attribute estimation, while historical BEV features contribute more to velocity estimation and to maintaining motion continuity.
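The mapping of sampling-point contributions onto a regular grid can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `splat_positive_contributions`, the tuple layout `(x, y, weight, grad)`, the use of normalized coordinates in [0, 1], and the bilinear distribution over the four nearest cells are all assumptions made for illustration. The only elements taken from the abstract are the use of the sampling weight together with its gradient and the restriction to positive contributions.

```python
import math


def splat_positive_contributions(points, grid_h, grid_w):
    """Accumulate max(weight * grad, 0) of each sampling point onto a grid.

    points: iterable of (x, y, weight, grad); x and y are normalized to
            [0, 1] (an assumed convention), weight is the attention weight
            of the sampling point, grad is the gradient of the target score
            with respect to that weight.
    Returns a grid_h x grid_w list of lists with non-negative relevance.
    """
    grid = [[0.0] * grid_w for _ in range(grid_h)]
    for x, y, weight, grad in points:
        # Keep only positive contributions; negative ones are discarded.
        contrib = max(weight * grad, 0.0)
        if contrib == 0.0:
            continue
        # Continuous cell coordinates, with cell centers at integer values.
        px = x * grid_w - 0.5
        py = y * grid_h - 0.5
        x0, y0 = math.floor(px), math.floor(py)
        fx, fy = px - x0, py - y0
        # Bilinear shares for the four cells surrounding the sampling point.
        neighbours = (
            (0, 0, (1 - fy) * (1 - fx)),
            (0, 1, (1 - fy) * fx),
            (1, 0, fy * (1 - fx)),
            (1, 1, fy * fx),
        )
        for dy, dx, share in neighbours:
            row, col = y0 + dy, x0 + dx
            # Shares that fall outside the grid are dropped.
            if 0 <= row < grid_h and 0 <= col < grid_w:
                grid[row][col] += contrib * share
    return grid


# Two hypothetical sampling points for one query on a 4 x 4 grid:
# the first has a positive contribution, the second a negative one.
points = [(0.625, 0.375, 0.5, 2.0), (0.5, 0.5, 0.4, -1.0)]
relevance = splat_positive_contributions(points, 4, 4)
```

In this sketch the second point is ignored because its weight-gradient product is negative, and the first point lands exactly on the center of cell (row 1, column 2). Maps produced this way for successive stages could in principle be chained in the GAE style, although the abstract does not state how the paper composes them.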
Cite this article: Yu, X. (2026) Hierarchical Explainability Analysis of Spatiotemporal Information Fusion in BEVFormer. Computer Science and Application, 16(6), 140-154. https://doi.org/10.12677/csa.2026.166215

References

[1] Mao, J.G., Shi, S.S., Wang, X.G., et al. (2023) 3D Object Detection for Autonomous Driving: A Comprehensive Survey. International Journal of Computer Vision, 131, 1909-1963.
[2] Roddick, T., Kendall, A. and Cipolla, R. (2019) Orthographic Feature Transform for Monocular 3D Object Detection. Proceedings of the British Machine Vision Conference, Cardiff, 9-12 September 2019, Article No. 285.
[3] Philion, J. and Fidler, S. (2020) Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D. In: Vedaldi, A., et al., Eds., Proceedings of the European Conference on Computer Vision, Springer International Publishing, 194-210.
[4] Huang, J.J., Huang, G., Zhu, Z., et al. (2021) BEVDet: High-Performance Multi-Camera 3D Object Detection in Bird-Eye-View. https://arxiv.org/abs/2112.11790
[5] Li, Y., Ge, Z., Yu, G., Yang, J., Wang, Z., Shi, Y., et al. (2023) BEVDepth: Acquisition of Reliable Depth for Multi-View 3D Object Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 37, 1477-1485.
[6] Li, Y., Bao, H., Ge, Z., Yang, J., Sun, J. and Li, Z. (2023) BEVStereo: Enhancing Depth Estimation in Multi-View 3D Object Detection with Temporal Stereo. Proceedings of the AAAI Conference on Artificial Intelligence, 37, 1486-1494.
[7] Huang, J.J. and Huang, G. (2022) BEVDet4D: Exploit Temporal Cues in Multi-Camera 3D Object Detection. https://arxiv.org/abs/2203.17054
[8] Li, Z.Q., Wang, W.H., Li, H.Y., et al. (2022) BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers. In: Avidan, S., et al., Eds., Proceedings of the European Conference on Computer Vision, Springer, 1-18.
[9] Simonyan, K., Vedaldi, A. and Zisserman, A. (2014) Deep inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. Workshop at International Conference on Learning Representations, Banff, 14-16 April 2014, 1-8.
[10] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D. and Batra, D. (2017) Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. 2017 IEEE International Conference on Computer Vision (ICCV), Venice, 22-29 October 2017, 618-626.
[11] Sundararajan, M., Taly, A. and Yan, Q.Q. (2017) Axiomatic Attribution for Deep Networks. Proceedings of the 34th International Conference on Machine Learning, Sydney, 6-11 August 2017, 3319-3328.
[12] Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K. and Samek, W. (2015) On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLOS ONE, 10, e0130140.
[13] Shrikumar, A., Greenside, P. and Kundaje, A. (2017) Learning Important Features through Propagating Activation Differences. Proceedings of the 34th International Conference on Machine Learning, Sydney, 6-11 August 2017, 3145-3153.
[14] Ribeiro, M.T., Singh, S. and Guestrin, C. (2016) “Why Should I Trust You?” Explaining the Predictions of Any Classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, 13-17 August 2016, 1135-1144.
[15] Lundberg, S.M. and Lee, S.I. (2017) A Unified Approach to Interpreting Model Predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, Curran Associates, 4765-4774.
[16] Xu, K., Ba, J., Kiros, R., et al. (2015) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proceedings of the 32nd International Conference on Machine Learning, Lille, 6-11 July 2015, 2048-2057.
[17] Choi, E., Bahadori, M.T., Sun, J., et al. (2016) RETAIN: An Interpretable Predictive Model for Healthcare Using Reverse Time Attention Mechanism. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, Curran Associates, 3504-3512.
[18] Jain, S. and Wallace, B.C. (2019) Attention Is Not Explanation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 3543-3556.
[19] Abnar, S. and Zuidema, W. (2020) Quantifying Attention Flow in Transformers. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 4190-4197.
[20] Chefer, H., Gur, S. and Wolf, L. (2021) Transformer Interpretability Beyond Attention Visualization. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 19-25 June 2021, 782-791.
[21] Chefer, H., Gur, S. and Wolf, L. (2021) Generic Attention-Model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 11-17 October 2021, 397-406.
[22] Ali, A., Schnake, T., Eberle, O., et al. (2022) XAI for Transformers: Better Explanations through Conservative Propagation. Proceedings of the 39th International Conference on Machine Learning, Baltimore, 17-23 July 2022, 435-451.
[23] Ferrando, J., Gállego, G.I., Tsiamas, I. and Costa-Jussà, M.R. (2023) Explaining How Transformers Use Context to Build Predictions. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Volume 1, 5486-5513.
[24] Achtibat, R., Hatefi, S.M.V., Dreyer, M., et al. (2024) AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers. Proceedings of the 41st International Conference on Machine Learning, Vienna, 21-27 July 2024, 135-168.
[25] Arras, L., Puri, B., Kahardipraja, P., et al. (2025) A Close Look at Decomposition-Based XAI-Methods for Transformer Language Models. https://arxiv.org/abs/2502.15886
[26] Petsiuk, V., Jain, R., Manjunatha, V., Morariu, V.I., Mehra, A., Ordonez, V., et al. (2021) Black-Box Explanation of Object Detectors via Saliency Maps. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 19-25 June 2021, 11443-11452.