A Multimodal Trajectory Prediction Method Based on Future Latent State Cross-Attention

DOI: 10.12677/airr.2026.152064
Author: Wang Qicheng (王其程), School of Automobile and Transportation, Xihua University, Chengdu, Sichuan
Keywords: Trajectory Prediction; Multimodal Trajectory; Spatiotemporal Interaction; Future Latent State

Abstract: Accurately predicting the future trajectories of multiple traffic participants is crucial for safe and reliable autonomous driving. Recent trajectory prediction methods achieve strong performance by modeling agent interactions, but they focus mainly on historical interactions and neglect the complex dependencies that may arise among agents' future motions. As a result, predicted trajectories may lack global interaction consistency in dense traffic scenes. To address this, this paper proposes a trajectory prediction framework based on future trajectory interaction modeling, which explicitly models multimodal interactions in the future latent space during decoding. Specifically, the method decomposes trajectory prediction into two stages: coarse-grained prediction based on history, and future trajectory refinement based on global interaction consistency. The first stage uses a vectorized scene representation and a hierarchical encoding mechanism that fuses local context features with global scene interaction features, modeling the rich historical spatiotemporal interactions found in diverse traffic scenes. To capture interactions beyond the historical observations, the second stage introduces a Future Latent Cross-Attention (FLCA) module into the decoder, together with a cross-agent interaction mask that lets each predicted mode attend to the future motions of other agents while preventing interference among the different modes of the same agent. Experiments on the large-scale autonomous driving benchmark Argoverse1 show that the method produces trajectory predictions with better global interaction consistency and accuracy.
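
The cross-agent interaction mask described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the flat token layout (agent index times number of modes plus mode index), the single-head attention without learned projections, the residual connection, and the choice to let each token also attend to itself (so that no attention row is empty) are all assumptions made for illustration.

```python
import numpy as np


def build_interaction_mask(num_agents: int, num_modes: int) -> np.ndarray:
    """Boolean mask of shape (N*K, N*K); True means the query may attend to the key.

    Hypothetical layout: token index = agent * num_modes + mode.
    """
    agent_id = np.repeat(np.arange(num_agents), num_modes)
    # A mode may look at every mode of every *other* agent.
    other_agent = agent_id[:, None] != agent_id[None, :]
    # Assumption: a token may also attend to itself, so no row is fully masked.
    same_token = np.eye(num_agents * num_modes, dtype=bool)
    return other_agent | same_token


def masked_future_attention(latents: np.ndarray, mask: np.ndarray):
    """Single-head scaled dot-product attention over future latent states.

    latents: array of shape (N*K, D). Returns (updated latents, attention weights).
    Learned query/key/value projections are omitted to keep the sketch short.
    """
    dim = latents.shape[-1]
    scores = latents @ latents.T / np.sqrt(dim)
    # Blocked pairs (other modes of the same agent) get zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Assumed residual connection: each mode keeps its own state.
    return latents + weights @ latents, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mask = build_interaction_mask(num_agents=3, num_modes=2)
    latents = rng.normal(size=(6, 4))
    refined, weights = masked_future_attention(latents, mask)
    print(mask.astype(int))
    print(refined.shape)
```

With this mask, the sibling modes of one agent never exchange information directly, so each mode is refined only against the possible futures of the other agents.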

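Cite this article: Wang, Q. (2026) A Multimodal Trajectory Prediction Method Based on Future Latent State Cross-Attention. Artificial Intelligence and Robotics Research, 15(2), 672-683. https://doi.org/10.12677/airr.2026.152064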

