Multi-Agent Deep Deterministic Policy Gradient Algorithm Based on Hierarchical Strategies and World Models
DOI: 10.12677/csa.2026.161009. Supported by research project funding.
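Authors: Zhang Huadong (张华东), Wang Youxin (王友鑫), Wang Yuting (王于婷), Hou Enguang (侯恩广)*: School of Rail Transportation, Shandong Jiaotong University, Jinan, Shandong; Xu Yanliang (徐衍亮): School of Electrical Engineering, Shandong University, Jinan, Shandong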
Keywords: UAV; Multi-Agent Deep Deterministic Policy Gradient (MADDPG); Hierarchical Strategy; World Model; Contrastive Learning
Abstract: Multi-UAV path planning in 3D environments suffers from low sample efficiency, difficult long-horizon decision-making, and insufficient robustness. To address these challenges, this paper proposes HWC-MADDPG, a multi-agent deep deterministic policy gradient framework enhanced with hierarchical policies and a world model. First, a contrastive learning mechanism extracts temporally consistent, robust state representations from high-dimensional observations, making the representations more discriminative. Second, a hierarchical multi-agent policy network decomposes the path planning task: a high-level policy network plans macro-level intent and a low-level policy network executes specific actions, which improves decision-making capability. Finally, a shared world model generates imagined rewards through forward-looking rollouts; these rewards refine the Critic network's value estimates and improve decision foresight and convergence speed. Experimental results show that the proposed algorithm outperforms the conventional Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm in learning speed, policy stability, and flight safety. The work offers a more efficient solution to multi-agent path planning in 3D environments and has some theoretical value and application potential.
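The abstract does not specify the contrastive objective, so the following is only a minimal sketch of one common choice, an InfoNCE loss, written in plain Python. It assumes (this is not stated in the paper) that each anchor embedding is paired with the embedding of a temporally adjacent observation, and that the other positives in the batch act as negatives. The function name `info_nce_loss` and the `temperature` parameter are hypothetical and chosen for illustration.

```python
import math


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch of embedding pairs.

    Assumption (not from the paper): anchors[i] and positives[i] embed
    temporally adjacent observations; positives[j] with j != i are negatives.
    """
    def normalize(vec):
        # Scale to unit length so dot products become cosine similarities.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Similarity of anchor i to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature
                  for cand in p]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with index i as the correct "class".
        total += log_z - logits[i]
    return total / len(a)


# Toy check: correctly paired embeddings give a lower loss than swapped ones.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Under this assumption, minimizing the loss pulls embeddings of consecutive observations together and pushes unrelated ones apart, which is one way to obtain the temporally consistent representations the abstract describes. In a real training loop the embeddings would come from a learned encoder and the loss would be computed with an automatic differentiation library.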
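Cite this article: Zhang Huadong, Wang Youxin, Wang Yuting, Xu Yanliang, Hou Enguang. Multi-Agent Deep Deterministic Policy Gradient Algorithm Based on Hierarchical Strategies and World Models [J]. Computer Science and Application, 2026, 16(1): 102-114. https://doi.org/10.12677/csa.2026.161009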
