Mean Field Multi-Agent Reinforcement Learning Algorithm Based on Denoising Diffusion Probability Models
DOI: 10.12677/sea.2024.135072. Supported by the National Natural Science Foundation of China.
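Authors: Guoqiang Shan (单国强): Portland Institute, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu; Feiyang Miao (缪霏阳), Ziyin Zhang (张子胤), Dapeng Li (李大鹏): School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu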
Keywords: Multi-Agent Reinforcement Learning, DDPM, Mean-Field Control, Policy Learning
Abstract: The mean-field-based multi-agent reinforcement learning algorithm M3-UCRL has two weaknesses: its environment dynamics model predicts the next state inaccurately, and too few samples are available for policy learning. To address both, this paper exploits the data generation capability of denoising diffusion probabilistic models (DDPM) and proposes a DDPM-based mean-field multi-agent reinforcement learning algorithm (DDPM-M3RL). The algorithm formulates the generation of the environment model as a denoising problem. Using DDPM improves the accuracy of the environment model's next-state prediction and supplies sufficient sample data for subsequent policy learning, which speeds up the convergence of the policy model. Experimental results show that the algorithm effectively improves the accuracy of the environment dynamics model's next-state prediction, and that the state transition data generated by this model provides sufficient samples for policy learning, which effectively improves the performance and stability of the navigation policy.
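To make the phrase "formulates the generation of the environment model as a denoising problem" concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard DDPM closed-form forward process with a linear noise schedule, treats the true next state as the clean sample, and assumes the denoiser is conditioned on the current state, the action, and a vector summary of the mean field. All function names, dimensions, and schedule values are hypothetical.

```python
import numpy as np


def make_schedule(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed) and cumulative products alpha_bar_t."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars


def noise_next_state(next_state, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form; x_0 is the true next state."""
    eps = rng.standard_normal(next_state.shape)
    a = alpha_bars[t]
    noisy = np.sqrt(a) * next_state + np.sqrt(1.0 - a) * eps
    return noisy, eps


def recover_next_state(noisy, eps_hat, t, alpha_bars):
    """Invert the forward step given a noise estimate eps_hat."""
    a = alpha_bars[t]
    return (noisy - np.sqrt(1.0 - a) * eps_hat) / np.sqrt(a)


def denoiser_input(noisy, t, state, action, mean_field, num_steps):
    """Hypothetical conditioning vector for a noise-prediction network."""
    return np.concatenate([noisy, [t / num_steps], state, action, mean_field])


rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()

# Toy 2-D transition; every value here is illustrative only.
state = np.array([0.0, 0.0])
action = np.array([0.1, -0.2])
mean_field = np.array([0.5, 0.5])   # assumed vector summary of the population
next_state = np.array([0.1, -0.2])

t = 50
noisy, eps = noise_next_state(next_state, t, alpha_bars, rng)
x = denoiser_input(noisy, t, state, action, mean_field, len(betas))

# A trained network would predict eps from x; the regression target is eps.
# With the exact noise, the clean next state is recovered.
recovered = recover_next_state(noisy, eps, t, alpha_bars)
```

In a setup like this, a network trained to predict `eps` from `x` could be sampled repeatedly to generate synthetic transitions for policy learning, which is the role the abstract assigns to the learned environment model.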
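Cite this article: Shan, G., Miao, F., Zhang, Z. and Li, D. (2024) Mean Field Multi-Agent Reinforcement Learning Algorithm Based on Denoising Diffusion Probability Models. Software Engineering and Applications, 13(5), 704-719. https://doi.org/10.12677/sea.2024.135072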
