Multi-Agent Reinforcement Learning Method Based on Value Distribution
Abstract: Multi-agent reinforcement learning (MARL) has become an active research area in artificial intelligence, driven by advances in deep learning and by sustained algorithmic research. Multi-agent systems are particularly well suited to complex decision-making problems and environments. This paper presents a MARL algorithm based on value distributions that aims to improve the performance and stability of multi-agent cooperation through changes to the algorithm's structure and learning mechanism. We first analyze the concept of the value distribution in reinforcement learning and discuss the challenges and potential benefits of applying it to multi-agent systems. We then propose the CvM-MIX algorithm, which combines distributional reinforcement learning with value decomposition to better handle environmental stochasticity, and which adopts an improved experience replay mechanism based on weighted priorities to further improve learning. Experiments on the StarCraft II Multi-Agent Challenge (SMAC) platform compare CvM-MIX with conventional algorithms. The results show that CvM-MIX converges faster and achieves higher win rates across a range of battle scenarios, and its advantage is most pronounced in complex scenarios.
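Cite this article: Han Mingzhi, Li Ning, Wang Chao. Multi-Agent Reinforcement Learning Method Based on Value Distribution [J]. Computer Science and Application, 2024, 14(4): 201-212. https://doi.org/10.12677/csa.2024.144090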
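
The abstract does not spell out how the improved priority-based experience replay works, so the following is only a minimal sketch of standard proportional prioritized replay, the kind of baseline such mechanisms typically build on. It is not the CvM-MIX mechanism itself. The class name `PrioritizedReplay` and the parameters `alpha`, `beta`, and `eps` are illustrative assumptions and do not come from the paper.

```python
import random


class PrioritizedReplay:
    """Minimal proportional prioritized replay buffer (illustrative only)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha  # assumed exponent: how strongly priorities skew sampling
        self.eps = eps      # assumed constant: keeps every priority strictly positive
        self.items = []
        self.priorities = []
        self.pos = 0        # next slot to overwrite once the buffer is full

    def add(self, transition):
        # New transitions receive the current maximum priority so that
        # they are likely to be sampled before their error is known.
        p = max(self.priorities, default=1.0)
        if len(self.items) < self.capacity:
            self.items.append(transition)
            self.priorities.append(p)
        else:
            self.items[self.pos] = transition
            self.priorities[self.pos] = p
        self.pos = (self.pos + 1) % self.capacity

    def probabilities(self):
        # P(i) = p_i^alpha / sum_k p_k^alpha
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        return [s / total for s in scaled]

    def sample(self, batch_size, beta=0.4, rng=random):
        probs = self.probabilities()
        n = len(self.items)
        idx = rng.choices(range(n), weights=probs, k=batch_size)
        # Importance-sampling weights offset the bias of non-uniform
        # sampling; they are normalized so the largest weight is 1.
        weights = [(n * probs[i]) ** (-beta) for i in idx]
        max_w = max(weights)
        weights = [w / max_w for w in weights]
        return idx, [self.items[i] for i in idx], weights

    def update(self, indices, td_errors):
        # After a learning step, priorities follow the absolute TD error.
        for i, err in zip(indices, td_errors):
            self.priorities[i] = abs(err) + self.eps
```

In a training loop, the returned weights would scale each sampled transition's loss, and `update` would be called with the new errors for the sampled indices. A weight-based variant like the one the abstract mentions would presumably change how the priorities are computed, but the abstract gives no detail on that.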
