A Policy-Inertia-Regularized Multi-Agent Actor-Critic Algorithm for Zero-Sum Games
Abstract: As multi-agent systems are deployed ever more widely in complex physical environments, coping with the dynamic non-stationarity of multi-agent stochastic games and the absence of physical safety constraints has become a key open challenge for reinforcement learning. Traditional algorithms are prone to policy oscillation in stochastic adversarial games, and their trial-and-error mechanism cannot guarantee absolute system safety. To address these shortcomings, and in particular the Nash-equilibrium convergence problem in stochastic games, this paper proposes a multi-agent Actor-Critic algorithm based on policy inertia regularization and builds the corresponding game-dynamics model. By introducing a Euclidean-distance policy-anchor penalty into the optimization objective, the method reshapes the optimization landscape of the multi-agent game. A spectral analysis of the differential dynamics shows that the regularizer shifts the real parts of the eigenvalues of the system's Jacobian in the negative direction, turning the originally unstable Nash equilibrium into a locally asymptotically stable one and thereby suppressing the high-frequency gradient jitter and policy cycling characteristic of stochastic games. In numerical comparison experiments on a "slippery grid world", the mechanism suppressed high-frequency gradient jitter and policy degradation, enabling the agent to learn time-optimal paths under strong random disturbances and confirming the robust convergence of the improved algorithm in non-stationary environments.
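To make the mechanism concrete, here is a minimal formal sketch under assumed notation (the symbols \theta_i, \bar{\theta}_i, \lambda, F, and A are introduced for illustration; the abstract itself defines none). Let J_i(\theta_i) denote agent i's original objective and \bar{\theta}_i its policy anchor. The policy-inertia regularizer adds a squared-Euclidean penalty to the objective:

\[ \tilde{J}_i(\theta_i) = J_i(\theta_i) - \frac{\lambda}{2}\,\lVert \theta_i - \bar{\theta}_i \rVert_2^2, \qquad \lambda > 0. \]

Stacking all agents' gradient-ascent updates into continuous-time dynamics gives

\[ \dot{\theta} = F(\theta) - \lambda\,(\theta - \bar{\theta}), \]

and, holding the anchor fixed near a Nash equilibrium \theta^{*}, the Jacobian of the regularized system is

\[ \tilde{A} = \nabla F(\theta^{*}) - \lambda I = A - \lambda I. \]

Every eigenvalue \mu of A is therefore mapped to \mu - \lambda, i.e. \operatorname{Re}(\mu) shifts to \operatorname{Re}(\mu) - \lambda: the whole spectrum moves left by \lambda, and any \lambda > \max_j \operatorname{Re}(\mu_j) turns an unstable equilibrium into a locally asymptotically stable one, which is exactly the negative spectral shift described above.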
Article Citation: Chen, L. and Gao, H. (2026) A Policy-Inertia-Regularized Multi-Agent Actor-Critic Algorithm for Zero-Sum Games. Pure Mathematics, 16(3), 191-204. https://doi.org/10.12677/pm.2026.163082
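As a companion to the formal sketch above, the following is a minimal, hypothetical PyTorch implementation of the policy-inertia term in an actor update. The function names, the Polyak-averaged anchor, and the coefficient lam are assumptions made for illustration, not the authors' code.

import torch

def inertia_regularized_actor_loss(log_probs, advantages, actor, anchor, lam=0.1):
    # Standard policy-gradient surrogate: maximize advantage-weighted log-probs.
    pg_loss = -(log_probs * advantages.detach()).mean()
    # Squared Euclidean distance between the current policy parameters
    # and the anchor copy (the "policy anchor" penalty).
    penalty = sum(((p - a.detach()) ** 2).sum()
                  for p, a in zip(actor.parameters(), anchor.parameters()))
    # The anchor term damps per-step policy movement, i.e. adds inertia.
    return pg_loss + 0.5 * lam * penalty

@torch.no_grad()
def update_anchor(actor, anchor, tau=0.01):
    # One assumed anchor choice: a slowly tracking (Polyak-averaged)
    # copy of the actor's parameters.
    for p, a in zip(actor.parameters(), anchor.parameters()):
        a.mul_(1.0 - tau).add_(tau * p)

With tau = 0 the anchor is frozen and the penalty reduces to a proximal term around the initial policy; a small positive tau lets the anchor drift slowly, trading stability against plasticity.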
