Singular Perturbation-Based Reinforcement Learning for Time-Varying Linear Quadratic Zero-Sum Games
Abstract: This paper studies linear quadratic zero-sum games for time-varying systems. Unlike earlier methods that rely on a system model, it proposes a model-free reinforcement learning algorithm for finding the Nash equilibrium solution. First, singular perturbation theory is used to transform the time-varying dynamic game into two game problems for time-invariant systems. A model-free reinforcement learning algorithm then determines the Nash equilibria of these two time-invariant systems, which yields an approximate Nash equilibrium solution for the original time-varying system. The proposed algorithm framework offers a new line of research for reinforcement-learning-based robust control of time-varying systems and for resilient control of cyber-physical systems.
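To make the time-invariant subproblem concrete, the following is a minimal sketch of how the Nash equilibrium of a single time-invariant linear quadratic zero-sum game can be computed. It is not the paper's algorithm: it is model-based (it uses the system matrices directly), whereas the paper's method is model-free, and it does not perform the singular perturbation decomposition. It assumes dynamics dx/dt = Ax + Bu + Dw with cost ∫(xᵀQx + uᵀRu − γ²wᵀw)dt, and solves the associated game algebraic Riccati equation with a known recursive scheme in which each step solves a standard Riccati equation. The matrices, the attenuation level `gamma`, and all function names are hypothetical values chosen for illustration.

```python
import numpy as np
from scipy.linalg import solve_continuous_are


def gare_residual(P, A, B, D, Q, R, gamma):
    """Residual of the game algebraic Riccati equation (GARE):
    A'P + PA + Q - P B R^{-1} B' P + gamma^{-2} P D D' P."""
    S_u = B @ np.linalg.solve(R, B.T)   # controller (minimizer) term
    S_w = D @ D.T / gamma**2            # disturbance (maximizer) term
    F = A.T @ P + P @ A + Q - P @ (S_u - S_w) @ P
    return 0.5 * (F + F.T)              # enforce exact symmetry


def solve_zero_sum_game(A, B, D, Q, R, gamma, tol=1e-10, max_iter=50):
    """Model-based recursion: start from P = 0 and repeatedly add the
    stabilizing solution Z of a standard ARE whose constant term is the
    current GARE residual."""
    n = A.shape[0]
    S_u = B @ np.linalg.solve(R, B.T)
    S_w = D @ D.T / gamma**2
    P = np.zeros((n, n))
    for _ in range(max_iter):
        F = gare_residual(P, A, B, D, Q, R, gamma)
        if np.linalg.norm(F) < tol:
            break
        A_k = A - (S_u - S_w) @ P       # dynamics under the current policies
        Z = solve_continuous_are(A_k, B, F, R)
        P = P + Z
    K = np.linalg.solve(R, B.T @ P)     # minimizer policy: u = -K x
    L = D.T @ P / gamma**2              # maximizer policy: w = L x
    return P, K, L


# Hypothetical two-state example (values are illustrative only).
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
D = np.array([[1.0], [0.0]])
Q = np.eye(2)
R = np.eye(1)
gamma = 5.0

P, K, L = solve_zero_sum_game(A, B, D, Q, R, gamma)
print("P =\n", P)
print("K =", K, " L =", L)
```

In a model-free setting, the Riccati solves above would be replaced by estimates built from measured state and input trajectories; this sketch only shows the fixed point that such a learner would be expected to approach for one time-invariant game.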
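Cite this article: Liu Mingxiang. Singular Perturbation-Based Reinforcement Learning for Time-Varying Linear Quadratic Zero-Sum Games [J]. Artificial Intelligence and Robotics Research, 2023, 12(4): 373-382. https://doi.org/10.12677/AIRR.2023.124040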
