Eligibility Trace Method for Stochastic Linear Quadratic Control

doi:10.12677/PM.2024.141041

Author: Zhu Yanan (朱亚楠), College of Science, University of Shanghai for Science and Technology, Shanghai
Keywords: Linear Quadratic Optimal Control; Gradient Descent; Eligibility Traces

Abstract: This paper studies the application of reinforcement learning methods to the linear quadratic regulator (LQR) problem. The usual approach to LQR obtains the optimal control by solving the algebraic Riccati equation and does not optimize the control gain directly. This paper instead optimizes the control gain matrix directly, introducing the eligibility trace method on top of the policy gradient (gradient descent) algorithm. In the setting of a finite time horizon and Gaussian noise, global convergence guarantees are given for the algorithm both when the model parameters are known and when they are unknown. When the parameters are unknown, zeroth-order optimization is used to approximate the gradient term, which can extend the problem to cases where the cost function is nonconvex. Numerical simulations show that the eligibility trace method converges faster and with smaller variance than the gradient descent algorithm.
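The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm; it rests on several simplifying assumptions. The system is scalar, the policy is a single time-invariant gain `k` with `u_t = -k * x_t`, and the exact expected cost stands in for the cost oracle (in a model-free setting this would be replaced by averaged rollout costs). The gradient is approximated by a two-point zeroth-order difference, and the "trace" is taken to be an exponentially decayed accumulation of past gradient estimates, which is an assumed form. All function names and parameter values are hypothetical.

```python
def expected_cost(k, a=0.5, b=1.0, q=1.0, r=1.0,
                  sigma2=0.1, x0_var=1.0, horizon=5):
    """Exact expected finite-horizon cost of u_t = -k * x_t for the scalar
    system x_{t+1} = a*x_t + b*u_t + w_t, with zero-mean x_0 and w_t
    (variances x0_var and sigma2). All parameter values are illustrative."""
    closed_loop = a - b * k
    second_moment = x0_var          # E[x_t^2], starting at t = 0
    cost = 0.0
    for _ in range(horizon):
        # Stage cost E[q*x^2 + r*u^2] = (q + r*k^2) * E[x^2]
        cost += (q + r * k * k) * second_moment
        second_moment = closed_loop ** 2 * second_moment + sigma2
    return cost + q * second_moment  # terminal cost q * E[x_T^2]


def two_point_gradient(cost, k, delta=1e-3):
    """Zeroth-order (two-point) gradient estimate using only cost queries."""
    return (cost(k + delta) - cost(k - delta)) / (2.0 * delta)


def trace_descent(cost, k0, step=0.02, decay=0.5, iters=200):
    """Gradient descent on the gain, driven by a decayed trace of
    gradient estimates (assumed form of the eligibility trace)."""
    k, trace = k0, 0.0
    for _ in range(iters):
        grad = two_point_gradient(cost, k)
        trace = decay * trace + grad   # accumulate with exponential decay
        k -= step * trace              # update the gain along the trace
    return k


k_final = trace_descent(expected_cost, k0=0.0)
print("gain:", k_final, "cost:", expected_cost(k_final))
```

Setting `decay=0.0` reduces the update to plain gradient descent on the gain, which gives a simple baseline for comparing the two methods on the same cost.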

Cite this article: Zhu Yanan. Eligibility Trace Method for Stochastic Linear Quadratic Control [J]. Pure Mathematics, 2024, 14(1): 416-432. https://doi.org/10.12677/PM.2024.141041
