Singular Perturbation-Based Reinforcement Learning for Time-Varying Linear Quadratic Zero-Sum Games
DOI: 10.12677/AIRR.2023.124040

Abstract: This paper addresses linear quadratic zero-sum games for time-varying dynamic systems. In contrast to previous methods that rely heavily on system models, it proposes a model-free reinforcement learning algorithm for computing Nash equilibrium solutions. First, singular perturbation theory is used to transform the time-varying dynamic game into two separate time-invariant dynamic games. Then, a model-free reinforcement learning algorithm finds the Nash equilibria of these two time-invariant games, which together approximate the Nash equilibrium solution of the original time-varying system. The proposed framework offers a new perspective on robust control of time-varying systems and on resilient control of cyber-physical systems using reinforcement learning.

1. Introduction

2. Problem Formulation

$\frac{\text{d}x}{\text{d}t}=A\left(t\right)x\left(t\right)+B\left(t\right)u\left(t\right)+D\left(t\right)w\left(t\right)$ (1)

$J={\int }_{0}^{T}\left({x}^{\text{T}}\left(t\right)Q\left(t\right)x\left(t\right)+{u}^{\text{T}}\left(t\right){R}^{u}\left(t\right)u\left(t\right)-{w}^{\text{T}}\left(t\right){R}^{w}\left(t\right)w\left(t\right)\right)\text{d}t$ (2)

$J\left(x\left(0\right),{u}^{*},w\right)\le J\left(x\left(0\right),{u}^{*},{w}^{*}\right)\le J\left(x\left(0\right),u,{w}^{*}\right)$ (3)

$\begin{array}{c}H={x}^{\text{T}}\left(t\right)Q\left(t\right)x\left(t\right)+{u}^{\text{T}}\left(t\right){R}^{u}\left(t\right)u\left(t\right)-{w}^{\text{T}}\left(t\right){R}^{w}\left(t\right)w\left(t\right)\\ \quad +{\lambda }^{\text{T}}\left(t\right)\left(A\left(t\right)x\left(t\right)+B\left(t\right)u\left(t\right)+D\left(t\right)w\left(t\right)\right)\end{array}$ (4)

$\dot{\lambda }\left(t\right)=-{\nabla }_{x}H=-Q\left(t\right)x\left(t\right)-{A}^{\text{T}}\left(t\right)\lambda \left(t\right)$ (5)

$\left\{\begin{array}{l}u\left(t\right)=-{\left({R}^{u}\left(t\right)\right)}^{-1}{B}^{\text{T}}\left(t\right)\lambda \left(t\right)\\ w\left(t\right)={\left({R}^{w}\left(t\right)\right)}^{-1}{D}^{\text{T}}\left(t\right)\lambda \left(t\right)\end{array}\right.$ (6)
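As a worked step, the feedback laws in (6) can be recovered from the stationarity of the Hamiltonian (4) with respect to $u$ and $w$, assuming ${R}^{u}\left(t\right)$ and ${R}^{w}\left(t\right)$ are symmetric and invertible:

$\nabla_{u}H=2{R}^{u}\left(t\right)u\left(t\right)+{B}^{\text{T}}\left(t\right)\lambda \left(t\right)=0,\qquad \nabla_{w}H=-2{R}^{w}\left(t\right)w\left(t\right)+{D}^{\text{T}}\left(t\right)\lambda \left(t\right)=0$

Solving these gives $u=-\frac{1}{2}{\left({R}^{u}\right)}^{-1}{B}^{\text{T}}\lambda$ and $w=\frac{1}{2}{\left({R}^{w}\right)}^{-1}{D}^{\text{T}}\lambda$. The factor $\frac{1}{2}$ disappears if the costate is rescaled to $\lambda /2$ (equivalently, if the running cost in (4) carries a factor $\frac{1}{2}$), which appears to be the normalization used in (5) and (6).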

3. Singular Perturbation Design

$\tau =\frac{t}{T}$ (7)

$\epsilon =\frac{1}{T}$ (8)

$\epsilon \left[\begin{array}{c}\frac{dx}{d\tau }\\ \frac{d\lambda }{d\tau }\end{array}\right]=\left[\begin{array}{cc}A\left(\tau \right)& 0\\ -Q\left(\tau \right)& -{A}^{T}\left(\tau \right)\end{array}\right]\left[\begin{array}{c}x\\ \lambda \end{array}\right]+\left[\begin{array}{c}B\left(\tau \right)\\ 0\end{array}\right]u+\left[\begin{array}{c}D\left(\tau \right)\\ 0\end{array}\right]w$ (9)

$J=T{\int }_{0}^{1}\left({x}^{\text{T}}\left(\tau \right)Q\left(\tau \right)x\left(\tau \right)+{u}^{\text{T}}\left(\tau \right){R}^{u}\left(\tau \right)u\left(\tau \right)-{w}^{\text{T}}\left(\tau \right){R}^{w}\left(\tau \right)w\left(\tau \right)\right)\text{d}\tau$ (10)
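A short worked step, using only the definitions (7) and (8), shows where the factor $\epsilon$ in (9) and the factor $T$ in (10) come from:

$\frac{\text{d}x}{\text{d}t}=\frac{\text{d}x}{\text{d}\tau }\frac{\text{d}\tau }{\text{d}t}=\frac{1}{T}\frac{\text{d}x}{\text{d}\tau }=\epsilon \frac{\text{d}x}{\text{d}\tau },\qquad \text{d}t=T\,\text{d}\tau$

The horizon $t\in \left[0,T\right]$ therefore maps to $\tau \in \left[0,1\right]$, and a long horizon $T$ corresponds to a small perturbation parameter $\epsilon$.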

$\left\{\begin{array}{l}u\left(\tau \right)=-{\left({R}^{u}\left(\tau \right)\right)}^{-1}{B}^{\text{T}}\left(\tau \right)\lambda \left(\tau \right)\\ w\left(\tau \right)={\left({R}^{w}\left(\tau \right)\right)}^{-1}{D}^{\text{T}}\left(\tau \right)\lambda \left(\tau \right)\end{array}\right.$ (11)

${H}_{M}=\left[\begin{array}{cc}A\left(\tau \right)& -B\left(\tau \right){\left({R}^{u}\left(\tau \right)\right)}^{-1}{B}^{\text{T}}\left(\tau \right)+D\left(\tau \right){\left({R}^{w}\left(\tau \right)\right)}^{-1}{D}^{\text{T}}\left(\tau \right)\\ -Q\left(\tau \right)& -{A}^{\text{T}}\left(\tau \right)\end{array}\right]$ (12)

$\left[\begin{array}{c}x\\ \lambda \end{array}\right]=\left[\begin{array}{cc}I& I\\ {P}_{a}\left(\tau ,\epsilon \right)& {P}_{b}\left(\tau ,\epsilon \right)\end{array}\right]\left[\begin{array}{c}{x}_{a}\\ {x}_{b}\end{array}\right]$ (13)

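As shown in [Lemma 2.3, [31]], under the assumptions of this paper, the matrix in (12) is nonsingular for sufficiently small $\epsilon$. Therefore, combined with (12), system (9) can be transformed into the following singularly perturbed system: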

$\epsilon \frac{\text{d}{x}_{a}}{\text{d}\tau }=A\left(\tau \right){x}_{a}+B\left(\tau \right){u}_{a}+D\left(\tau \right){w}_{a}$ (14)

$\epsilon \frac{\text{d}{x}_{b}}{\text{d}\tau }=A\left(\tau \right){x}_{b}+B\left(\tau \right){u}_{b}+D\left(\tau \right){w}_{b}$ (15)

${u}_{a}=-{\left({R}^{u}\left(\tau \right)\right)}^{-1}{B}^{\text{T}}\left(\tau \right){P}_{a}\left(\tau ,\epsilon \right){x}_{a}$ (16)

${u}_{b}=-{\left({R}^{u}\left(\tau \right)\right)}^{-1}{B}^{\text{T}}\left(\tau \right){P}_{b}\left(\tau ,\epsilon \right){x}_{b}$ (17)

${w}_{a}={\left({R}^{w}\left(\tau \right)\right)}^{-1}{D}^{\text{T}}\left(\tau \right){P}_{a}\left(\tau ,\epsilon \right){x}_{a}$ (18)

${w}_{b}={\left({R}^{w}\left(\tau \right)\right)}^{-1}{D}^{\text{T}}\left(\tau \right){P}_{b}\left(\tau ,\epsilon \right){x}_{b}$ (19)

$\epsilon \dot{P}=-{A}^{\text{T}}\left(\tau \right)P-PA\left(\tau \right)-Q\left(\tau \right)+P\left(B\left(\tau \right){\left({R}^{u}\left(\tau \right)\right)}^{-1}{B}^{\text{T}}\left(\tau \right)-D\left(\tau \right){\left({R}^{w}\left(\tau \right)\right)}^{-1}{D}^{\text{T}}\left(\tau \right)\right)P$ (20)

$\gamma =\frac{\tau }{\epsilon },\beta =\frac{1-\tau }{\epsilon }$ (21)
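Substituting (7) and (8) into (21) gives a concrete reading of the two stretched variables:

$\gamma =\frac{\tau }{\epsilon }=t,\qquad \beta =\frac{1-\tau }{\epsilon }=T-t$

That is, $\gamma$ is the time elapsed since the initial instant and $\beta$ is the time remaining until the terminal instant. For example, with $T=100$ (so $\epsilon =0.01$), the point $\tau =0.05$ corresponds to $\gamma =5$ and $\beta =95$.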

$\frac{\text{d}{x}_{a}}{\text{d}\gamma }=A\left(0\right){x}_{a}+B\left(0\right){u}_{a}+D\left(0\right){w}_{a}$ (22)

${u}_{a}\left(\gamma \right)={K}_{a}{x}_{a}=-{\left({R}^{u}\left(0\right)\right)}^{-1}{B}^{\text{T}}\left(0\right){P}_{a}\left(0\right){x}_{a}\left(\gamma \right)$ (23)

${w}_{a}\left(\gamma \right)={L}_{a}{x}_{a}={\left({R}^{w}\left(0\right)\right)}^{-1}{D}^{\text{T}}\left(0\right){P}_{a}\left(0\right){x}_{a}\left(\gamma \right)$ (24)

$J\left({x}_{a},{u}_{a},{w}_{a}\right)={\int }_{0}^{\infty }{x}_{a}^{\text{T}}Q\left(0\right){x}_{a}+{u}_{a}^{\text{T}}{R}^{u}\left(0\right){u}_{a}-{w}_{a}^{\text{T}}{R}^{w}\left(0\right){w}_{a}\text{d}\gamma$ (25)
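The initial-layer game (22)-(25) is a time-invariant linear quadratic zero-sum game. When the frozen matrices are known, the matrix ${P}_{a}\left(0\right)$ in (23) and (24) could be computed from the stable invariant subspace of the associated Hamiltonian matrix. The sketch below is a model-based reference, not the paper's model-free method. It assumes the standard sign convention ${A}^{\text{T}}P+PA+Q-PSP=0$ with $S=B{\left({R}^{u}\right)}^{-1}{B}^{\text{T}}-D{\left({R}^{w}\right)}^{-1}{D}^{\text{T}}$. The function name `solve_game_are` and all numerical values are hypothetical.

```python
import numpy as np

def solve_game_are(A, B, D, Q, Ru, Rw):
    """Stabilizing solution P of A'P + PA + Q - P S P = 0,
    where S = B Ru^{-1} B' - D Rw^{-1} D' (sign convention assumed)."""
    n = A.shape[0]
    S = B @ np.linalg.solve(Ru, B.T) - D @ np.linalg.solve(Rw, D.T)
    # Hamiltonian matrix of the state/costate system.
    H = np.block([[A, -S], [-Q, -A.T]])
    eigvals, eigvecs = np.linalg.eig(H)
    # Keep the eigenvectors that span the stable invariant subspace.
    stable = eigvecs[:, eigvals.real < 0]
    if stable.shape[1] != n:
        raise ValueError("no n-dimensional stable subspace was found")
    X1, X2 = stable[:n, :], stable[n:, :]
    P = np.real(X2 @ np.linalg.inv(X1))
    return 0.5 * (P + P.T)  # symmetrize against rounding errors

# Hypothetical frozen-time matrices standing in for A(0), B(0), D(0), ...
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
D = np.array([[0.0], [0.5]])
Q = np.eye(2)
Ru = np.array([[1.0]])
Rw = np.array([[4.0]])

P_a = solve_game_are(A, B, D, Q, Ru, Rw)
K_a = -np.linalg.solve(Ru, B.T @ P_a)  # gain of the form in (23)
L_a = np.linalg.solve(Rw, D.T @ P_a)   # gain of the form in (24)

S = B @ np.linalg.solve(Ru, B.T) - D @ np.linalg.solve(Rw, D.T)
residual = A.T @ P_a + P_a @ A + Q - P_a @ S @ P_a
print("Riccati residual norm:", np.linalg.norm(residual))
```

A solution obtained this way can serve as a ground truth when checking what a data-driven iteration converges to.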

$\frac{\text{d}{x}_{b}}{\text{d}\beta }=A\left(1\right){x}_{b}+B\left(1\right){u}_{b}+D\left(1\right){w}_{b}$ (26)

${u}_{b}\left(\beta \right)={K}_{b}{x}_{b}=-{\left({R}^{u}\left(1\right)\right)}^{-1}{B}^{\text{T}}\left(1\right){P}_{b}\left(1\right){x}_{b}\left(\beta \right)$ (27)

${w}_{b}\left(\beta \right)={L}_{b}{x}_{b}={\left({R}^{w}\left(1\right)\right)}^{-1}{D}^{\text{T}}\left(1\right){P}_{b}\left(1\right){x}_{b}\left(\beta \right)$ (28)

$J\left({x}_{b},{u}_{b},{w}_{b}\right)={\int }_{0}^{\infty }{x}_{b}^{\text{T}}Q\left(1\right){x}_{b}+{u}_{b}^{\text{T}}{R}^{u}\left(1\right){u}_{b}-{w}_{b}^{\text{T}}{R}^{w}\left(1\right){w}_{b}\text{d}\beta$ (29)

$x\left(\tau \right)={x}_{a}\left(\gamma \right)+{x}_{b}\left(\beta \right)+O\left(\epsilon \right)$ (30)

$\lambda \left(\tau \right)={P}_{a}\left(0\right){x}_{a}\left(\gamma \right)+{P}_{b}\left(1\right){x}_{b}\left(\beta \right)+O\left(\epsilon \right)$ (31)

$u\left(\tau \right)={K}_{a}{x}_{a}\left(\gamma \right)+{K}_{b}{x}_{b}\left(\beta \right)+O\left(\epsilon \right)$ (32)

$w\left(\tau \right)={L}_{a}{x}_{a}\left(\gamma \right)+{L}_{b}{x}_{b}\left(\beta \right)+O\left(\epsilon \right)$ (33)

4. Reinforcement Learning Algorithm

4.1. Game Problem at the Initial Boundary

$0=-{A}^{\text{T}}\left(0\right){P}_{a}\left(0\right)-{P}_{a}\left(0\right)A\left(0\right)-Q\left(0\right)+{P}_{a}\left(0\right)\left(B\left(0\right){\left({R}^{u}\left(0\right)\right)}^{-1}{B}^{\text{T}}\left(0\right)-D\left(0\right){\left({R}^{w}\left(0\right)\right)}^{-1}{D}^{\text{T}}\left(0\right)\right){P}_{a}\left(0\right)$ (34)

$\begin{array}{c}{x}_{a}^{\text{T}}\left(t\right){P}_{a}\left(0\right){x}_{a}\left(t\right)={\int }_{t}^{t+\delta t}\left({x}_{a}^{\text{T}}Q\left(0\right){x}_{a}+{\left({u}_{a}+{e}_{1}\right)}^{\text{T}}{R}^{u}\left(0\right)\left({u}_{a}+{e}_{1}\right)-{\left({w}_{a}+{e}_{2}\right)}^{\text{T}}{R}^{w}\left(0\right)\left({w}_{a}+{e}_{2}\right)\right)\text{d}\upsilon \\ \quad +{x}_{a}^{\text{T}}\left(t+\delta t\right){P}_{a}\left(0\right){x}_{a}\left(t+\delta t\right)\end{array}$ (35)
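Equation (35) plays the role of a policy-evaluation step that uses trajectory data in place of the model. For orientation only, the following sketch shows a model-based counterpart, which is not the paper's model-free algorithm. Each iteration evaluates the current gain pair through a Lyapunov equation, then updates the gains using the forms of (23) and (24). The helper names `lyap` and `policy_iteration` and all matrices are hypothetical. Convergence of this simultaneous update holds for the example data but is not guaranteed for zero-sum games in general.

```python
import numpy as np

def lyap(Ac, M):
    """Solve Ac' P + P Ac + M = 0 through its Kronecker (vec) form."""
    n = Ac.shape[0]
    I = np.eye(n)
    lhs = np.kron(I, Ac.T) + np.kron(Ac.T, I)
    p = np.linalg.solve(lhs, -M.reshape(-1))
    return p.reshape(n, n)

def policy_iteration(A, B, D, Q, Ru, Rw, K, L, iters=20):
    """Model-based policy iteration for u = K x, w = L x."""
    for _ in range(iters):
        Ac = A + B @ K + D @ L                 # closed-loop matrix
        M = Q + K.T @ Ru @ K - L.T @ Rw @ L    # running cost under (K, L)
        P = lyap(Ac, M)                        # policy evaluation
        P = 0.5 * (P + P.T)
        K = -np.linalg.solve(Ru, B.T @ P)      # update of the form in (23)
        L = np.linalg.solve(Rw, D.T @ P)       # update of the form in (24)
    return P, K, L

# Hypothetical frozen-time matrices; A is stable, so zero gains are admissible.
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
D = np.array([[0.0], [0.5]])
Q = np.eye(2)
Ru = np.array([[1.0]])
Rw = np.array([[4.0]])

P, K, L = policy_iteration(A, B, D, Q, Ru, Rw,
                           K=np.zeros((1, 2)), L=np.zeros((1, 2)))

S = B @ np.linalg.solve(Ru, B.T) - D @ np.linalg.solve(Rw, D.T)
residual = A.T @ P + P @ A + Q - P @ S @ P
print("Riccati residual norm:", np.linalg.norm(residual))
```

The data-driven scheme differs in that the Lyapunov solve is replaced by a regression on measured state and input trajectories, so the matrices `A`, `B`, and `D` are never used directly.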

${\psi }^{\text{T}}\left[\begin{array}{c}\left[vec\left({P}_{a}^{k}\right)\right]\\ \left[vec\left({K}_{a}^{k+1}\right)\right]\\ \left[vec\left({L}_{a}^{k+1}\right)\right]\end{array}\right]=\theta$ (36)

$\psi ={\left[\begin{array}{ccc}{x}_{a}^{\text{T}}\otimes {x}_{a}^{\text{T}}& 2{\int }_{t}^{t+\delta t}{\left({x}_{a}\otimes {e}_{1}\right)}^{\text{T}}\text{d}\upsilon \left({I}_{n}\otimes {R}^{u}\left(0\right)\right)& 2{\int }_{t}^{t+\delta t}{\left({x}_{a}\otimes {e}_{2}\right)}^{\text{T}}\text{d}\upsilon \left({I}_{n}\otimes {R}^{w}\left(0\right)\right)\end{array}\right]}^{\text{T}}$

$\left.{x}_{a}^{\text{T}}{P}_{a}^{k}{x}_{a}\right|_{t}^{t+\delta t}=\left[\left.\left({x}_{a}^{\text{T}}\otimes {x}_{a}^{\text{T}}\right)\right|_{t}^{t+\delta t}\right]\left[vec\left({P}_{a}^{k}\right)\right]$
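The regression form (36) relies on rewriting a quadratic form as a linear function of the vectorized matrix, via the identity ${x}^{\text{T}}Px=\left({x}^{\text{T}}\otimes {x}^{\text{T}}\right)vec\left(P\right)$. The following minimal check illustrates this on random data. The variable names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
M = rng.standard_normal((n, n))
P = M + M.T                      # an arbitrary symmetric matrix
x_t = rng.standard_normal(n)     # stands in for the state at time t
x_next = rng.standard_normal(n)  # stands in for the state at time t + delta t

vec_P = P.reshape(-1)            # vec(P); the ordering is immaterial for symmetric P

# Quadratic form written directly and through the Kronecker product.
quad = x_t @ P @ x_t
kron_form = np.kron(x_t, x_t) @ vec_P

# The same identity applied to the increment over one interval.
increment = x_next @ P @ x_next - x_t @ P @ x_t
kron_increment = (np.kron(x_next, x_next) - np.kron(x_t, x_t)) @ vec_P

print(np.isclose(quad, kron_form), np.isclose(increment, kron_increment))
```

Because the right-hand side is linear in `vec_P`, the rows collected over many intervals can be stacked into one linear system in the unknowns.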

${\Phi }^{\text{T}}\left[\begin{array}{c}\left[vec\left({P}_{a}^{k}\right)\right]\\ \left[vec\left({K}_{a}^{k+1}\right)\right]\\ \left[vec\left({L}_{a}^{k+1}\right)\right]\end{array}\right]=\Theta$ (37)

$\left[\begin{array}{c}\left[vec\left({P}_{a}^{k}\right)\right]\\ \left[vec\left({K}_{a}^{k+1}\right)\right]\\ \left[vec\left({L}_{a}^{k+1}\right)\right]\end{array}\right]={\left(\Phi {\Phi }^{\text{T}}\right)}^{-1}\Phi \Theta$ (38)
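Formula (38) is the normal-equation solution of the stacked linear system (37), and it requires $\Phi {\Phi }^{\text{T}}$ to be invertible, that is, $\Phi$ must have full row rank. The sketch below uses synthetic data, with hypothetical dimensions and names, to compare (38) with a standard least-squares routine that solves the same problem.

```python
import numpy as np

rng = np.random.default_rng(1)
p, N = 5, 40                       # number of unknowns, number of data intervals
Phi = rng.standard_normal((p, N))  # each column plays the role of one psi
z_true = rng.standard_normal(p)    # stacked [vec(P); vec(K); vec(L)] (synthetic)
Theta = Phi.T @ z_true             # noise-free right-hand side of (37)

# Normal-equation form, as written in (38).
z_normal = np.linalg.solve(Phi @ Phi.T, Phi @ Theta)

# Equivalent least-squares solve of Phi' z = Theta.
z_lstsq, *_ = np.linalg.lstsq(Phi.T, Theta, rcond=None)

print("max difference:", np.max(np.abs(z_normal - z_lstsq)))
```

Forming $\Phi {\Phi }^{\text{T}}$ squares the condition number of the problem, so a least-squares routine applied to ${\Phi }^{\text{T}}$ directly is often the safer numerical choice when the collected data are poorly excited.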

4.2. Game Problem at the Terminal Boundary

5. Conclusion and Outlook