Reinforcement Learning-Based Consensus Tracking Control Algorithm for Multi-Agent Systems

doi:10.12677/csa.2025.154110


Author: Renzhi Liu (刘人志), School of Automation, Guangdong University of Technology, Guangzhou, Guangdong
Keywords: Reinforcement Learning; Nonlinear Multi-Agent System; Tracking Control; Dynamic Linearization

Abstract: This paper presents a novel reinforcement learning-based model-free adaptive control algorithm for discrete-time nonlinear multi-agent systems with unknown dynamics. The equivalent dynamic linearization algorithm is employed to design an optimal controller. The Q-learning strategy and the actor-critic neural network are restructured to facilitate consensus control. The proposed reinforcement learning approach adjusts the linearization parameters in real time using only input-output data. Numerical simulations validate the method's effectiveness.

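Cite this article: Liu, R. (2025) Reinforcement Learning-Based Consensus Tracking Control Algorithm for Multi-Agent Systems. Computer Science and Application, 15(4), 374-381. https://doi.org/10.12677/csa.2025.154110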

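1. Introduction

In recent years, with advances in computing power, neural networks have shown great potential for tracking control of complex nonlinear systems. Reinforcement learning (RL) is a branch of machine learning in which an agent is trained to make decisions by interacting with its environment. The agent learns by trial and error and optimizes its decision policy according to the rewards or penalties it receives. Deep learning has enabled RL to handle complex, high-dimensional input spaces [1], which greatly broadens its applicability. As RL has matured and become more effective, a growing number of control methods have been built on RL algorithms. In [2], an RL method evolves a data-driven optimal controller through a tracking-error transformation; in [3], RL is integrated into iterative learning control.

Model-free adaptive control (MFAC) is a control strategy that operates without an accurate mathematical model, which makes it well suited to systems whose dynamics are complex or unknown. MFAC adjusts the control strategy according to current system performance, and this adaptivity makes it suitable for the nonlinearities and uncertainties found in many practical applications. In [4], an event-triggered MFAC method is used to guarantee tracking performance under sensor faults and denial-of-service (DoS) attacks. In [5], MFAC solves the heading control problem of an unmanned surface vehicle and copes with system uncertainties. In addition, [6] combines MFAC with neural networks to achieve more precise control.

In multi-agent systems (MASs), control problems become considerably more complex. A MAS consists of multiple agents that cooperate in decision making and evolve over time, exhibiting complex nonlinear behavior [7] [8]. Consensus control [9] [10] is a central topic in MAS research; it aims to drive all agents to a common objective despite individual differences and external uncertainties. Ensuring coordinated behavior is a major challenge and is constrained by the nonlinear characteristics of the system. In [11] and in [12] [13], aperiodic sampled-data and event-triggered methods, respectively, achieve effective consensus control of complex nonlinear systems.

Combining RL with MFAC offers a novel and effective approach to controlling MASs. The combination draws on the strengths of both methods to provide a solution for consensus control of discrete-time nonlinear MASs. Based on the above discussion, the main contributions of this paper are summarized as follows. This paper studies the integration of RL and MFAC in MASs and redesigns the Q-learning strategy and the actor-critic neural network to make them better suited to MASs. With this redesign, the methods can approximate the optimal dynamic linearization parameters more accurately. Combined with MFAC, the proposed algorithm shows strong capability in consensus control of MASs. In addition, the algorithm enhances adaptability and robustness to environmental changes and speeds up convergence. Simulation experiments further verify its practicality.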

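2. Controller Design

In a multi-agent system, the network structure is represented by $G = (V, ℰ, A)$, where $V = \{1, 2, \dots, N\}$ denotes the set of agents. The set $ℰ \subseteq V \times V$ describes the possible communication links between agents. The adjacency matrix $A$ has elements $a_{i j}$, where $a_{i j} = 1$ indicates a direct communication link from agent j to agent i, and $a_{i j} = 0$ indicates no direct communication. In addition, $d_{i} = 1$ when agent i receives data from the leader; otherwise $d_{i} = 0$. The diagonal matrix $D$ is defined as $\mathrm{diag} (d_{1}, d_{2}, \dots, d_{N})$. The neighbor set of agent i is $N_{i} = \{j \in V \mid (j, i) \in ℰ\}$. The in-degree of agent i is defined as $d_{i}^{in} = \sum_{j = 1}^{N} a_{i j}$. Finally, the Laplacian matrix $ℒ$ is given by $ℒ = L - A$, where $L = \mathrm{diag} (d_{1}^{in}, d_{2}^{in}, \dots, d_{N}^{in})$.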
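The graph quantities defined above can be built directly from a list of directed links. The following is a minimal sketch in plain Python; the function name `graph_matrices` and the four-agent edge list are hypothetical and are not the topology of Figure 2.

```python
def graph_matrices(edges, leader_links, n):
    """Build A, the in-degrees, the Laplacian (L - A) and D for n agents.

    edges: iterable of (j, i) pairs meaning "agent j sends to agent i" (1-based).
    leader_links: set of agents that receive the leader's data (1-based).
    """
    A = [[0] * n for _ in range(n)]
    for j, i in edges:
        A[i - 1][j - 1] = 1  # a_ij = 1: direct link from agent j to agent i
    in_deg = [sum(row) for row in A]  # d_i^in = sum_j a_ij
    lap = [[(in_deg[i] if i == k else 0) - A[i][k] for k in range(n)]
           for i in range(n)]
    D = [[1 if (i == k and (i + 1) in leader_links) else 0 for k in range(n)]
         for i in range(n)]
    return A, in_deg, lap, D


# Hypothetical topology, chosen only for illustration.
edges = [(1, 2), (2, 3), (3, 4), (1, 4)]
A, in_deg, lap, D = graph_matrices(edges, leader_links={1}, n=4)
print(in_deg)  # in-degree of each agent
print(lap)     # every row of the Laplacian sums to zero
```

Each row of the Laplacian sums to zero by construction, which is a quick sanity check on any adjacency matrix entered by hand.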

The dynamics of a MAS consisting of N agents can be expressed as:

$y_{i} (t + 1) = f_{i} (y_{i} (t), y_{i} (t - 1), \dots, y_{i} (t - n_{y}), u_{i} (t), u_{i} (t - 1), \dots, u_{i} (t - n_{u}))$ (1)

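where $i \in V$ and $f_{i} (\cdot)$ is an unknown nonlinear function. The variable $y_{i} (t) \in ℝ$ is the output of agent i at discrete time $t \in ℤ^{+}$, and $u_{i} (t) \in ℝ$ is the corresponding input. The constants $n_{y}$ and $n_{u}$ are the unknown orders of the output and input, respectively.

According to [14], an ideal nonlinear controller for system (1), if one exists, can be written in the following form: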

$u_{i} (t) = C_{i} (e_{i} (t + 1), \dots, e_{i} (t - n_{e}), u_{i} (t - 1), \dots, u_{i} (t - n_{c}))$ (2)

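where $C_{i} (\cdot)$ is an unknown nonlinear function for agent i, $e_{i} (t) = y_{d} (t) - y_{i} (t)$ is the tracking error of agent i, $y_{d} (t)$ is the ideal system output, and $n_{e}, n_{c} \in ℤ^{+}$ are unknown orders.

The system is assumed to satisfy the following conditions:

Assumption 1: For any time $t \in ℤ^{+}$, the partial derivatives of the controller $C_{i} (\cdot)$ with respect to the error inputs $e_{i} (t + 1), e_{i} (t - 1), \dots, e_{i} (t - n_{e} + 2)$ and the control inputs $u_{i} (t), u_{i} (t - 1), \dots, u_{i} (t - n_{c})$ are continuous, where $n_{e}, n_{c}$ are preset constants.

Assumption 2: Controller (2) satisfies the generalized Lipschitz condition, that is: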

$| Δ u_{i} (t) | \leq L ‖ ζ_{i} (t) ‖$ (3)

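where $L > 0$ is the Lipschitz constant, $ζ_{i} (t) = {[Δ e_{i} (t + 1), \dots, Δ e_{i} (t - n_{e} + 2), Δ u_{i} (t - 1), \dots, Δ u_{i} (t - n_{c})]}^{T}$, and $Δ u_{i} (t) = u_{i} (t) - u_{i} (t - 1)$.

Assumption 3: There exist ideal controller parameters such that $e_{i} (t + 1) = 0$ holds.

Assumption 4: The output of the virtual leader (indexed 0), that is, the ideal trajectory $y_{d} (t)$, is known to at least one follower agent, and that agent's information can propagate to all agents along directed paths of the topology.

Lemma 1: For agent i of a system satisfying Assumptions 1-3, according to [14], the ideal controller (2) can be transformed into: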

$Δ u_{i} (t) = {\bar{ζ}}_{i}^{T} (t) ψ_{i} (t)$ (4)

where ${\bar{ζ}}_{i} (t) = {[- e_{i} (t), \dots, Δ e_{i} (t - n_{e} + 2), Δ u_{i} (t - 1), \dots, Δ u_{i} (t - n_{c})]}^{T}$ and $ψ_{i} (t) = {[ψ_{i, 1} (t), \dots, ψ_{i, n_{e} + n_{c}} (t)]}^{T}$ is a bounded parameter vector.

Lemma 2: Applying the equivalent dynamic linearization (EDL) technique [15] to (1) yields an equivalent data model of system (1), expressed in partial form dynamic linearization (PFDL) form:

$y_{i} (t + 1) = y_{i} (t) + v_{i}^{T} (t) ϕ_{i} (t)$ (5)

where $v_{i}^{T} (t) = [Δ u_{i} (t), \dots, Δ u_{i} (t - n_{u} + 1), Δ y_{i} (t), \dots, Δ y_{i} (t - n_{y} + 1)]$ and $ϕ_{i} (t) = {[ϕ_{i, 1} (t), \dots, ϕ_{i, n_{u} + n_{y}} (t)]}^{T}$.

The data-model parameter $ϕ_{i} (t)$ is then updated at each sampling instant according to:

${\hat{ϕ}}_{i} (t) = {\hat{ϕ}}_{i} (t - 1) - Γ {\hat{e}}_{i} (t) v_{i} (t - 1)$ (6)

where $Γ$ is the step-size matrix and ${\hat{e}}_{i} (t) = {\hat{y}}_{i} (t) - y_{i} (t)$. The PFDL form (5) can therefore be rewritten as:

${\hat{y}}_{i} (t + 1) = y_{i} (t) + v_{i}^{T} (t) {\hat{ϕ}}_{i} (t)$ (7)
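One step of the estimator in (6) and the one-step prediction in (7) can be sketched as follows. This is an illustrative sketch, not the author's implementation: it assumes the step-size matrix has the form $Γ = γ I$ (as in the simulation setting later in the paper), and the function names and numerical values are hypothetical.

```python
def update_phi(phi_hat_prev, v_prev, y_hat, y, gamma):
    """Eq. (6) with Gamma = gamma * I (assumption):
    phi_hat(t) = phi_hat(t-1) - gamma * e_hat(t) * v(t-1)."""
    e_hat = y_hat - y  # e_hat(t) = y_hat(t) - y(t)
    return [p - gamma * e_hat * v for p, v in zip(phi_hat_prev, v_prev)]


def predict_output(y, v, phi_hat):
    """Eq. (7): y_hat(t+1) = y(t) + v(t)^T phi_hat(t)."""
    return y + sum(vi * pi for vi, pi in zip(v, phi_hat))


# Hypothetical numbers for a single agent.
phi_prev = [1.0, 1.0, 1.0, 1.0]
v_prev = [0.5, 0.0, 0.0, 0.0]                    # [du(t-1), dy(t-1), ...]
y_hat = predict_output(0.0, v_prev, phi_prev)    # prediction made at t-1
y_meas = 0.3                                     # measured output at t
phi_new = update_phi(phi_prev, v_prev, y_hat, y_meas, gamma=0.001)
new_pred = predict_output(0.0, v_prev, phi_new)  # same data, updated estimate
```

With a small step size the estimate moves only slightly per sample, so the prediction error for the same data shrinks gradually rather than in one step.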

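3. Reinforcement Learning Algorithm Design

Section 3 presents a model-free adaptive control method that incorporates reinforcement learning. Its main novelty is a controller architecture built on an actor-critic neural network framework driven by an RL algorithm. In addition, combining dynamic linearization with the RL process makes it possible to guarantee system stability.

3.1. Value Function

The value function $r_{i} (t) \in ℝ$ is expressed in terms of the consensus error $ξ_{i} (t)$, namely: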

$r_{i} (t) = α | ξ_{i} (t) |$ (8)

$ξ_{i} (t) = \sum_{j \in N_{i}} a_{i j} (y_{j} (t) - y_{i} (t)) + d_{i} (y_{d} (t) - y_{i} (t))$ (9)

where $α > 0$ is a preset parameter.

The performance index $Q_{i} (t) \in ℝ$ is expressed as:
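Equations (8) and (9) translate almost line for line into code. The sketch below is illustrative only: the adjacency matrix, leader links and output values are hypothetical, and agents are indexed from 0 in the code.

```python
def consensus_error(i, y, y_d, A, d):
    """Eq. (9): xi_i = sum_j a_ij (y_j - y_i) + d_i (y_d - y_i).

    Summing over all j is equivalent to summing over the neighbor set,
    because a_ij = 0 for non-neighbors.
    """
    neighbor_term = sum(A[i][j] * (y[j] - y[i]) for j in range(len(y)))
    return neighbor_term + d[i] * (y_d - y[i])


def reward(xi, alpha):
    """Eq. (8): r_i = alpha * |xi_i|."""
    return alpha * abs(xi)


# Hypothetical data: agent 0 hears the leader, agent 3 hears agents 0 and 2.
A = [[0, 0, 0, 0],
     [1, 0, 0, 0],
     [0, 1, 0, 0],
     [1, 0, 1, 0]]
d = [1, 0, 0, 0]
y = [0.5, 0.4, 0.6, 0.2]
xi0 = consensus_error(0, y, y_d=0.55, A=A, d=d)
xi3 = consensus_error(3, y, y_d=0.55, A=A, d=d)
r3 = reward(xi3, alpha=7)
```

An agent with no link to the leader, such as agent 3 here, sees the reference only through its neighbors' outputs.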

$Q_{i} (t) = β^{N} r_{i} (t + 1) + β^{N - 1} r_{i} (t + 2) + \dots + β^{k + 1} r_{i} (N) + \dots$ (10)

where $0 < β < 1$ and $N > 0$ are preset constants. This expression can be simplified to the following form:

$Q_{i} (t) = \min_{u_{i} (t)} (β Q_{i} (t - 1) - β^{N + 1} r_{i} (t))$ (11)

3.2. Actor-Critic Neural Network

The critic network estimates the value function $Q_{i} (t)$, and the actor network approximates the controller parameter $ψ_{i} (t)$. They are expressed as:

${\hat{Q}}_{i} (t) = {\hat{W}}_{i, c}^{T} (t) H_{i, c} (t)$ (12)

${\hat{ψ}}_{i} (t) = {\hat{W}}_{i, a}^{T} (t) H_{i, a} (t)$ (13)

${\hat{ψ}}_{i} (t) = \mathrm{sat} (\underline{\hat{ψ}}, \bar{\hat{ψ}})$

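where $\underline{\hat{ψ}}$ and $\bar{\hat{ψ}}$ are predefined constants, and ${\hat{W}}_{i, c}^{T}$ and ${\hat{W}}_{i, a}^{T}$ are the weight matrices of the critic and actor neural networks, respectively. The output activations of the actor and critic networks are denoted by $H_{i, a}$ and $H_{i, c}$, respectively.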

$h_{i, c, m} (t) = \exp (- {| ξ_{i} (t) - c_{c, m} |}^{2} / γ_{i}^{2}), m \in \{1, \dots, L_{c}\}$ (14)

$h_{i, a, n} (t) = \exp (- {| ξ_{i} (t) - c_{a, n} |}^{2} / γ_{i}^{2}), n \in \{1, \dots, L_{a}\}$ (15)

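where $h_{i, c, m}$ and $h_{i, a, n}$ are the activations of the m-th hidden node of the critic network and the n-th hidden node of the actor network, respectively, and $γ_{i}$ and c specify the width and centers of the hidden layer. The objective function of the critic neural network is: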

$E_{i, c} (t) = \frac{1}{2} e_{i, c}^{2} (t)$ (16)

where $e_{i, c} (t) = {\hat{Q}}_{i} (t) - β ({\hat{Q}}_{i} (t - 1) - β^{N} r_{i} (t))$. The weight update rules are as follows:

${\hat{W}}_{i, c} (t + 1) = {\hat{W}}_{i, c} (t) - η_{c} H_{i, c} (t) ({\hat{Q}}_{i} (t) + β^{N + 1} r_{i} (t) - β {\hat{Q}}_{i} (t - 1))$ (17)

${\hat{W}}_{i, a}^{T} (t + 1) = {\hat{W}}_{i, a}^{T} (t - 1) - \frac{η_{a} {\bar{ζ}}_{i} (t - 1) (e_{i} (t) - {\hat{Q}}_{i, a} (t)) {\hat{ϕ}}_{i, 1} (t - 1) H_{i, a}^{T} (t - 1)}{1 + | {\bar{ζ}}_{i} (t - 1) |^{2} | H_{i, a} (t - 1) |^{2} | {\hat{ϕ}}_{i, 1} (t - 1) |^{2}}$ (18)

where $η_{a}, η_{c} \in ℝ$ are the update rates and ${\hat{Q}}_{i, a} (t) = {\hat{Q}}_{i} (t) α \, \mathrm{sign} (ξ_{i} (t)) β^{N + 1}$. The control signal is then given by:

$Δ u_{i} (t) = {\bar{ζ}}_{i}^{T} (t) {\hat{ψ}}_{i} (t)$ (19)
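The critic side of the scheme, together with the control law (19), can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the RBF centers and width are hypothetical, the saturation is assumed to clip each element of $\hat{ψ}_{i}$ to fixed bounds, and the actor update (18) is omitted.

```python
import math


def rbf_features(xi, centers, width):
    """Eq. (14)/(15): h_m = exp(-|xi - c_m|^2 / width^2)."""
    return [math.exp(-((xi - c) ** 2) / width ** 2) for c in centers]


def critic_step(w_c, h_now, q_prev, r_now, beta, n_horizon, eta_c):
    """One critic update. Returns (new weights, Q_hat(t), e_c(t))."""
    q_now = sum(w * h for w, h in zip(w_c, h_now))            # Eq. (12)
    # e_c = Q_hat(t) - beta * (Q_hat(t-1) - beta^N * r(t))
    e_c = q_now + beta ** (n_horizon + 1) * r_now - beta * q_prev
    w_new = [w - eta_c * h * e_c for w, h in zip(w_c, h_now)]  # Eq. (17)
    return w_new, q_now, e_c


def control_increment(zeta_bar, psi_hat, psi_min, psi_max):
    """Eq. (19), with element-wise clipping of psi_hat (assumption)."""
    psi_sat = [min(max(p, psi_min), psi_max) for p in psi_hat]
    return sum(z * p for z, p in zip(zeta_bar, psi_sat))


# Hypothetical centers and width; beta, N, eta_c follow the simulation section.
h = rbf_features(0.1, centers=[-1.0, 0.0, 1.0], width=1.0)
w1, q1, e1 = critic_step([0.0, 0.0, 0.0], h, q_prev=0.0, r_now=0.7,
                         beta=0.1, n_horizon=4, eta_c=0.03)
w2, q2, e2 = critic_step(w1, h, q_prev=0.0, r_now=0.7,
                         beta=0.1, n_horizon=4, eta_c=0.03)
du = control_increment([1.0, 2.0], [5.0, -0.5], psi_min=-1.0, psi_max=1.0)
```

Repeating the critic step on the same data shrinks the critic error, which is the gradient-descent behavior expected from minimizing (16).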

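To better illustrate the control technique discussed in this paper, the architecture of the RLMFAC algorithm is shown in Figure 1.

Figure 1. The architecture of the RLMFAC algorithm

Remark 1: Because the system model is not directly available, $ϕ_{i} (t - 1)$ cannot be computed directly. The RLMFAC algorithm therefore uses the data model to approximate $ϕ_{i} (t - 1)$.

Remark 2: Conventional MFAC methods typically use input-output data to approximate both $\partial y_{i} (t + 1) / \partial u_{i} (t)$ and $y_{i} (t + 1)$. In contrast, the RLMFAC strategy does not need to approximate $y_{i} (t + 1)$ during controller design, which broadens its applicability in practice.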

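4. Simulation Verification

This section presents simulation results to verify the effectiveness of the proposed algorithm for multi-agent systems (MASs).

Consider a MAS with four agents and one leader (indexed 0), whose network topology is shown in Figure 2. The system dynamics are: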

$y_{1} (t + 1) = 1.2 u_{1} (t) (5 + \cos (2 y_{1} (t) u_{1} (t))) + 0.8 \sin (y_{1} (t))$

$y_{2} (t + 1) = 0.8 u_{2} (t) (4 + \cos (3 y_{2} (t) u_{2} (t))) + \sin (y_{2} (t))$

$y_{3} (t + 1) = 1.1 u_{3} (t) (3 + \cos (4 y_{3} (t) u_{3} (t))) + 1.2 \sin (y_{3} (t))$

$y_{4} (t + 1) = 0.9 u_{4} (t) (6 + \cos (y_{4} (t) u_{4} (t))) + 0.6 \sin (y_{4} (t))$

Figure 2. The network topology structure of the MAS

where $t \in ℤ^{+}$ denotes the sampling instant. The ideal trajectory $y_{d} (t)$ is defined as:

$y_{d} (t) = 0.55 + 0.25 (\sin (\frac{2 π t}{50}) + \sin (\frac{2 π t}{100}) + \sin (\frac{2 π t}{150}))$
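For readers who want to reproduce the setup, the four plants and the reference trajectory can be coded as below. This sketch covers only the open-loop models; the names `PLANTS` and `reference` are illustrative, and no controller or topology is included.

```python
import math


def reference(t):
    """Ideal trajectory y_d(t) from the simulation section."""
    return 0.55 + 0.25 * (math.sin(2 * math.pi * t / 50)
                          + math.sin(2 * math.pi * t / 100)
                          + math.sin(2 * math.pi * t / 150))


# One-step maps y_i(t+1) = f_i(y_i(t), u_i(t)) for agents 1 to 4.
PLANTS = {
    1: lambda y, u: 1.2 * u * (5 + math.cos(2 * y * u)) + 0.8 * math.sin(y),
    2: lambda y, u: 0.8 * u * (4 + math.cos(3 * y * u)) + math.sin(y),
    3: lambda y, u: 1.1 * u * (3 + math.cos(4 * y * u)) + 1.2 * math.sin(y),
    4: lambda y, u: 0.9 * u * (6 + math.cos(y * u)) + 0.6 * math.sin(y),
}

# Open-loop check: apply a small constant input to agent 1 for a few steps.
y = 0.0
trajectory = []
for t in range(1, 6):
    y = PLANTS[1](y, 0.05)
    trajectory.append(y)
```

The three sinusoids have periods 50, 100 and 150, so the reference repeats every 300 samples and stays within [-0.2, 1.3].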

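The initial conditions are set as follows: $ϕ_{i} (1) = {[1, 1, 1, 1]}^{T}, u_{i} (1) = 0, y_{i} (1) = 0, {\hat{Q}}_{i} (1) = 0, {\hat{W}}_{i, c} (1) = 0_{27 \times 1}, H_{i, c} (1) = 0_{27 \times 1}, {\hat{W}}_{i, a} (1) = 0_{9 \times 4}, H_{i, a} (1) = 0_{9 \times 4}$.

The simulation parameters are set as follows: $Γ = 0.001 \times I_{4 \times 4}, L = 9, N = 4, n_{u} = 1, n_{y} = 3, n_{c} = 1, n_{e} = 3, η_{a} = 0.06, η_{c} = 0.03, α = 7, β = 0.1$.

The simulation results are shown in Figures 3-5. Figure 3 shows the tracking performance of the RLMFAC algorithm for the ideal trajectory $y_{d} (t)$, Figure 4 shows the corresponding tracking errors, and Figure 5 shows the trajectories of the control signals.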

Figure 3. The tracking performance of the RLMFAC algorithm

Figure 4. The tracking errors of the RLMFAC algorithm

Figure 5. The trajectories of the input signals

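The simulation results indicate that the RLMFAC algorithm performs well in MASs and can therefore be applied effectively to nonlinear discrete-time systems.

5. Conclusion

This study proposed the RLMFAC algorithm to address the consensus tracking problem in discrete-time nonlinear multi-agent systems. The core of the method is a specially designed value-function strategy that addresses the complexity of MASs. Simulation results show that the RLMFAC algorithm tracks the ideal trajectory well. However, the algorithm still has difficulty tracking rapidly changing targets. Adapting RLMFAC to a wider range of practical scenarios therefore remains the goal of our ongoing research.

References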

[1]	Schilling, M., Melnik, A., Ohl, F.W., Ritter, H.J. and Hammer, B. (2021) Decentralized Control and Local Information for Robust and Adaptive Decentralized Deep Reinforcement Learning. Neural Networks, 144, 699-725. https://doi.org/10.1016/j.neunet.2021.09.017
[2]	Wang, N., Gao, Y. and Zhang, X. (2021) Data-Driven Performance-Prescribed Reinforcement Learning Control of an Unmanned Surface Vehicle. IEEE Transactions on Neural Networks and Learning Systems, 32, 5456-5467. https://doi.org/10.1109/tnnls.2021.3056444
[3]	Zhang, Y., Chu, B. and Shu, Z. (2019) A Preliminary Study on the Relationship between Iterative Learning Control and Reinforcement Learning. IFAC-PapersOnLine, 52, 314-319. https://doi.org/10.1016/j.ifacol.2019.12.669
[4]	Yue, B., Su, M., Jin, X. and Che, W. (2022) Event-Triggered MFAC of Nonlinear NCSs against Sensor Faults and Dos Attacks. IEEE Transactions on Circuits and Systems II: Express Briefs, 69, 4409-4413. https://doi.org/10.1109/tcsii.2022.3178881
[5]	Liao, Y., Jiang, Q., Du, T. and Jiang, W. (2020) Redefined Output Model-Free Adaptive Control Method and Unmanned Surface Vehicle Heading Control. IEEE Journal of Oceanic Engineering, 45, 714-723. https://doi.org/10.1109/joe.2019.2896397
[6]	Wang, X., Karimi, H.R., Shen, M., Liu, D., Li, L. and Shi, J. (2022) Neural Network-Based Event-Triggered Data-Driven Control of Disturbed Nonlinear Systems with Quantized Input. Neural Networks, 156, 152-159. https://doi.org/10.1016/j.neunet.2022.09.021
[7]	Dorri, A., Kanhere, S.S. and Jurdak, R. (2018) Multi-Agent Systems: A Survey. IEEE Access, 6, 28573-28593. https://doi.org/10.1109/access.2018.2831228
[8]	Chen, F. and Ren, W. (2019) On the Control of Multi-Agent Systems: A Survey. Foundations and Trends® in Systems and Control, 6, 339-499. https://doi.org/10.1561/2600000019
[9]	Olfati-Saber, R., Fax, J.A. and Murray, R.M. (2007) Consensus and Cooperation in Networked Multi-Agent Systems. Proceedings of the IEEE, 95, 215-233. https://doi.org/10.1109/jproc.2006.887293
[10]	Amirkhani, A. and Barshooi, A.H. (2021) Consensus in Multi-Agent Systems: A Review. Artificial Intelligence Review, 55, 3897-3935. https://doi.org/10.1007/s10462-021-10097-x
[11]	Zhao, W., Chen, G., Xie, X., Xia, J. and Park, J.H. (2023) Sampled-Data Exponential Consensus of Multi-Agent Systems with Lipschitz Nonlinearities. Neural Networks, 167, 763-774. https://doi.org/10.1016/j.neunet.2023.09.003
[12]	Ren, H., Liu, R., Cheng, Z., Ma, H. and Li, H. (2024) Data-Driven Event-Triggered Control for Nonlinear Multi-Agent Systems with Uniform Quantization. IEEE Transactions on Circuits and Systems II: Express Briefs, 71, 712-716. https://doi.org/10.1109/tcsii.2023.3305946
[13]	Ma, H., Li, H., Lu, R. and Huang, T. (2020) Adaptive Event-Triggered Control for a Class of Nonlinear Systems with Periodic Disturbances. Science China Information Sciences, 63, Article ID: 150212. https://doi.org/10.1007/s11432-019-2680-1
[14]	Zhu, Y. and Hou, Z. (2014) Data-Driven MFAC for a Class of Discrete-Time Nonlinear Systems with RBFNN. IEEE Transactions on Neural Networks and Learning Systems, 25, 1013-1020. https://doi.org/10.1109/tnnls.2013.2291792
[15]	Hou, Z., Chi, R. and Gao, H. (2017) An Overview of Dynamic-Linearization-Based Data-Driven Control and Applications. IEEE Transactions on Industrial Electronics, 64, 4076-4090. https://doi.org/10.1109/tie.2016.2636126
