Hongyang LI, Qinglai WEI
1 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
2 The State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
3 Institute of Systems Engineering, Macau University of Science and Technology, Macau 999078, China
Abstract: This paper presents a novel optimal synchronization control method for multi-agent systems with input saturation. Multi-agent game theory is introduced to transform the optimal synchronization control problem into a multi-agent nonzero-sum game. Then, the Nash equilibrium can be achieved by solving the coupled Hamilton–Jacobi–Bellman (HJB) equations with nonquadratic input energy terms. A novel off-policy reinforcement learning method is presented to obtain the Nash equilibrium solution without the system models, and critic neural networks (NNs) and actor NNs are introduced to implement the presented method. Theoretical analysis is provided, which shows that the iterative control laws converge to the Nash equilibrium. Simulation results show the good performance of the presented method.
Key words: Optimal synchronization control; Multi-agent systems; Nonzero-sum game; Adaptive dynamic programming; Input saturation; Off-policy reinforcement learning; Policy iteration
Multi-agent synchronization control has attracted much attention due to its high efficiency and computational performance (Wieland et al., 2011; Wei et al., 2018, 2020, 2021; Li JQ et al., 2021; Rehák and Lynnyk, 2021; Zhang KQ et al., 2021). Generally speaking, synchronization control problems require that the agents converge to the same value (Cao et al., 2015; Garcia et al., 2017; Yang JY et al., 2019) or track the trajectories of leaders (Du et al., 2014; Zhao et al., 2014) by designing distributed control laws. Because of the practical significance of multi-agent systems, many researchers have devoted themselves to tackling various synchronization control problems, including switching topologies (Thunberg et al., 2014), system faults (Ma and Yang, 2016), and so on (Han et al., 2013; Wei et al., 2015; He et al., 2018). Among the research works on synchronization control, optimal synchronization control, which requires each agent to minimize its own local performance index function, is a promising research direction. Multi-agent cooperative games provide an effective tool for studying multi-agent optimal control problems, and they rely on solving coupled Hamilton–Jacobi (HJ) equations (Vamvoudakis et al., 2012). However, coupled HJ equations are hard to solve, which limits the applications of cooperative game theory in synchronization control problems.
Reinforcement learning is an effective method for solving coupled HJ equations. The main idea of reinforcement learning is to solve the coupled HJ equations forward in time, which reduces the computational burden (Wang et al., 2009; Wei and Liu, 2014; Wei et al., 2014, 2016, 2017; Zhang HG et al., 2015; Yang N et al., 2019; Zhang LD et al., 2019). In recent years, reinforcement learning has been further developed to solve multi-agent cooperative game problems. Vamvoudakis et al. (2012) proposed an online policy iteration method for optimal synchronization control problems; however, external disturbances were not considered. In Jiao et al. (2016), a novel policy iteration method was proposed for the multi-agent zero-sum game problem, and disturbance rejection was achieved. In Wei et al. (2015), the graphical game was studied for heterogeneous multi-agent systems. An off-policy reinforcement learning method was proposed to solve multi-agent synchronization control problems by Li JN et al. (2017), and the input constraint was considered by Qin et al. (2019). However, there are few research results considering cooperative game problems with input saturation, which motivates our study.
In this paper, the multi-agent optimal synchronization control problem with input saturation is studied based on cooperative game theory and reinforcement learning. Compared with Qin et al. (2019), we consider coupled terms with neighboring agents in the performance index functions. The main contributions can be summarized as follows:
1. A novel off-policy reinforcement learning method is presented for cooperative game problems of multi-agent systems, without requiring information of the system models. The control constraint and the coupled terms in the performance index functions are considered, which broadens the application scope of the presented method.
2. The characteristics of the presented model-free off-policy reinforcement learning method, including convergence and optimality, are analyzed, showing that the solutions obtained from the presented method converge to the Nash equilibrium.
3. Critic neural networks (NNs) and actor NNs are used to implement the off-policy reinforcement learning algorithm. Simulation results verify the good performance of the presented method.
Let Gr = (V, ε, E) be a directed graph, where V = {v1, v2, ..., vN} denotes the nonempty finite vertex set. Furthermore, ε ⊆ V × V is the set of edges. An edge of graph Gr is denoted as εij, which means that agent j is a neighbor of agent i. E = [eij] ∈ RN×N is the adjacency matrix, where eij represents the weight of edge εij. If εij ∈ ε, then eij > 0; otherwise, eij = 0. Let the set of neighbors of agent i be Ni = {vj | (vj, vi) ∈ ε}. Define G = diag(gi) ∈ RN×N as the pinning matrix. If agent i has access to the leader, gi > 0; otherwise, gi = 0. Define the Laplacian matrix as L = D − E, where D = diag(di) and di = Σj∈Ni eij.
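As a minimal numerical sketch of these graph quantities (the adjacency weights and pinning gains below are illustrative, not those of the simulation example), the matrices D, L = D − E, and G can be constructed as follows:

```python
def graph_matrices(E, g):
    """Build the in-degree matrix D = diag(d_i), the Laplacian L = D - E,
    and the pinning matrix G = diag(g_i) from the adjacency matrix
    E = [e_ij] and the pinning gains g. Here d_i is the i-th row sum of E,
    i.e., the total weight of agent i's neighbors."""
    N = len(E)
    D = [[sum(E[i]) if i == j else 0.0 for j in range(N)] for i in range(N)]
    L = [[D[i][j] - E[i][j] for j in range(N)] for i in range(N)]
    G = [[g[i] if i == j else 0.0 for j in range(N)] for i in range(N)]
    return D, L, G
```

By construction, every row of L sums to zero, which is the property used later when relating the synchronization error to the tracking error dynamics.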
For i = 1, 2, ..., N, consider the following systems:
where xi ∈ Rn and ui ∈ Ui ⊂ Rm are the system state and control, respectively. Here, A and B are system matrices with suitable dimensions, and Ui = {ui | ui ∈ Rm, ‖ui‖∞ ≤ λi} (λi > 0 is a known constant). Let the leader dynamics be
where x0 ∈ Rn is the system state. Then, we can define the synchronization error as
Taking the derivative of Eq. (3), we have
For system (4), the performance index function can be given as
where the term u−i represents the policies of the neighbors of agent i, Qii > 0,
and Ψ−1 is the inverse function of the hyperbolic tangent function (i.e., Ψ−1(·) = arctanh(·), or equivalently, Ψ(·) ≜ tanh(·)). Then, Ri(ui) and Ri(uj) can be written as
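As a hedged numerical sketch of this nonquadratic input energy term in the scalar case, the integral 2r∫₀ᵘ λ arctanh(v/λ) dv admits the closed form 2rλ[u arctanh(u/λ) + (λ/2) ln(1 − (u/λ)²)] for |u| < λ; the function names below are illustrative:

```python
import math

def saturated_input_cost(u, lam, r=1.0):
    """Closed form of the scalar nonquadratic energy term
    2*r*integral_0^u lam*arctanh(v/lam) dv, valid for |u| < lam."""
    return 2.0 * r * lam * (u * math.atanh(u / lam)
                            + 0.5 * lam * math.log(1.0 - (u / lam) ** 2))

def saturated_input_cost_quad(u, lam, r=1.0, n=20000):
    """Trapezoidal quadrature of the same integral, as a cross-check."""
    h = u / n
    s = sum((0.5 if k in (0, n) else 1.0) * lam * math.atanh(k * h / lam)
            for k in range(n + 1))
    return 2.0 * r * h * s
```

The cost is nonnegative and grows steeply as u approaches the saturation bound λ, which is what penalizes control signals near the constraint boundary.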
For Eq. (5), the Nash equilibrium condition can be described as
For agent i, we define the iterative value function as
Then, we can obtain the Bellman equation as
with Vi(0) = 0. According to the stationary condition (Bertsekas, 2007), it can be derived that
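For reference, in constrained-input designs of this type (Abu-Khalaf and Lewis, 2005), the stationary condition yields a saturated control law of the following form; graph-dependent coefficients (e.g., involving di + gi) are omitted here as an assumption:

```latex
u_i = -\lambda_i \tanh\!\left(\frac{1}{2\lambda_i}\, R_{ii}^{-1} B^{\top} \nabla V_i(\delta_i)\right)
```

The tanh wrapping guarantees ‖ui‖∞ ≤ λi by construction, which is why the nonquadratic energy term above is paired with this class of control laws.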
Substituting Eq. (11) into Eq. (10), we can obtain the Hamilton–Jacobi–Bellman (HJB) equation as
with Vi(0) = 0, where
We would like to design ui such that the Nash equilibrium condition represented in inequality (8) and the following state synchronization condition are satisfied:
Remark 1  The performance index function considered by Qin et al. (2019) is defined as
Comparing Eqs. (5) and (14), it can be seen that the coupled terms with neighboring agents in the performance index function are considered in this study. In multi-agent systems, the behavior of agent i may have an impact on its neighboring agents. Therefore, the performance index function (5) is more natural for the optimal synchronization control of multi-agent systems.
Remark 2  Based on Eq. (3), it can be derived that
where "⊗" is the Kronecker product and In is the identity matrix of dimension n. According to Vamvoudakis et al. (2012), we have
where σmin(·) represents the minimum singular value of a matrix. Therefore, the stability of the tracking error dynamics represented in Eq. (4) guarantees the state synchronization condition denoted by Eq. (13).
A theorem is provided, which shows that the solution to the HJB equation (i.e., Eq. (12)) satisfies the Nash equilibrium condition (i.e., inequality (8)) under certain conditions.
Theorem 1  Assume that the optimal control law is given as shown in Eq. (11), and that Vi is the positive definite smooth solution to the HJB equation (12). Then, system (4) is asymptotically stable, the optimal control laws (i = 1, 2, ..., N) constitute the Nash equilibrium, and the solution Vi to the HJB equation (12) is the optimal value of the game, i.e.,
Proof  Choosing the iterative value function Vi as the Lyapunov function, it can be derived that
According to Eq. (12), we have
Because of the asymptotic stability of system (4) and the boundary condition Vi(0) = 0, it can be derived that Vi(δi(∞)) = 0. Then, substituting Eq. (12) into Eq. (19) and completing the squares, we can obtain
For Eq. (21), it can be derived that
where Ψ−1(·) = arctanh(·) is monotonically increasing, i.e., (Ψ−1)′ > 0. Therefore, based on the mean value theorem for integrals, it can be derived that
In the previous subsection, it was derived that the optimal control, represented in Eq. (11), can be calculated to construct the Nash equilibrium represented in inequality (8). However, the optimal control in Eq. (11) requires information that can only be obtained from the HJB equation (12). The HJB equation (12) is a nonlinear partial differential equation, which is hard to solve analytically. Therefore, a policy iteration method is provided (Algorithm 1) to solve the HJB equation (12) numerically. Then, a theorem can be provided, which shows the convergence of the presented policy iteration method.
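The evaluate/improve cycle of such a policy iteration can be sketched on an unconstrained scalar LQR problem (a simplification of the saturated multi-agent setting considered here; the system, cost, and function names below are assumptions for illustration):

```python
def policy_iteration_scalar(a, b, q, r, k0, iters=30):
    """Policy iteration for the scalar LQR problem
        dx/dt = a*x + b*u,  J = integral of (q*x^2 + r*u^2) dt,  u = -k*x,
    mirroring the evaluate/improve structure of Algorithm 1 without
    saturation or graph coupling. k0 must be stabilizing (a - b*k0 < 0)."""
    k = k0
    v = 0.0
    for _ in range(iters):
        # Policy evaluation: V(x) = v*x^2 solves the Lyapunov equation
        #   2*v*(a - b*k) + q + r*k^2 = 0  for the current policy.
        v = -(q + r * k ** 2) / (2.0 * (a - b * k))
        # Policy improvement via the stationary condition: k <- b*v/r.
        k = b * v / r
    return k, v
```

For a = b = q = r = 1, the iterates converge to the algebraic Riccati solution p = 1 + √2, matching the fixed point that the evaluate/improve cycle is designed to reach.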
Theorem 2  Assume that agent i and its neighbors update their control policies according to Algorithm 1. Then, the iterative policies converge to the Nash equilibrium, and the iterative value functions converge to the solution of the HJB equation (12).
Proof  Integrating along the system
we have
Based on Eq. (24), we have
Subtracting and adding Eqs. (24) and (28) to the right-hand side of Eq. (27), we have
For Eq. (29), we have
Therefore, we can rewrite Eq. (27) as
Based on Eq. (22) and inequality (23), it can be derived that
However, the system matrices are still required to solve Eqs. (24) and (25). A novel off-policy reinforcement learning method, which does not require information of the system matrices, is presented in the next subsection.
We can rewrite the tracking error dynamics, represented in Eq. (4), as follows:
Taking the derivative along system (34), we have
According to Eq. (25), we have
Then, substituting Eqs. (24) and (36) into Eq. (35), it can be derived that
Therefore, it can be seen that the system matrices are not included in Eq. (37). Based on the Weierstrass high-order approximation theorem (Abu-Khalaf and Lewis, 2005), the critic and actor NNs can be introduced as
where φi ∈ Rhv and φuil1 ∈ Rhul1 (l1 = 1, 2, ..., m) are activation functions and the corresponding weights are constant. Eq. (39) can be written in the following compact form:
Substituting Eqs. (38) and (39) into Eq. (37), we can obtain Eq. (41) (at the top of the next page), where the last term represents the residual error and the primed state denotes δi(t′). Then, Eq. (41) can be written in a simplified form as follows:
Based on the least squares approach, it can be obtained that
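The least-squares step can be sketched generically as solving the normal equations for a stacked data matrix; the actual regressor assembled from the collected data is problem-specific, so the helper below is a hypothetical stand-in:

```python
def lstsq(Phi, y):
    """Solve w = argmin ||Phi*w - y||^2 via the normal equations
    (Phi^T Phi) w = Phi^T y, using Gaussian elimination with
    partial pivoting. Phi is a list of rows (one per data sample)."""
    m, n = len(Phi), len(Phi[0])
    # Form the normal equations A w = b.
    A = [[sum(Phi[r][i] * Phi[r][j] for r in range(m)) for j in range(n)]
         for i in range(n)]
    b = [sum(Phi[r][i] * y[r] for r in range(m)) for i in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w
```

In practice, the rows of Phi would be built from the activation-function data gathered along the trajectories, and w collects the critic and actor weights being identified.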
The resulting model-free off-policy reinforcement learning method is given in Algorithm 2.
Lemma 1  For system (4), suppose that the iterative value functions and iterative control laws are designed as in Eqs. (38) and (39), respectively, where the weights (l1 = 1, 2, ..., m) are updated by Algorithm 2.
The proof can be found in Li JN et al. (2017) and Qin et al. (2019), and is thus omitted here.
Remark 3  In Algorithm 2, the selection of the control laws ui (i = 1, 2, ..., N) is the key to the convergence of the algorithm. Generally, the control laws are selected as ui = −Kiδi + ξi (i = 1, 2, ..., N), where ξi is the exploration noise and Ki is a stabilizing gain matrix.
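A minimal scalar sketch of such a behavior policy is given below; the sum-of-sinusoids exploration noise and the clipping to the saturation bound λi are common choices assumed here, not prescribed above:

```python
import math, random

def behavior_policy(delta, K, t, lam, n_freqs=10, seed=0):
    """Behavior control law u = -K*delta + xi (scalar sketch).
    The exploration noise xi is a sum of sinusoids, a common
    persistent-excitation choice; clipping the result to the
    saturation bound lam is an assumption made here so that the
    applied input respects the constraint set U_i."""
    rng = random.Random(seed)
    freqs = [rng.uniform(0.5, 5.0) for _ in range(n_freqs)]
    xi = sum(math.sin(w * t) for w in freqs) / n_freqs
    u = -K * delta + xi
    return max(-lam, min(lam, u))
```

Because the method is off-policy, the data generated by this exploratory law can still be used to evaluate and improve the target policies.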
Remark 4  For traditional on-policy integral reinforcement learning methods (Vrabie and Lewis, 2011; Liu et al., 2021), the performance index function is evaluated using inaccurate data, which causes a biased estimation. The presented off-policy reinforcement learning method avoids this problem and thus obtains results with higher accuracy.
In this section, a simulation example is provided to show the good performance of the presented method. The structure of the multi-agent systems is shown in Fig. 1, with the following dynamics:
Fig.1 Structure of the multi-agent systems
The Laplacian matrix and the pinning matrix are given as
We define the weight matrices of the performance index function, represented by Eq. (5), as follows:
The simulation is performed with x0(0) = [1, 1]^T, x1(0) = [0.5, −0.5]^T, x2(0) = [1, −0.5]^T, x3(0) = [2, −1]^T, λ1 = 2, λ2 = 1.5, and λ3 = 3. First, we collect the system data {δi, ui} every 0.01 s for i = 1, 2, 3. Then, we solve Eq. (43) iteratively based on the collected system data. The activation functions φi(δi) and φui(δi) are chosen as
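The data-collection step can be sketched by forward-Euler integration of the error dynamics under a behavior policy; the matrices, gain, and horizon below are illustrative, not those of this example:

```python
def collect_data(A, B, K, delta0, lam, dt=0.01, steps=500):
    """Forward-Euler simulation of 2-D error dynamics
    d(delta)/dt = A*delta + B*u under the behavior policy
    u = clip(-K*delta, [-lam, lam]). Returns the (delta, u) samples
    recorded every dt seconds, as would feed the least-squares stage."""
    delta = list(delta0)
    data = []
    for _ in range(steps):
        u = -sum(K[j] * delta[j] for j in range(2))
        u = max(-lam, min(lam, u))  # respect the saturation bound
        data.append((tuple(delta), u))
        ddot = [sum(A[i][j] * delta[j] for j in range(2)) + B[i] * u
                for i in range(2)]
        delta = [delta[i] + dt * ddot[i] for i in range(2)]
    return data
```

With a 0.01 s sampling period, 500 steps correspond to a 5 s data window, after which the collected pairs are stacked into the regression problem solved iteratively.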
The simulation results are shown in Figs. 2–6. The weights of the critic and actor NNs are shown in Figs. 2 and 3, respectively, which demonstrate the stability of Algorithm 2. The synchronization error curves are provided in Fig. 4, and the three-dimensional curves are provided in Fig. 5. From Figs. 4 and 5, it can be seen that optimal synchronization control is achieved. The control curves are shown in Fig. 6, verifying that the control constraint is satisfied.
Fig. 2  Weights of critic neural networks of the multi-agent systems
Fig. 3  Weights of actor neural networks of the multi-agent systems
Fig.4 Synchronization errors of the multi-agent systems
Fig.5 Three-dimensional curves of the multi-agent systems
Fig.6 Control laws of the multi-agent systems
The nonzero-sum game problem of multi-agent systems with input saturation has been studied based on a model-free off-policy reinforcement learning method. It has been shown that the presented off-policy reinforcement learning algorithm makes the iterative control laws converge to the Nash equilibrium without information of the system models. Simulation results have demonstrated the good performance of the presented method.
Contributors
Hongyang LI designed the method, conducted the simulation, and drafted the paper. Qinglai WEI revised and finalized the paper.
Compliance with ethics guidelines
Hongyang LI and Qinglai WEI declare that they have no conflict of interest.
Frontiers of Information Technology & Electronic Engineering, 2022, Issue 7