Yu SHI, Yongzhao HUA, Jianglong YU, Xiwang DONG, Zhang REN
1. School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
2. Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
Abstract: This paper studies the multi-agent differential game problem and its application to cooperative synchronization control. A systematized formulation and analysis method for the multi-agent differential game is proposed, and a data-driven methodology based on the reinforcement learning (RL) technique is given. First, it is pointed out that typical distributed controllers may not necessarily lead to the global Nash equilibrium of the differential game in general cases because of the coupling of networked interactions. Second, to this end, an alternative local Nash solution is derived by defining the best response concept, while the problem is decomposed into local differential games. An off-policy RL algorithm using neighboring interactive data is constructed to update the controller without requiring a system model, and the stability and robustness properties are proved. Third, to further tackle the dilemma, another differential game configuration is investigated based on modified coupling index functions. In contrast to the previous case, the distributed solution can achieve the global Nash equilibrium while guaranteeing stability. An equivalent parallel RL method is constructed corresponding to this Nash solution. Finally, simulation results illustrate the effectiveness of the learning process and the stability of the synchronization control.
Key words: Multi-agent system; Differential game; Synchronization control; Data-driven; Reinforcement learning
Cooperative control of multi-agent systems (MASs) has been a significant part of the networked control field in the past decades due to its wide applications in unmanned vehicles, robotics, and so on. The consensus problem was previously investigated in Olfati-Saber and Murray (2004) and Ren and Beard (2005) based on neighboring information through networks, which provides a fundamental methodology for subsequent studies. The synchronization problem took a step forward, where agents not only reach a common constant value but also track a leader's trajectory using local interactions (Qin et al., 2011; Dong et al., 2014). With the development of related studies, researchers started to pay attention to optimality in cooperative control. Distributed optimization (Yang T et al., 2019) has attracted considerable attention, and has been applied in a wide range of practical scenarios such as smart grids (Zheng et al., 2016; Peng and Low, 2018; Wen et al., 2021) and intelligent transportation (Wang MY et al., 2021). It can be decomposed into a cooperative synchronization control problem with a predefined optimal reference: agents simultaneously execute the consensus and gradient-descent algorithms based on local objectives, while the global objective is achieved. By further considering local coupled objectives, which reflect the conflicts of interest among agents, the distributed game problem (Sun et al., 2017) was proposed. The distributed optimization method can be extended to the algorithm commonly called the Nash seeking strategy (Ye et al., 2018, 2019).
In contrast to distributed optimization with immediate and static objectives, differential game problems formulate an optimization for the controller of dynamic systems. Originating in the optimal control problem (Lewis et al., 2012), the coupling effects of individual actions as well as index functions in dynamic systems increase the complexity of this problem. There are two general types of differential game problems: (1) the zero-sum differential game between two players; (2) the nonzero-sum differential game between multiple players. The zero-sum differential game aims to find two balanced controllers, with one minimizing an index function and the other maximizing the same. This game model, in which each agent has competitive interests, has been commonly used in H2 and H∞ control problems (Modares et al., 2015). The nonzero-sum differential game is a more general case, where each agent holds an individual, either competitive or cooperative, coupled index function (Vamvoudakis and Lewis, 2011). Due to the existence of multiple players, the pursuit of an optimal/balanced solution turns to finding the global Nash equilibrium. Early related work was studied and summarized as game theory (Ba?ar and Olsder, 1982), and some recent results were given in Zhu and Ba?ar (2015), Zhao DB et al. (2015), Wang W et al. (2020), and Zhao JG (2020). However, this differential game is based on a unified system with players being regarded as control inputs, and all players are aware of the full system state.
As a combination of distributed MAS control and game theory, the differential game over a topology graph, namely the graphical game, has attracted more attention in recent years. The graphical game is fundamental to attack-defense and pursuit-evasion problems. Vamvoudakis et al. (2012) built a standard graphical game formulation for linear MAS consensus control, where the index functions were defined according to neighboring errors and neighbors' control actions. Each agent optimized its own index function using the best response method, and the Nash equilibrium was proved to be obtained. In addition, agents can update their optimization results either alternately or simultaneously. Further analysis presented a comparison of the distributed graphical game and the centralized game for discrete-time (DT) MASs (Abouheaf et al., 2014), where it was shown that the distributed graphical game is more challenging due to complex couplings, and the key to handling this issue is solving coupled Hamilton–Jacobi (HJ) equations. However, HJ equations are rather difficult to solve due to complex interactions. Thus, data-driven methods have been considered to find approximate solutions.
Reinforcement learning (RL) (Sutton and Barto, 1998), inspired by natural biological learning mechanisms, provides an online adaptive methodology for decision and control problems. Past results have shown that RL can equivalently deal with optimal control problems for DT (Tamimi et al., 2008) and continuous-time (CT) (Modares and Lewis, 2014) systems without using actual dynamic models. Apparently, this property is suitable for solving the differential game. A Q-learning-based method was implemented to deal with a linear DT zero-sum game in Yang YJ et al. (2021). H∞ robust control for DT systems was modeled as a zero-sum game between the controller and the disturbance in Modares et al. (2015), and the problem was solved using the integral RL method. Recent work has applied the RL-based data-driven method to graphical game problems. The optimal DT MAS consensus problem was investigated in Zhang et al. (2017) under the graphical game frame, and the solution was directly obtained online using neural networks (NNs) and the policy iteration method. This was extended to the synchronization problem using the Q-learning method in Wang MY et al. (2021), where it was proved that the controller stabilizes leader–follower MASs while achieving the respective equilibrium. Off-policy RL was implemented in the CT MAS synchronization problem for the controller to satisfy the Nash equilibrium in the graphical game (Li et al., 2017; Mu et al., 2017).
It is worth pointing out that the Nash property in the graphical game should be further discussed. Although RL-based data-driven methods provide efficient ways to solve coupled HJ equations, the NNs that represent the controllers are usually designed to depend only on the immediate neighboring error. As stated in the latest research (Liu et al., 2021; Qian et al., 2021), the approximate results may not satisfy the global Nash equilibrium condition due to the couplings. A typical min–max method was used to deal with the local optimization problem in Lopez et al. (2020). Li et al. (2017) tried to reduce the couplings between connected agents by choosing locally related index functions. However, to the best of our knowledge, there are still few guidelines for the formulation and analysis of graphical game based cooperative synchronization of CT MASs and its solution using an RL method. Finding an institutionalized and systematized MAS graphical game framework remains an open question.
To this end, this paper studies the multi-agent differential graphical game with respect to the cooperative synchronization control problem using a data-driven RL method. First, by analyzing the solvability and solution structure of the HJ equations, a contradiction between distributed control and the global Nash equilibrium is presented. Second, inspired by the original work in the field of graphical games (Vamvoudakis et al., 2012) and dynamic game theory (Zhu and Ba?ar, 2015), two compromised schemes are proposed in this paper: the local Nash game solution using a strict best response method and the global Nash solution with respect to modified index functions. Third, an off-policy RL algorithm is investigated to design model-free controllers for both scenarios, with the stability properties proved.
Compared with existing work,the main contributions of this paper are as follows:
1. This paper proposes a general game-based framework for the MAS cooperative synchronization control problem, where the optimum or equilibrium is considered in the controller design process instead of stability only.
2. The distributed and Nash properties of the game solution are discussed in detail. It is proved that they may not hold at the same time, in contrast to simplified cases under immediate neighbor-related assumptions (Li et al., 2017; Zhao JG, 2020; Wang MY et al., 2021).
3. A systematized scheme consisting of local and modified global cases is proposed with guaranteed Nash equilibrium using a data-driven method. An off-policy RL algorithm for CT MASs is derived to solve this problem. Compared with the conventional model-based and existing on-policy RL methods in Vamvoudakis et al. (2012), this paper provides an adaptive and model-free method that relaxes the dependence on the system model.
Let vec(·) represent a vector expanded from a matrix with respect to its columns. I_N denotes the identity matrix of dimension N. The notation "⊗" represents the Kronecker product. ?_x f(x) stands for the gradient of f with respect to the variable x.
A graph consisting of N+1 nodes can be represented by G = {S, W}. S = {s_1, s_2, ..., s_{N+1}} is the node set, and the notation (s_i, s_j) denotes a directed path from a parent node s_i to a child node s_j. The connectivity between nodes is described by nonnegative weights w_ij, and the associated adjacency matrix is constructed as W = [w_ij] ∈ R^{(N+1)×(N+1)}. The weight w_ij = 1 if and only if (s_j, s_i) exists, and w_ij = 0 otherwise. w_ii = 0 holds since there exists no self-loop in G. Let N_i stand for the set of neighbors of node s_i. The Laplacian matrix representing the topology interactions can be defined by L = D - W, where D denotes the in-degree matrix of the graph. Moreover, if there is a root node that has at least one directed path to every other node, the graph is said to contain a spanning tree.
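As a concrete illustration, the adjacency, in-degree, and Laplacian matrices above can be assembled as follows. This is a minimal sketch; the 5-node topology used here is a hypothetical example, not the graph of the later simulations.

```python
import numpy as np

# Hypothetical 5-node digraph (node 0 plus four others), assumed for
# illustration only.
N_plus_1 = 5
W = np.zeros((N_plus_1, N_plus_1))
# w_ij = 1 iff the edge (s_j, s_i) exists, i.e., node j is a parent of node i.
edges = [(0, 1), (1, 2), (0, 3), (2, 3), (3, 4)]  # (parent, child) pairs
for parent, child in edges:
    W[child, parent] = 1.0

D = np.diag(W.sum(axis=1))   # in-degree matrix: d_i = sum_j w_ij
L = D - W                    # graph Laplacian

# Each row of L sums to zero by construction.
print(np.allclose(L.sum(axis=1), 0))  # -> True
```

The zero row sums reflect the defining property L = D - W: the diagonal in-degree entries exactly cancel the off-diagonal adjacency weights.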
Consider MASs that consist of one leader and N followers. The leader is labeled 0, and the followers are labeled by the set F = {1, 2, ..., N}. The leader is defined as an agent with no neighbor, while each follower has at least one neighbor. Then the Laplacian matrix can be partitioned as L = [0, 0_{1×N}; L_2, L_1], where the matrix L_1 ∈ R^{N×N} describes the interactions among the followers, and L_2 ∈ R^{N×1} represents the interactions from the leader to the followers. Define the in-degree matrix among the followers as D_1 = diag(d_{F1}, d_{F2}, ..., d_{FN}) ∈ R^{N×N}.
Assumption 1  The interactions among the followers are strongly connected and there exists a spanning tree in graph G with the leader being a root node.
Lemma 1  Under Assumption 1, all the eigenvalues of L_1 have positive real parts, and there exists a positive definite matrix Ω = diag(ω_1, ω_2, ..., ω_N) such that ΩL_1 + L_1^T Ω > 0.
The dynamics of follower i ∈ F is given by

dx_i(t)/dt = A x_i(t) + B u_i(t),    (1)
where x_i(t) ∈ R^n and u_i(t) ∈ R^m represent the state and control input vectors, respectively. A ∈ R^{n×n} and B ∈ R^{n×m} are a follower's unknown dynamic and input matrices, respectively, where the pair (A, B) is stabilizable.
The dynamics of the leader is defined as

dx_0(t)/dt = A x_0(t),    (2)
where x_0(t) ∈ R^n is the state of the leader. The leader given in Eq. (2) generates a reference trajectory with which the followers synchronize.
In the graph with local interactions, a neighboring synchronization error that can be directly obtained by follower i is defined as

ξ_i(t) = Σ_{j∈N_i} w_ij (x_i(t) - x_j(t)),    (3)

where the leader is included in N_i whenever w_i0 > 0.
Differentiating Eq. (3) along Eqs. (1) and (2) yields the neighboring error dynamics:

dξ_i(t)/dt = A ξ_i(t) + d_i B u_i(t) - Σ_{j∈N_i} w_ij B u_j(t),    (4)

where d_i is the in-degree of follower i and u_0 = 0.
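The per-agent errors of Eq. (3) can equivalently be stacked through the follower block L_1 of the Laplacian. The sketch below verifies this identity numerically on a hypothetical leader-follower topology (assumed for illustration, not necessarily the paper's Fig. 1 graph):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 2, 4  # state dimension, number of followers

# Hypothetical topology: node 0 is the leader, (parent, child) edges below.
W = np.zeros((N + 1, N + 1))
for parent, child in [(0, 1), (1, 2), (2, 3), (1, 4), (3, 2)]:
    W[child, parent] = 1.0
L = np.diag(W.sum(axis=1)) - W
L1 = L[1:, 1:]  # follower-to-follower block

x0 = rng.standard_normal(n)       # leader state
x = rng.standard_normal((N, n))   # follower states

# Per-agent neighboring error: xi_i = sum_j w_ij (x_i - x_j), leader included.
xi_local = np.array([
    sum(W[i + 1, j] * (x[i] - (x0 if j == 0 else x[j - 1]))
        for j in range(N + 1))
    for i in range(N)
])

# Equivalent stacked form: xi = (L1 kron I_n)(x - 1_N kron x0).
xi_stacked = (np.kron(L1, np.eye(n)) @ (x - x0).ravel()).reshape(N, n)

print(np.allclose(xi_local, xi_stacked))  # -> True
```

The equivalence holds because each row of the full Laplacian sums to zero, so the leader column is absorbed into L_1 acting on the tracking errors x_i - x_0.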
Definition 1 (Synchronization control)  For MASs with any given bounded initial conditions, if there exists a distributed controller u_i(t) = u_i(ξ_i(t)) for follower i such that lim_{t→∞} ||x_i(t) - x_0(t)|| = 0 for all i ∈ F,
then the cooperative synchronization control is said to be accomplished.
Definition 2 (Multi-agent differential game)  In the MAS synchronization process, follower i maintains a local index function J_i(t), and there exists a corresponding value function V_i(t) given in the following form:
where the notation "-i" denotes the neighbors of the ith follower. Then the MASs are said to fulfill the differential game when each agent pursues the local optimal controller:
Definition 3 (Nash equilibrium (Ba?ar and Olsder, 1982))
1.Local best response
The local best response controller u_i* of the ith follower, corresponding to J_i and u_j, j ≠ i, is defined to satisfy
2.Global Nash equilibrium
Consider the unified solution of the differential game with index functions {J_i} and controller tuples {u_1*, u_2*, ..., u_N*}. The MAS is said to achieve the global Nash equilibrium under the condition
Remark 1  The purpose of this study is to unify synchronization control in a differential game frame. This problem is multifaceted due to the coupled index functions, which are defined using the neighboring error and the control of immediate neighbors. Similar to the optimal solution in single-agent cases, the Nash equilibrium provides an equivalent concept for the multi-agent differential game; i.e., the best response of each individual follower forms the global Nash solution. Since L_1 is nonsingular according to Lemma 1, ξ(t) and ζ(t) have the same convergence. Thus, the graphical game in this study is constructed based on ξ(t).
Consider the local index function defined in Eq. (7) with the following common quadratic form:
For any arbitrary finite value function, differentiating Eq. (7) along the system dynamics yields the Bellman equation and the Hamiltonian function:
where V_i stands for the value function corresponding to the current u_i and u_{-i}. Using the stationary condition gives the local best response as
Then one can obtain the coupled HJ equation by substituting Eq. (13) into Eq. (12):
Note that there is a set of N equations of the form of Eq. (15) that require a uniform solution P. To simplify the analysis, adding the equations with index i from 1 to N yields a reduced necessary condition:
Because the product of (L_1 ⊗ I_n) and the block-diagonal term is usually not symmetric due to the different values of the in-degrees d_k, k ∈ F, Eq. (16) is not a typical algebraic Riccati equation and may be difficult or even impossible to solve. This indicates that the Nash solution may not exist. Moreover, as an assumption that has been commonly made in the literature, consider that each controller uses only its respective neighboring information. It can then be further inferred that ?_{ξ_i}V_i is related only to ξ_i, so that P = diag(P_1, P_2, ..., P_N) is a block diagonal matrix, where P_i is a symmetric positive-definite matrix. Then the necessary condition is equivalent to
Since there exists no solution tuple for the coupled Eq. (17) according to the positive-definite property of P_i, the requirements of being distributed and achieving the global Nash equilibrium cannot be fulfilled at the same time.
Remark 2  Because the index function (11) is defined in quadratic form, it is reasonable to assume the value function to be quadratic. There is no block diagonal or symmetric solution, which indicates that a distributed and global Nash solution does not exist. One can see from Eq. (16) that the coupled term L_1 ⊗ I_n prevents the equation from being solved. To dig deeper, the coupled term is introduced by both ξ_i and u_j in Eq. (11), whereas the value function is related not only to the immediate neighbors, but also to "neighbors' neighbors." In addition, the reduced index function without u_j (Li et al., 2017) cannot solve this problem due to the use of ξ_i. Thus, the key issue lies in how to cover or decouple the graphical game, which is the motivation of this study.
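The asymmetry obstruction described above is easy to exhibit numerically. The sketch below uses a toy directed follower graph with unequal in-degrees (all values assumed for illustration) and shows that the coupled term formed from L_1 ⊗ I_n and a block-diagonal candidate is not symmetric, so no Riccati-type structure can emerge:

```python
import numpy as np

# Toy directed follower graph, assumed for illustration only.
W1 = np.array([[0., 1., 0.],
               [0., 0., 1.],
               [1., 1., 0.]])                  # follower-to-follower weights
d = W1.sum(axis=1) + np.array([1., 0., 0.])    # in-degrees, incl. a leader edge
L1 = np.diag(d) - W1

P = np.eye(2)  # any symmetric candidate block for the value function
# Coupled term: (L1 kron I_n) times the block-diagonal matrix diag(d_i P).
M = np.kron(L1, np.eye(2)) @ np.kron(np.diag(d), P)

print(np.allclose(M, M.T))  # -> False: the coupled term is not symmetric
```

Because the in-degrees d_k differ across followers, even a symmetric block choice P cannot symmetrize the product, which is the numeric face of the solvability obstruction in Eq. (16).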
Inspired by the zero-sum game mechanism, in this subsection we propose a local differential game solution using the best response method. Off-policy RL for CT systems based on the state-action function is derived for stable controller design.
Consider the local index function defined in the following quadratic form:
The Hamiltonian can be written accordingly as
Now the local differential game regards u_j, j ∈ F, as extra players to help form the local best response results, while the corresponding solutions can be given by
where u_ij is the local value of u_j in the ith game. In the local scenario, the value function is assumed to depend only on local variables. Note that the HJ equations independently hold along any arbitrary trajectory of ξ_i. Substituting Eq. (20) into Eq. (19) yields the decoupled equation:
Theorem 1  Consider the differential MAS graphical game with dynamics (1) and (2) and index function (18). The local best response solution derived in Eq. (21) produces the following results: (1) the local Nash equilibrium is achieved for the differential game; (2) the distributed controller gives L2 stability against other players; (3) the asymptotic stability of the MAS synchronization error is guaranteed.
Integrating both sides of inequality (27) yields
This means that the error gives a bounded response under any u_ij; i.e., the L2 stability is realized.
The local L2 property cannot lead to global asymptotic stability for MASs. For each agent using the local controller, the closed-loop dynamics can be obtained by
Without loss of generality, it is assumed that the R_ii are uniformly selected. Then the augmented system can be written as
Remark 3  Theorem 1 gives a sufficient condition of asymptotic stability, whereas uniformly choosing R_ii aims only to simplify the proof. This selection is reasonable due to the redundant design variables. To be specific, the controller can be tuned by Q_i and the interaction weights can be compensated for by R_ij. It is worth pointing out that inequality (26) holds only in the case of the local best response. Because the best responses of different local games are incompatible, the local Nash equilibrium does not lead to the global Nash equilibrium.
It can be seen that the coupled HJ equations (19) and (21) involve a complex solution process and require extra communication of other followers' parameters. An off-policy RL algorithm is proposed to provide an online solution using system data.
Denote u_i^k and u_ij^k as the updated controllers, V_i^k as the value function in the kth learning phase, and u_i^{k+1} and u_ij^{k+1} as the best response controllers related to V_i^k. According to the property of the Hamiltonian function deduced in Eq. (23), the following equation holds for any arbitrary (u_i, u_ij), j ∈ N_i:
Then the off-policy recursive equation can be derived by integrating Eq. (33) over a time interval T as
One can see that the left-hand side of Eq. (34) contains the value function and the best-response action pairs, which act as the action-dependent Q-functions of a CT system. Employ NNs to approximate the value function and the best response controllers as
where W_i1 ∈ R^{p1}, W_i2 ∈ R^{p2}, and W_ij2 ∈ R^{p3} are the weights, and φ_i(ξ_i) ∈ R^{p1}, φ_i(ξ_i) ∈ R^{p2}, and φ_ij(ξ_i) ∈ R^{p3} represent the NN basis functions. Then write the cross terms in the linear form as
Then the approximation error of Eq. (34) is given by
The NN update law is designed by reducing the approximation error using the gradient-descent method as
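The spirit of such a gradient-descent weight update can be sketched on a toy linear-in-weights approximator. Everything below (the quadratic basis, the target weights w_true, the learning rate) is a hypothetical stand-in for the paper's NN structures, used only to show the update e·φ driving the squared approximation error down:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical approximator V_hat(xi) = w^T phi(xi) with a quadratic basis.
def phi(xi):
    return np.array([xi[0] ** 2, xi[0] * xi[1], xi[1] ** 2])

w_true = np.array([1.0, 0.4, 2.0])  # assumed target weights for the demo
w_hat = np.zeros(3)
alpha = 0.05                        # learning rate

for _ in range(3000):
    xi = rng.uniform(-1, 1, size=2)             # persistently exciting samples
    e = (w_hat - w_true) @ phi(xi)              # approximation error
    w_hat -= alpha * e * phi(xi)                # gradient step on 0.5 * e^2

print(np.allclose(w_hat, w_true, atol=1e-2))  # -> True
```

Since the target is realizable in the chosen basis and the samples are persistently exciting, the stochastic-gradient iteration converges to the true weights.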
The detailed learning procedure is summarized in Algorithm 1.
Remark 4  The two-loop structure of Algorithm 1 has the same convergence property as that of the policy iteration method; the inner loop equivalently solves the Bellman equation (34) and the outer loop updates the controller to its optimal solution. In practice, Eq. (39) can be launched at an independent rate higher than the data collection rate. The algorithm combines the Q-learning mechanism and integral RL for a CT system. By deriving the action-dependent value function according to Eq. (34), the model-free and off-policy properties are guaranteed compared with other methods. Thus, the controller pair to be updated and the input pair driving the system can be different. The collected data can be stored in a data buffer and reused during the learning process, which helps improve the convergence with higher data efficiency. In the learning phase, each follower can be initialized with an arbitrary stabilizing local or distributed controller. This avoids the contradiction between u_j and u_ij.
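The two-loop convergence claim can be illustrated in model-based shorthand for a single agent: the inner loop's Bellman solve collapses to a Lyapunov equation and the outer loop improves the policy (Kleinman's iteration). The matrices (A, B, Q, R) and the initial gain below are hypothetical; this is a sketch of the policy-iteration structure, not the paper's model-free data pipeline:

```python
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

A = np.array([[0.0, 1.0], [1.0, -0.5]])  # open-loop unstable, assumed
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])

K = np.array([[3.0, 3.0]])               # initial stabilizing policy
for _ in range(10):
    Ak = A - B @ K
    # Inner loop: policy evaluation (the role of the Bellman equation).
    P = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
    # Outer loop: policy improvement.
    K = np.linalg.solve(R, B.T @ P)

# The iteration converges to the optimal value given by the ARE.
P_opt = solve_continuous_are(A, B, Q, R)
print(np.allclose(P, P_opt, atol=1e-6))  # -> True
```

The off-policy algorithm in the paper performs the same evaluation/improvement alternation, but replaces the model-based Lyapunov solve with a least-squares fit of Eq. (34) on collected trajectory data.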
To tackle the unsolvable problem in Section 3.1 and the local Nash restriction in Section 3.2, a general index function for the differential graphical game is proposed in this subsection to achieve both distributed and global Nash control.
Define the following modified index function in a general case:
S_ij and Q_ij are parameter matrices chosen to decouple the differential game and are to be determined later. Compared with earlier results, one can see from Eq. (40) that the neighboring error ξ_j of the jth agent is included in the quadratic form, which gives a more general case.
Theorem 2  Consider the differential game of MASs with dynamics (1) and (2) and index function (40). Under the sufficient condition that the parameter matrices are given with the following elements:
the value function of each follower can be decoupled in a distributed form, while the corresponding distributed controller maintains two properties: (1) global asymptotic stability for synchronization control and (2) achievement of the global Nash equilibrium.
Proof  Similarly, the Hamiltonian function can be rewritten as
The value function corresponding to the distributed controllers is supposed to be quadratic in ξ_i. Then the distributed controller in the best response form satisfies the stationary condition:
Substituting Eq. (43) into Eq. (42) and rearranging the terms give the HJ equation:
Apparently, when the extra parameter matrices are chosen by Eq. (41), the coupled terms can be eliminated and Eq. (44) is equivalent to the local algebraic Riccati equation:
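Once the coupling terms are eliminated, each agent faces a standard local ARE that off-the-shelf solvers handle directly. A minimal sketch, with hypothetical (A, B) and the weights Q_i = 5I_2, R_ii = 1 borrowed from the later simulation section:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [1.0, -0.5]])  # hypothetical follower dynamics
B = np.array([[0.0], [1.0]])
Qi = 5 * np.eye(2)                       # Q_i = 5 I_2, as in the simulations
Rii = np.array([[1.0]])

P = solve_continuous_are(A, B, Qi, Rii)
K = np.linalg.solve(Rii, B.T @ P)        # local best-response gain

# The ARE residual vanishes and the local closed loop is Hurwitz.
residual = A.T @ P + P @ A + Qi - P @ B @ np.linalg.solve(Rii, B.T @ P)
print(np.allclose(residual, 0, atol=1e-8),
      np.all(np.linalg.eigvals(A - B @ K).real < 0))  # -> True True
```

Solvability of this decoupled equation is exactly what the coupled condition (17) lacked: each P_i is now symmetric positive definite and independent of the neighbors' value functions.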
This indicates that the HJ equation is solvable, in contrast to Eq. (17), and that the solution controller is in a distributed form. Note that the controller for the local agent has the same form as in Section 3.2. The stability analysis and condition are similar to those in the local case and are omitted here.
Consider the Hamiltonian function related to the controllers, which satisfies
Therefore, the optimal controllers given by Eq. (45) build the global Nash equilibrium. This completes the proof.
Since the construction of index functions (40) and (41) involves the equilibrium value, whereas the RL method requires online feedback data from index functions, a contradictory problem arises. Note that the final global Nash solution is equivalent to that in the uncoupled case. To this end, the RL procedure can be constructed in a parallel way as in the single-agent case while regarding the neighbors' inputs as known vectors. The NN approximations are given in the same form as in Section 3.2. The NN approximation error is given by
The remaining steps are similar to those in Algorithm 1 and are omitted here.
Remark 5  In summary, the local best response solution considers interactions virtually in the worst case, which indicates the L2 stability in inequality (28). The global Nash solution guarantees the unified equilibrium in Eq. (45) by directly handling real interactions. Although the above two cases are derived on divergent bases, they can be solved uniformly and systematically by the proposed RL-based method because of the same formulation structure.
Consider a multi-agent system consisting of one leader labeled 0 and four followers labeled from 1 to 4 with the interaction topology demonstrated in Fig. 1. To include as many attributes as possible, the followers are designed with different configurations. The numbers of neighboring players and the in-degrees are given as the pairs (0, 1), (1, 2), (1, 1), and (2, 2).
Fig.1 Communication topology of a multi-agent system (MAS)
The dynamics of the MAS is chosen to be open-loop unstable as
1.Solution in the local best response case
The theoretical results of the nominal controllers are listed as follows to show the validity of the RL method and local Nash properties:
One can see that even if the parameters are the same, divergent graph settings bring different solutions. The synchronization error of each agent is given in Fig. 3a to illustrate the stability of the local best response based controllers. To further verify the local Nash property stated in Eq. (34), the L2 gains for agents 2, 3, and 4 are suppressed under the threshold 1.0, as shown in Fig. 4a. Agent 1 is omitted here because it has no neighbor in the follower set.
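The kind of closed-loop behavior reported here can be reproduced in miniature. The sketch below simulates leader-follower synchronization with an ARE-based distributed gain; the topology (an undirected follower path pinned at one end), the dynamics (A, B), and the coupling-gain scaling c are all hypothetical stand-ins chosen so that c·λ_min(L_1) ≥ 1/2 holds, not the paper's exact simulation setup:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [1.0, -0.5]])  # hypothetical, open-loop unstable
B = np.array([[0.0], [1.0]])
P = solve_continuous_are(A, B, 5 * np.eye(2), np.eye(1))
K = B.T @ P                              # u_i = -c * K @ xi_i

# Undirected follower path 1-2-3-4 with the leader pinned to follower 1.
W1 = np.diag([1.0, 1.0, 1.0], 1)
W1 += W1.T
L1 = np.diag(W1.sum(axis=1)) - W1 + np.diag([1.0, 0.0, 0.0, 0.0])
c = 1.0 / np.linalg.eigvalsh(L1).min()   # scale gain to cover lambda_min

rng = np.random.default_rng(2)
x0 = rng.standard_normal(2)              # leader state
x = rng.standard_normal((4, 2))          # follower states
dt = 0.002
init_err = np.linalg.norm(x - x0)
for _ in range(4000):                    # Euler integration over 8 s
    xi = (np.kron(L1, np.eye(2)) @ (x - x0).ravel()).reshape(4, 2)
    u = -c * xi @ K.T                    # distributed controller
    x = x + dt * (x @ A.T + u @ B.T)
    x0 = x0 + dt * (A @ x0)
print(np.linalg.norm(x - x0) < 0.05 * init_err)
```

Even though the leader trajectory itself diverges (A is unstable), the synchronization errors contract, since the error dynamics are governed by the Hurwitz matrices A - cλ_i(L_1)BK.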
Fig.2 Convergence curves of the neural network (NN) weights in the local best response case:(a) agent 1;(b) agent 2;(c) agent 3;(d) agent 4
Fig.3 Multi-agent system (MAS) synchronization errors in the local best response case (a) and the global Nash case (b)
Fig.4 Evolution of index functions in the local best response case (a) and the global Nash case (b)
2.Solution in the global Nash case
The index functions of the global Nash case are set with Q_i = 5I_2, R_ii = 1, and R_24 = R_31 = R_42 = R_43 = 0.1. The RL procedure is conducted in the same manner as in the single-agent case, where the coupled neighboring control inputs can be regarded as off-policy signals. Taking agents 1 and 3 as examples, the convergence of the learning procedures is demonstrated by the error curves of the NN weights to their target values in Fig. 5. The distributed global Nash controller and the compensational terms of the modified index functions can be given as follows based on the learning results:
Fig.5 Convergence error curves of neural network (NN) weights in the global Nash case:(a) value function weight error;(b) controller weight error
Similarly, the stability in the global Nash case is verified by demonstrating the synchronization error in Fig. 3b.
Moreover, Fig. 4b shows the evolution of index function (40) for the global controllers in Section 3.3. The solid lines represent the evolution curves generated by the global Nash equilibrium controllers. The dashed lines denote the results of the above controllers with random biases. As compared in the diagram, one can see that the global Nash controller of each agent gives a smaller index value than the biased controller, which conforms to the global Nash equilibrium.
This paper studied the cooperative synchronization control of MASs from a differential game perspective. The solution is highly coupled and may not exist in general cases. The local best response controller and the global Nash controller with modified index functions were successively investigated to deal with the coupling issues. An off-policy RL method was proposed to solve the problem online in a data-driven manner. The configurations of the graph topology and system dynamics affect the solutions. Thus, extending the results to more complex scenarios, such as the existence of switching topologies and disturbances, is a significant direction for future research.
Contributors
Yu SHI designed the research, conducted the simulations, and drafted the paper. Yongzhao HUA and Jianglong YU helped organize the paper. Xiwang DONG and Zhang REN revised and finalized the paper.
Compliance with ethics guidelines
Yu SHI, Yongzhao HUA, Jianglong YU, Xiwang DONG, and Zhang REN declare that they have no conflict of interest.
Frontiers of Information Technology & Electronic Engineering, 2022, Issue 7