

        Heterogeneous multi-player imitation learning

Control Theory and Technology, 2023, Issue 3

Bosen Lian · Wenqian Xue · Frank L. Lewis

Abstract This paper studies imitation learning in nonlinear multi-player game systems with heterogeneous control input dynamics. We propose a model-free data-driven inverse reinforcement learning (RL) algorithm for a learner to find the cost functions of an N-player Nash expert system given the expert's states and control inputs. This allows us to address the imitation learning problem without prior knowledge of the expert's system dynamics. To achieve this, we first provide a basic model-based algorithm built upon RL and inverse optimal control. It serves as the foundation for our final model-free inverse RL algorithm, which is implemented via neural network-based value function approximators. Theoretical analysis and simulation examples verify the methods.

Keywords Imitation learning · Inverse reinforcement learning · Heterogeneous multi-player games · Data-driven model-free control

        1 Introduction

Imitation learning is a popular approach in which a learner mimics the demonstrated behaviors of an expert based on limited samples of the expert's demonstrations. It is commonly addressed through two main approaches: behavioral cloning [1,2] and inverse reinforcement learning (RL) [3,4]. The former tackles the problem by learning a policy directly from expert demonstrations, treating it as a supervised learning task. In contrast, inverse RL focuses on reconstructing the underlying objective function, such as a cost or reward function, that best explains the expert's demonstrations. The objective function serves as a higher-level index, providing transferability and robustness in control problems. Notably, inverse RL requires significantly fewer data [5] compared to behavioral cloning [6], making it more efficient in terms of sample utilization. The success of inverse RL is shown in a wide range of applications such as autonomous driving [7,8], anomaly detection [9,10], and energy management [11].

Inverse RL and inverse optimal control (IOC) share fundamental principles and are often considered closely related concepts. Conventional IOC methods in control systems [12] typically rely on the system dynamics to solve for the objective function. In contrast, inverse RL techniques in the machine learning field can construct objective functions using only demonstrated expert trajectories, without explicitly requiring the system dynamics. Notably, model-free inverse RL algorithms have been developed for single-player and multi-player Markov decision processes [13-16]. Inspired by the success of inverse RL and IOC, researchers have combined these approaches to develop model-free data-driven inverse RL methods for control systems with a single control input [17,18] and for two-player games [19-22].

Control problems involving multiple inputs and players are prevalent in practical applications. Multi-player games [23] provide a framework for modeling the decision-making process of multiple players who interact with each other. In these games, each player, corresponding to one control input, seeks a Nash equilibrium solution of a non-zero-sum game. To compute such a solution for each player, two primary online computational approaches have emerged: model-free off-policy RL [24-27] and online data-driven actor-critic methods [28-30]. Building upon these developments, data-driven inverse RL has been studied for multi-player games [31,32].

This paper studies the learner-expert imitation learning problem in the context of nonlinear heterogeneous multi-player game systems. Specifically, we consider non-zero-sum games and propose a data-driven inverse RL approach for the learner to find the cost functions of all players from expert demonstrations without knowing the system dynamics. The contributions of this paper are twofold. First, we propose a data-driven model-free inverse RL algorithm that requires no knowledge of the system dynamics. This is in contrast to previous work [31] that requires knowledge of the input dynamics of all heterogeneous players. Second, we address a practical limitation of previous approaches [31,32], which required the learner and the expert to have the same initial states during the learning process; this paper removes that requirement.

        2 Notations and preliminaries

        2.1 Notations

R, R^n, and R^{n×m} denote the set of all real numbers, n-dimensional column vectors, and n×m matrices, respectively. ‖·‖ denotes the 2-norm. For x ∈ R^n and y ∈ R^m, x ⊗ y ≜ [x_1 y_1, ..., x_1 y_m, x_2 y_1, ..., x_2 y_m, ..., x_n y_m]^T. For A = (a_ij) ∈ R^{n×m}, vec(A) ≜ [a_11, ..., a_1m, a_21, ..., a_2m, ..., a_nm]^T. For a symmetric matrix A = (a_ij) ∈ R^{n×n}, vem(A) ≜ [a_11, a_12, ..., a_1n, a_22, ..., a_{n-1,n}, a_nn]^T. λ_min(·) denotes the minimum eigenvalue. I_n denotes the identity matrix of size n.
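Since these stacking operators reappear in the least-squares constructions of Section 4.2, the following small Python sketch (numpy assumed; the function names are illustrative, not from the paper) shows what each operator computes:

# Sketch of the stacking operators from Section 2.1 (numpy assumed; names illustrative).
import numpy as np

def kron_vec(x, y):
    """x ⊗ y = [x1*y1, ..., x1*ym, x2*y1, ..., xn*ym]^T (Kronecker product of vectors)."""
    return np.kron(x, y)

def vec(A):
    """Row-wise stacking of A: [a11, ..., a1m, a21, ..., anm]^T."""
    return A.reshape(-1)

def vem(A):
    """Upper-triangular stacking of a symmetric A: [a11, a12, ..., a1n, a22, ..., ann]^T."""
    return A[np.triu_indices(A.shape[0])]

x = np.array([1.0, 2.0]); y = np.array([3.0, 4.0, 5.0])
print(kron_vec(x, y))   # [ 3.  4.  5.  6.  8. 10.]
S = np.array([[1.0, 2.0], [2.0, 5.0]])
print(vec(S), vem(S))   # [1. 2. 2. 5.] [1. 2. 5.]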

        2.2 Expert system

Consider an N-player expert system (1), where x_e(t) ∈ R^n is the expert state, u_je(t) ∈ R^{m_j} is the optimal control input of expert player j, j ∈ N ≜ {1, 2, ..., N}, associated with some underlying cost function, and f(x_e) and g_j(x_e) denote the expert's system dynamics.

Assumption 2 Assume g_i ≠ g_j for i ≠ j in (1), where i, j ∈ N.

        Assumption 2 defines the multi-player game systems with heterogeneous control inputs.

Assume that each expert player j minimizes an underlying cost function of the form (2).

According to optimal control theory [33], the optimal control inputs u_je of the expert system (1) that minimize the cost functions (2) are given by (3),

where ∇V_je(x_e(t)) satisfy the coupled Hamilton-Jacobi equations (4).
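For concreteness, the standard forms consistent with the running cost and the update laws (9)-(11) of Algorithm 1 below can be sketched in LaTeX as follows; these are presumed reconstructions of the displays (1)-(4) (the cost symbol J_je is introduced here only for readability), not verbatim quotations of the paper:

% Presumed forms of (1)-(4), inferred from (9)-(11); a sketch, not the paper's verbatim displays.
\begin{align}
  \dot{x}_e(t) &= f(x_e) + \sum_{j=1}^{N} g_j(x_e)\, u_{je}(t), \tag{1}\\
  J_{je} &= \int_{0}^{\infty}\Big( q^{T}(x_e)\, Q_{je}\, q(x_e)
      + \sum_{i=1}^{N} u_{ie}^{T} R_{(ji)e}\, u_{ie} \Big)\,\mathrm{d}t, \tag{2}\\
  u_{je}(x_e) &= -\tfrac{1}{2}\, R_{(jj)e}^{-1}\, g_j^{T}(x_e)\, \nabla V_{je}(x_e), \tag{3}\\
  0 &= q^{T}(x_e)\, Q_{je}\, q(x_e) + \sum_{i=1}^{N} u_{ie}^{T} R_{(ji)e}\, u_{ie}
      + \nabla V_{je}^{T}(x_e)\Big( f(x_e) + \sum_{i=1}^{N} g_i(x_e)\, u_{ie} \Big),
      \quad j \in \mathcal{N}. \tag{4}
\end{align}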

        2.3 Learner system and imitation learning problem

Consider an N-player learner system (5), where x ∈ R^n and u_j ∈ R^{m_j} are the learner state and the control input of player j ∈ N, respectively. The dynamics of the learner are the same as the expert's. Let us set the cost function for player j as in (6).

Similar to the optimal control process in (1)-(4), the learner's optimal control inputs that minimize (6) are given by (7),

where ∇V_j(x) satisfy the coupled Hamilton-Jacobi equations (8).

If each learner player imitates the corresponding expert player's behavior, then both systems will have the same trajectories. To achieve this goal, we propose the following assumptions and the imitation learning problem.

Assumption 3 The learner knows q(·) and its own weight coefficients Q_j and R_ji, i, j ∈ N, in (6), but does not know the expert's weight coefficients Q_je and R_(ji)e, i, j ∈ N, in (2).

Assumption 4 The learner has access to the measurement data of the expert's state and inputs, i.e., (x_e, u_je), j ∈ N.

It is known that different weight coefficient pairs can yield the same optimal controls in IOC problems [34,35]. We thus give the following equivalent-weight definition from [31].

Definition 2 Any Q_j > 0 ∈ R^{n×n} and R_ji > 0 ∈ R^{m_i×m_i}, i, j ∈ N, in (8) that make (7) yield u_j^* = u_je and x = x_e, where u_je is defined in (3), are called equivalent weights to R_(ij)e and Q_je.

To guarantee the existence of the above equivalent weights given f, heterogeneous g_j, j ∈ N, and q, we make the following assumption.

Assumption 5 For the system with dynamics q(·), f(·), and g_j(·), there exists at least one equivalent weight to Q_je ≥ 0 with the selected R_ji > 0, i, j ∈ N.

Heterogeneous multi-player imitation learning problem Under Assumptions 1-5, given any R_ji > 0, the learner aims to infer an equivalent weight to Q_je for (6) for each player j ∈ N without knowing the system dynamics, such that it mimics the expert's demonstrations.
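A formalization of this goal consistent with Definition 2 (a presumed statement, not a verbatim quotation of the omitted display) is that the learner's states and inputs reproduce the expert's,

\[
  x(t) \to x_e(t), \qquad u_j(x(t)) \to u_{je}(x_e(t)), \qquad \forall\, j \in \mathcal{N},
\]

even when the learner and the expert start from different initial states (cf. the second contribution in Sect. 1).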

3 Model-based inverse RL for heterogeneous multi-player games

To solve the imitation learning problem for heterogeneous multi-player game systems, this section presents a model-based inverse RL algorithm to find an equivalent weight Q̄_j, given R_ji > 0, using expert demonstrations (x_e, u_1e, ..., u_Ne). The algorithm has a two-loop iteration structure, where the inner loop is an RL process that learns the optimal control and the outer loop is an IOC process that updates the weight coefficients.

        3.1 Algorithm design

We first design the model-based inverse RL Algorithm 1, presented below. It has two iteration loops. We select R_ji > 0 and keep them fixed. The inner iteration loop, (9) and (10), is an RL process that updates the value function and control input for each player j so that they are optimally associated with the current weight coefficients Q_j^s and R_ji > 0. The converged solutions u_j^s are the optimal solutions to (7)-(8) at each outer-loop iteration. The outer-loop iteration, i.e., (11), based on IOC, then updates Q_j^s toward the equivalent weight Q̄_j using the demonstrated expert trajectory data (x_e, u_je). Note that ρ_j ∈ (0,1] is a tuning parameter that adjusts the update speed.
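To make this two-loop structure concrete, the following Python sketch instantiates Algorithm 1 (listed below) for the linear-quadratic special case q(x) = x, f(x) = Ax, g_j(x) = B_j, in which the inner RL loop (9)-(10) reduces to coupled Lyapunov solves and the outer IOC loop applies (11). All names and matrices here are hypothetical placeholders; the sketch illustrates the iteration logic under these assumptions rather than reproducing the authors' implementation.

# Sketch of Algorithm 1 specialized to the LQ case q(x) = x, f(x) = A x, g_j(x) = B_j.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def inner_rl_loop(A, B, Q, R, K, tol=1e-9, max_iter=500):
    """Joint policy iteration (9)-(10) for all players with the current weights Q[j].
    B: list of input matrices B_j; R[j][i]: weight R_ji; K: current gains (u_j = -K_j x)."""
    K = [Ki.copy() for Ki in K]
    P_prev = None
    for _ in range(max_iter):
        A_c = A - sum(Bi @ Ki for Bi, Ki in zip(B, K))          # closed-loop drift
        P = []
        for j in range(len(B)):
            cost = Q[j] + sum(Ki.T @ R[j][i] @ Ki for i, Ki in enumerate(K))
            # Policy evaluation (9): A_c^T P_j + P_j A_c + cost = 0
            P.append(solve_continuous_lyapunov(A_c.T, -cost))
        # Policy improvement (10): K_j <- R_jj^{-1} B_j^T P_j (grad of x^T P x is 2 P x)
        K = [np.linalg.solve(R[j][j], B[j].T @ P[j]) for j in range(len(B))]
        if P_prev is not None and max(np.linalg.norm(Pj - Pp)
                                      for Pj, Pp in zip(P, P_prev)) <= tol:
            break
        P_prev = P
    return K, P

def algorithm_1(A, B, R, K_expert, rho=0.5, eps=1e-6, max_outer=200):
    """Outer IOC loop (11): update each Q_j until the learner's gains match the expert's."""
    N, n = len(B), A.shape[0]
    Q = [np.eye(n) for _ in range(N)]              # initial Q_j^0 > 0
    K = [np.zeros((Bj.shape[1], n)) for Bj in B]   # assumes A is stable so zero gains are admissible
    for _ in range(max_outer):
        K, _ = inner_rl_loop(A, B, Q, R, K)
        if max(np.linalg.norm(K[j] - K_expert[j]) for j in range(N)) <= eps:
            break
        for j in range(N):
            dK = K[j] - K_expert[j]
            # Outer weight update (11) in matrix form, since q(x) = x and u_j = -K_j x
            Q[j] = Q[j] + rho * dK.T @ R[j][j] @ dK
    return Q, K

In this LQ setting the pointwise update (11) collapses to the matrix recursion Q_j^{s+1} = Q_j^s + ρ_j (K_j^s - K_je)^T R_jj (K_j^s - K_je), which is what the outer loop above implements; in practice the expert gains K_je would be read off from the measured expert data (x_e, u_je).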

        3.2 Convergence and stability

Theorem 1 Algorithm 1 solves the imitation learning problem of heterogeneous multi-player systems by learning uniformly approximate solutions of the equivalent weights Q̄_j, j ∈ N, within a finite number of iteration steps.

Proof The inner loops are standard RL policy iterations, whose convergence at any outer iteration follows from [36]. Policy iteration has been proved to be a quasi-Newton method in a Banach space [37-39], which yields the converged (u_j^s, V_j^s), where V_j^s is the optimal value and u_j^s is the optimal control input at the outer iteration s.

We now prove the convergence of the outer loops. Equation (11) implies that

Algorithm 1 Model-based inverse RL for heterogeneous multi-player games
1: Initialization: Select R_ji > 0, initial Q_j^0 > 0, stabilizing u_j^{00}, small thresholds e_j and ε_j for all j ∈ N, and ρ_j ∈ (0, 1]. Set s = 0.
2: Outer s-th iteration
3: Inner k-th iteration: Set k = 0.
4: Solve value functions by
   0 = q^T(x) Q_j^s q(x) + Σ_{i=1}^N (u_i^{sk})^T R_ji u_i^{sk} + (∇V_j^{sk}(x))^T ( f(x) + Σ_{i=1}^N g_i(x) u_i^{sk} ).   (9)
5: Update control inputs by
   u_j^{s(k+1)}(x) = -(1/2) R_jj^{-1} g_j^T(x) ∇V_j^{sk}(x).   (10)
6: Stop if ‖V_j^{sk} - V_j^{s(k-1)}‖ ≤ e_j, then set u_j^s = u_j^{sk}; otherwise set k ← k + 1 and go to step 4.
7: Outer s-th iteration: Update Q_j^{s+1} using (x_e, u_1e, ..., u_Ne) by
   q^T(x_e) Q_j^{s+1} q(x_e) = ρ_j ( u_je(x_e) - u_j^s(x_e) )^T R_jj ( u_je(x_e) - u_j^s(x_e) ) + q^T(x_e) Q_j^s q(x_e).   (11)
8: Stop if ‖u_j^s - u_je‖ ≤ ε_j; otherwise set u_j^{(s+1)0} = u_j^s, s ← s + 1, and go to step 3.

It shares similar principles of sequential iteration in, for example, numerical algorithms [40]. Given the initial Q_j^0 > 0, (12) implies

        and

        which implies

        4 Model-free inverse RL for heterogeneous multi-player games

The inverse RL Algorithm 1 needs to know the heterogeneous multi-player system dynamics f, g_1, ..., g_N, which may not be fully known in reality. To address this, we develop a completely model-free data-driven inverse RL algorithm in this section. The algorithm leverages the off-policy RL technique and neural networks. It uses only measurement data of (x_e, u_1e, ..., u_Ne) to find a Q̄_j that optimally yields u_je in (3) for the imitation learning problem.

        4.1 Off-policy inverse RL design

        Based on Algorithm 1, we present two steps to design the model-free data-driven inverse RL algorithm.

Step 1: Find a model-free equation that replaces (9) and (10). Off-policy integral RL [41,42] is used in Algorithm 1's inner loops. Toward this end, we rewrite (8) as

where … are the updating control inputs.

In addition, we have

Putting (19) into (20) and then considering (9), we have

Based on [25], we define the following control inputs for the purpose of model-free off-policy heterogeneous multi-player RL

Remark 1 A simple understanding of the u_ji in (22) with j ≠ i is that they are auxiliary variables used to solve for u_ji in (22) with j = i. Note that R_ji ∈ R^{m_i×m_i}.
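Following the off-policy multi-player construction in [25], the auxiliary inputs in (22) are typically built from the gradient of player j's value function along player i's input channel. A presumed form (an assumption about (22), up to the sign and scaling convention used there) is

\[
  u_{ji}^{s(k+1)}(x) \;=\; -\tfrac{1}{2}\, R_{ji}^{-1}\, g_i^{T}(x)\, \nabla V_j^{sk}(x), \qquad i, j \in \mathcal{N},
\]

so that the case j = i recovers the policy update (10), and R_ji ∈ R^{m_i×m_i} is dimensionally consistent, as noted in Remark 1.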

        The model-free inverse RL algorithm for imitation learning in heterogeneous multi-player games is shown below.

Theorem 2 Algorithm 2 has the same solutions as Algorithm 1.

Proof Taking the limit of (24) as the time interval δt → 0 yields

        One has

Algorithm 2 Model-free inverse RL for imitation learning in heterogeneous multi-player games
1: Initialization: Select R_ji > 0, initial Q_j^0 > 0, stabilizing u_j^{00}, small thresholds e_j and ε_j, and ρ_j ∈ (0, 1] for all j ∈ N. Set s = 0.
2: Outer s-th iteration
3: Inner k-th iteration: Set k = 0.
4: Solve value functions and control inputs for each player j by
   V_j^{sk}(x(t + δt)) - V_j^{sk}(x(t)) - ∫_t^{t+δt} Σ_{i=1}^N (u_ji^{s(k+1)})^T R_ji ( u_i - u_i^{sk} ) dτ = -∫_t^{t+δt} ( q^T(x) Q_j^s q(x) + Σ_{i=1}^N (u_i^{sk})^T R_ji u_i^{sk} ) dτ.   (24)
5: Stop if ‖V_j^{sk} - V_j^{s(k-1)}‖ ≤ e_j, then set u_j^s = u_j^{s(k+1)}; otherwise set k ← k + 1 and go to step 4.
6: Outer s-th iteration: Update Q_j^{s+1} using measurement data of (x_e, u_je) by (11).
7: Stop if ‖u_j^s - u_je‖ ≤ ε_j; otherwise set u_j^{(s+1)0} = u_j^s, s ← s + 1, and go to step 3.

which implies that (24) is equivalent to (9). In addition, the outer-loop formulation is the same. This shows that Algorithms 2 and 1 have the same solutions. ■
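Equation (24) involves only quantities that can be measured along the system trajectory over [t, t + δt], which is what makes Algorithm 2 model-free. As an illustration, the Python sketch below (hypothetical names; a fixed sampling period, trapezoidal quadrature, and a linear-in-parameters control approximator are all assumptions, cf. (27)-(28)) assembles the data-dependent terms of (24) for player j from one sampled window; such rows are then stacked and solved by the least-squares machinery of Section 4.2.

# Sketch: assembling the measured terms of (24) for player j over one window [t, t + delta_t].
import numpy as np

def window_regressors(xs, us, us_k, Q_j, R_j_row, q, psi, dt):
    """xs: (T, n) sampled states; us: list of (T, m_i) measured behavior inputs u_i;
    us_k: list of callables x -> u_i^{sk}(x) (current inner-loop policies);
    q, psi: basis functions for the state cost and the control approximator.
    Returns the right-hand-side integral of (24) and, per player i, the regressor that
    multiplies the unknown weights of u_ji^{s(k+1)} (assumed linear in parameters)."""
    rhs_vals = []
    reg_vals = [[] for _ in us]
    for t_idx in range(xs.shape[0]):
        x = xs[t_idx]
        qx = q(x)
        running = qx @ Q_j @ qx                      # q^T(x) Q_j^s q(x)
        for i, (ui_meas, ui_pol) in enumerate(zip(us, us_k)):
            ui_k = ui_pol(x)
            running += ui_k @ R_j_row[i] @ ui_k      # (u_i^{sk})^T R_ji u_i^{sk}
            # Kronecker-structured regressor for the u_ji term in (24)
            reg_vals[i].append(np.kron(psi(x), R_j_row[i] @ (ui_meas[t_idx] - ui_k)))
        rhs_vals.append(running)
    rhs = -np.trapz(rhs_vals, dx=dt)                             # right-hand side of (24)
    regs = [np.trapz(np.asarray(r), dx=dt, axis=0) for r in reg_vals]
    return rhs, regs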

        4.2 Neural network-based implementations

Directly solving Algorithm 2 is difficult. Therefore, we implement it via neural networks (NNs). We design two NN-based approximators: one for the value functions V_j^{sk} in (24) and one for the control inputs u_ji^{s(k+1)} in (22). According to [43,44], the two approximators are defined as

With the approximators (27) and (28), we rewrite (24) with residual errors as

Using the Kronecker product, (29) is rewritten as

        where

Based on (30) and (31), (29) can be solved using batch least squares as

        which can be rewritten as

        where

To apply the batch least-squares method, we define

Similarly, we sample ω ≥ n(n+1)/2 data tuples in (35) such that rank(X_e^T X_e) = n(n+1)/2. We thus uniquely solve for the unknown weight vector by
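The rank condition is what makes the stacked regression uniquely solvable. A minimal Python sketch of the batch least-squares step (hypothetical shapes and names, not the authors' code) is:

# Sketch: batch least squares with the rank condition rank(X_e^T X_e) = n(n+1)/2.
import numpy as np

def batch_least_squares(X_e, y_e, n):
    """X_e: (omega, n*(n+1)//2) stacked regressors built from expert data;
    y_e: (omega,) stacked targets. Returns the unique least-squares solution
    if the rank condition holds."""
    p = n * (n + 1) // 2
    if X_e.shape[0] < p:
        raise ValueError("need omega >= n(n+1)/2 data tuples")
    gram = X_e.T @ X_e
    if np.linalg.matrix_rank(gram) < p:
        raise ValueError("rank(X_e^T X_e) = n(n+1)/2 not satisfied; collect richer expert data")
    # Unique solution of the normal equations (X_e^T X_e) w = X_e^T y_e
    return np.linalg.solve(gram, X_e.T @ y_e)

def vem_to_symmetric(w, n):
    """Rebuild the symmetric matrix whose vem(.) (upper-triangular stacking, Section 2.1) is w."""
    Q = np.zeros((n, n))
    Q[np.triu_indices(n)] = w
    return Q + Q.T - np.diag(np.diag(Q))

If the measured expert trajectory is not sufficiently exciting, X_e^T X_e loses rank, which is why at least n(n+1)/2 distinct data tuples are required.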

        5 Simulations

        This section verifies the model-free inverse RL Algorithm 2 with both linear and nonlinear multi-player Nash game systems.

        5.1 Linear systems

        Consider the expert as a 3-player Nash game system with dynamics matrices

        The learner has identical dynamics.

        We assume the cost function weights of the expert are

Fig. 1 a Convergence of control input NN weights to K_je. b Convergence of control input NN weights, where i ≠ j and i, j ∈ {1, 2, 3}

which yield the optimal control feedback gains (in terms of u_je = -K_je x_e)

        5.2 Nonlinear systems

        Consider the expert as a 3-player Nash game system with dynamics matrices

Fig. 2 a Convergence of cost function weights. b Trajectories of the expert and the learner using the converged control policies

Fig. 3 a Convergence of control input NN weights. b Convergence of cost function weights, where j ∈ {1, 2, 3}. c Trajectories of the expert and the learner using the converged control policies

        6 Conclusions

This paper studies inverse RL algorithms for imitation learning in multi-player game systems whose players have heterogeneous control input dynamics. We propose a model-free data-driven algorithm that reconstructs the cost functions of a demonstrated multi-player system and reproduces the expert's trajectories. The algorithm is implemented via value function approximators. Simulation examples verify the effectiveness of the proposed methods. In the future, we will consider extending this work to inverse RL for multi-agent systems with multiple control players.
