
        Hybrid Q-learning for data-based optimal control of non-linear switching system


LI Xiaofeng, DONG Lu, and SUN Changyin*

1. School of Automation, Southeast University, Nanjing 210096, China; 2. School of Artificial Intelligence, Anhui University, Hefei 230601, China; 3. School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China

Abstract: In this paper, the optimal control of non-linear switching systems is investigated without knowing the system dynamics. First, the Hamilton-Jacobi-Bellman (HJB) equation is derived with consideration of the hybrid action space. Then, a novel data-based hybrid Q-learning (HQL) algorithm is proposed to find the optimal solution in an iterative manner. In addition, theoretical analysis is provided to illustrate the convergence and optimality of the proposed algorithm. Finally, the algorithm is implemented with the actor-critic (AC) structure, and two linear-in-parameter neural networks are utilized to approximate the functions. Simulation results validate the effectiveness of the data-driven method.

Keywords: switching system, hybrid action space, optimal control, reinforcement learning, hybrid Q-learning (HQL).

1. Introduction

In contrast to conventional non-linear systems, the dynamics of a switching system can be described by the interaction of a discrete switching policy and several continuous subsystems [1]. The properties of stability, controllability, and observability have been well studied in the existing literature [2,3]. Besides stability, optimality is preferable when designing controllers for real-world applications. The optimal control problem aims to find an admissible policy that stabilizes the controlled system while optimizing a predefined performance index [4]. In general, the optimal solution can be obtained by solving the corresponding Hamilton-Jacobi-Bellman (HJB) equation. The family of classical methods includes the variational method, Pontryagin's maximum principle, and dynamic programming [5]. In particular, for discrete-time dynamic systems, the dynamic programming method has been successfully applied in many fields of engineering. However, it suffers from the "curse of dimensionality", so the computation cost grows rapidly as the system dimension increases [6].

In recent years, the optimal control problem of switching systems has attracted much attention, since many real-world applications, from aerospace systems to traffic signal control, can be modeled as switching systems [7-10]. In general, the related work can be divided into two categories. The switching system with autonomous subsystems has received much attention from researchers. Without considering the control input, the task is simplified to finding the optimal switching schedule. A class of gradient projection-based methods has been proposed for general continuous-time non-linear hybrid systems, in which a local minimum of the cost function is found along the direction of the gradient [11,12]. In [13], researchers considered the optimal scheduling problem of linear switching systems with a pre-specified mode sequence; the optimal switching time instants are determined by using the calculus of variations. Note that these non-linear programming-based methods require the active mode sequence to be fixed and known, so the planning process must be re-computed whenever the initial state changes.

As for switching systems with controlled subsystems, it is required to co-design the switching policy and the control policies of the subsystems to optimize the performance function. In [14], a direct search scheme based on the Luus-Jaakola optimization technique was proposed to address the optimal control of general switched linear quadratic systems; again, the sequence of active modes is pre-fixed, which simplifies the control problem of non-autonomous switching systems. In addition, a two-point boundary-value differential algebraic equation (DAE) is solved to explore the optimal solutions numerically [15]. In [16-18], discretization-based algorithms were proposed which divide the state and input spaces into a finite number of options. However, the above planning-based algorithms also suffer from high computation cost and a limited range of initial states.

Recently, reinforcement learning (RL) methods have been utilized to learn the optimal policy of a Markov decision process (MDP) by interacting with the environment [19-21]. The actor-critic (AC) structure is commonly employed to implement such algorithms, where the critic network approximates the value function and the actor network approximates the control policy [22-24]. Value iteration [25] and policy iteration [26] are two typical classes of model-based methods, which require accurate knowledge of the system dynamics. In [27], researchers proposed a model-free algorithm for the optimal control of unknown non-linear systems by pre-training a model network. In addition, a series of data-based schemes were proposed to learn the optimal policy entirely from interactive data [28-31]. Considering its adaptive property and feedback formulation, RL has also been applied to determine the optimal scheduling of switching systems with good performance. In [32], the problem of multi-therapeutic human immunodeficiency virus treatment was formulated as finding the optimal solution of a finite-horizon autonomous switching system. The optimal value function was learned by a value iteration (VI) based method; the decision can then be made by simply comparing several scalar values. Moreover, researchers extended this work to general autonomous non-linear switching systems with a rigorous convergence proof [33]. In addition, switching cost penalties and minimum dwell time constraints were considered in [34,35]. Note that the systems in [32-35] all take finite-horizon objective functions with terminal state constraints. However, the controlled switching system is rarely studied in RL control design. In [36], researchers proposed a model-based algorithm to co-design the optimal policy of a non-linear switching system with control constraints. In [37], a neural network was first trained to learn the model, and an iterative algorithm was designed to generate a sequence of Q-functions which finally converges to the optimal solution.

Considering the complex dynamics of controlled switching systems, it is rather difficult to obtain the exact system dynamics. While the model can be identified by training a model network, the resulting model error cannot be neglected. In this paper, a novel hybrid RL algorithm is proposed to learn the optimal policy of non-linear switching systems. The main contributions are as follows.

(i) Considering the hybrid policy consisting of a discrete switching signal and a continuous control input, the corresponding HJB equation is constructed based on Bellman's optimality principle.

        (ii) An iterative RL algorithm is proposed to find the optimal hybrid policy without knowing the system dynamics or pre-training the model network.

        (iii) The convergence proof of iterative Q-functions is provided.

The rest of this paper is organized as follows. In Section 2, we first analyse the hybrid nature of the action space and derive the transformed HJB equation. Section 3 presents the design of the hybrid RL algorithm as well as the detailed implementation with the AC structure. The convergence proof is given in Section 4. In Section 5, two numerical examples are provided to demonstrate the performance of the proposed method. Finally, conclusions are provided in Section 6.

2. Problem formulation

Consider the general non-linear switching system with the following dynamics:

$$x_{k+1} = f_{v}(x_k, u_k), \qquad v \in \mathcal{P} \tag{1}$$

where $x_k \in \Omega_x \subset \mathbb{R}^n$ and $u_k \in \Omega_u \subset \mathbb{R}^m$ denote the system state and control parameters, respectively. The subscript $k$ denotes the index of the time step. Both $\Omega_x$ and $\Omega_u$ are compact and connected sets. The notation $v$ denotes the index of the active subsystem, and there are $P$ subsystems in total. The notation $\mathcal{P} = \{1, 2, \cdots, P\}$ denotes the set of available subsystems. It is assumed that $f_v: \Omega_x \times \Omega_u \rightarrow \Omega_x$ is Lipschitz continuous with $f_v(0, 0) = 0$.

In contrast to conventional non-linear systems, the controller of a switching system needs to co-design the switching signal and the control input. Consequently, the control signal at each time step is a tuple $(v, u_v)$, where the subscript of $u_v$ denotes the coupling between the active mode and the control input. Then, the action space can be formulated by

$$\mathcal{A} = \{(v, u_v) : v \in \mathcal{P},\ u_v \in \Omega_u\}.$$
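For illustration, the following minimal sketch represents one step of a switched system under a hybrid action $(v, u_v)$. The two subsystems f1 and f2 are hypothetical placeholders introduced only to make the hybrid action concrete; they are not the systems considered later in Section 5.

```python
import numpy as np

# Hypothetical two-mode switched system, used only for illustration;
# the subsystem dynamics studied in this work are not reproduced here.
def f1(x, u):
    return np.array([0.9 * x[0] + 0.1 * x[1], 0.1 * x[0] + u[0]])

def f2(x, u):
    return np.array([x[1], -0.2 * x[0] + 0.5 * u[0]])

SUBSYSTEMS = {1: f1, 2: f2}  # the set of available modes, P = {1, 2}

def step(x, action):
    """Apply one hybrid action (v, u_v): select subsystem v, then apply its input u_v."""
    v, u_v = action
    return SUBSYSTEMS[v](x, u_v)

# One transition from x = [5, -5]^T under mode 1 with input 0.3.
x_next = step(np.array([5.0, -5.0]), (1, np.array([0.3])))
```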

Afterwards, the performance function is defined as follows:

where the cost function is defined by $U(x, v, u) = x^{\mathrm{T}} Q_v x + u^{\mathrm{T}} R_v u$, in which $Q_v \in \mathbb{R}^{n \times n}$ and $R_v \in \mathbb{R}^{m \times m}$ are positive definite matrices.
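A brief sketch of the mode-dependent stage cost, and of accumulating it along a closed-loop trajectory, is given below. The weight matrices, the helper names (stage_cost, rollout_cost), and the undiscounted finite truncation of the performance sum are illustrative assumptions.

```python
import numpy as np

# Illustrative mode-dependent cost weights; placeholders, not values used in this work.
Q_W = {1: np.eye(2), 2: 2.0 * np.eye(2)}
R_W = {1: np.eye(1), 2: np.eye(1)}

def stage_cost(x, v, u):
    """U(x, v, u) = x^T Q_v x + u^T R_v u."""
    return float(x @ Q_W[v] @ x + u @ R_W[v] @ u)

def rollout_cost(x0, policy, step, horizon=200):
    """Accumulate stage costs along a closed-loop trajectory; a finite truncation
    of the (assumed) undiscounted infinite-horizon performance sum."""
    x, total = x0, 0.0
    for _ in range(horizon):
        v, u = policy(x)
        total += stage_cost(x, v, u)
        x = step(x, (v, u))
    return total
```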

Let $\pi_v(x)$ and $\pi_{u_v}(x)$ denote the policies of the discrete action and its corresponding continuous parameters, respectively. For notational clarity, we utilize $\pi(x)$ to represent the hybrid control policy, i.e., $\pi(x) = (\pi_v(x), \pi_{u_v}(x))$. In order to derive the algorithm, we first introduce a Q-function with respect to any given policy $\pi(x)$ as follows:

That is to say, the Q-function denotes the accumulated cost if the system starts in state $x$, takes an arbitrary hybrid action $(v, u)$, and then takes the hybrid actions generated by the hybrid policy $\pi(x)$ thereafter.

Afterwards, based on Bellman's optimality principle [38], the corresponding HJB equation can be obtained:

where $Q^*(x, v, u)$ denotes the optimal Q-function associated with $\pi^*(x)$. For notational simplicity, we let $x$, $v$, and $u$ denote the current state, discrete action, and continuous parameters, while $x'$, $v'$, and $u'$ denote the state, discrete action, and continuous parameters at the next time step, respectively.
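Under the Q-function and cost definitions of this section, the HJB equation takes the standard hybrid Bellman optimality form sketched below (a reconstruction under the stated setup, not necessarily the exact displayed notation):

$$Q^{*}(x, v, u) = U(x, v, u) + \min_{v' \in \mathcal{P}} \, \inf_{u' \in \Omega_u} Q^{*}(x', v', u'), \qquad x' = f_{v}(x, u),$$

with the optimal hybrid policy recovered as $\pi^{*}(x) = \arg\min_{(v, u)} Q^{*}(x, v, u)$.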

3. Hybrid Q-learning algorithm and its convergence analysis

In order to co-design the policies of the switching signal and the control input, a novel hybrid Q-learning (HQL) algorithm is proposed in this section. In addition, the implementation details of the AC structure are provided, using linear-in-parameter (LIP) neural networks (NNs) as function approximators.

        3.1 HQL algorithm

The algorithm starts with the initial Q-functions, i.e., $Q^0_v(x, v, u) = 0$, $\forall v \in \mathcal{P}$. For each $v \in \mathcal{P}$, its corresponding continuous-parameter policy can be obtained by taking the infimum over $\Omega_u$:

Then, the Q-function can be updated by

For $i = 1, 2, \cdots$, one iterates between

Consequently, the HQL algorithm generates a sequence of Q-functions that converges to the optimal solution of (5). Once the optimal Q-function is obtained, the optimal continuous policy can be computed by substituting it into (8), while the optimal discrete action can be simply determined by comparing the values of the different Q-functions. The convergence proof is given in Section 4.
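As a concrete illustration of the iteration described above, the following sketch performs one fitted hybrid Q-iteration over a buffer of transitions. The helper names (buffer, q_funcs, fit, u_grid) and the finite grid approximation of the infimum over $\Omega_u$ are assumptions made for illustration; the actual implementation uses LIP NNs, as described in Section 3.2.

```python
import numpy as np

def hql_iteration(buffer, q_funcs, fit, u_grid):
    """One HQL-style iteration sketched as fitted hybrid Q-iteration:
    for every stored transition (x, v, u, cost, x_next), build the target
    cost + min over v' and u' of Q^i(x', v', u'), then refit Q^{i+1}_v per mode.

    buffer : list of (x, v, u, cost, x_next) tuples
    q_funcs: dict mapping mode v -> current Q-function q(x, u)
    fit    : regression routine mapping (features, targets) -> new Q-function
    u_grid : finite grid approximating the continuous input set Omega_u
    """
    inputs = {v: [] for v in q_funcs}
    targets = {v: [] for v in q_funcs}
    for x, v, u, cost, x_next in buffer:
        # Bellman backup: greedy over both the next mode and the next input.
        best_next = min(
            min(q(x_next, u_next) for u_next in u_grid)
            for q in q_funcs.values()
        )
        inputs[v].append(np.concatenate([np.atleast_1d(x), np.atleast_1d(u)]))
        targets[v].append(cost + best_next)
    # Refit one Q-function per mode on its own targets.
    return {v: fit(np.array(inputs[v]), np.array(targets[v])) for v in q_funcs}
```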

        3.2 Implementation with the AC structure

Since the state space is continuous, LIP NNs are employed as function approximators. Specifically, for each mode, there exists a corresponding actor network and critic network.

Let $Q_v(x, v, u; W_{c,v})$ denote the output of the critic network so that

and let $\mu_v(x; W_{a,v})$ denote the output of the actor network so that

where $W_{c,v}$ and $W_{a,v}$ denote the weights of the critic and actor networks, respectively. In addition, $\phi_v(\cdot)$ and $\sigma_v(\cdot)$ represent the activation functions of the critic and actor networks, respectively.
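A minimal sketch of the LIP approximators is given below. The quadratic monomial basis is an illustrative assumption (Remark 1 only states that linearly independent polynomial basis functions are used), and the helper names poly_basis, lip_critic, and lip_actor are hypothetical.

```python
import numpy as np

def poly_basis(z):
    """Illustrative polynomial basis: a constant, the entries of z, and the
    distinct quadratic monomials of z. Not the exact basis used in this work."""
    z = np.atleast_1d(z)
    quad = np.outer(z, z)[np.triu_indices(len(z))]
    return np.concatenate([[1.0], z, quad])

def lip_critic(x, u, W_c):
    """Linear-in-parameter critic: Q_v(x, v, u; W_{c,v}) = W_{c,v}^T phi_v(x, u)."""
    return float(W_c @ poly_basis(np.concatenate([np.atleast_1d(x), np.atleast_1d(u)])))

def lip_actor(x, W_a):
    """Linear-in-parameter actor: mu_v(x; W_{a,v}), with W_{a,v} stored as an
    (m x K) weight matrix acting on the basis sigma_v(x)."""
    return W_a @ poly_basis(x)
```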

Let $\mathcal{D}$ denote a data buffer with memory size $M$. To begin with, the HQL algorithm samples a number of transitions from the state and hybrid action spaces and stores them in $\mathcal{D}$. Specifically, according to a uniform random distribution, we sample $M$ states from $\Omega_x$ and $M$ parameters from $\Omega_u$. By substituting $(x_d, v_d, u_d)$ into (1), one obtains the corresponding cost $U_d$ and next state $x_{d+1}$. Then, the transition tuples $(x_d, v_d, u_d, U_d, x_{d+1})$ are stored in $\mathcal{D}$. Note that even with the same $x_d$ and $u_d$, by selecting different $v_d$, one obtains different $U_d$ and $x_{d+1}$, so that all subsystems can be explored sufficiently.
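The data-collection step can be sketched as follows, reusing the hypothetical step and stage_cost helpers from the earlier sketches. The uniform sampling and the per-mode replay of each sampled $(x_d, u_d)$ pair follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_buffer(M, modes, x_low, x_high, u_low, u_high, step, stage_cost):
    """Uniformly sample M states from Omega_x and M inputs from Omega_u, then
    store the transition (x_d, v_d, u_d, U_d, x_{d+1}) for every mode v_d so
    that all subsystems are explored with the same (x_d, u_d) pairs."""
    buffer = []
    for _ in range(M):
        x = rng.uniform(x_low, x_high)
        u = rng.uniform(u_low, u_high)
        for v in modes:
            buffer.append((x, v, u, stage_cost(x, v, u), step(x, (v, u))))
    return buffer
```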

First, the critic networks are initialized with $Q^0_v(x, v, u; W_{c,v}) = 0$. For each mode, a batch of transitions is randomly sampled from $\mathcal{D}$, where $B$ denotes the batch size. According to (8), for any iteration $i$, the target value of the actor network is

Then, by using the least-squares method (LSM), the weights of the actor network can be computed by

Afterwards, the target value of the critic network can be obtained by

Consequently, by using the LSM, the weights of the critic network can be computed by

Remark 1  By using LIP NNs with linearly independent polynomial basis functions, the weights can be updated as a one-shot solution based on the LSM at each iteration step. Since the training is an iterative process, this significantly accelerates the convergence procedure. In addition, it is worth noting that the proposed algorithm is not limited to LIP NNs; one can utilize multilayer perceptrons or even deep NNs to improve the approximation capability.
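The one-shot LSM update mentioned in Remark 1 amounts to a batch linear least-squares fit of the LIP weights to the current targets. A minimal sketch, using numpy's solver rather than the explicit normal equations, is given below; when the feature matrix has full column rank, it coincides with the normal-equation solution $(\Phi^{\mathrm{T}}\Phi)^{-1}\Phi^{\mathrm{T}} y$.

```python
import numpy as np

def lsm_fit(features, targets):
    """One-shot least-squares solution for LIP network weights:
    W = argmin_W || features @ W - targets ||^2.

    features: (B, K) matrix whose rows are basis vectors phi(x_d, u_d)
    targets : (B,) or (B, m) array of target values from the current iteration
    """
    W, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return W
```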

Motivated by [20], target networks are utilized to stabilize the training process. The detailed implementation steps of the HQL algorithm are given in Algorithm 1.

4. Convergence analysis

The convergence proof is derived by extending the theoretical analysis in [36]. Before proceeding, the following definition and assumption are given.

Definition 1 [36]  The hybrid policy $(\pi_v(x), \pi_{u_v}(x))$ is defined to be admissible within $\Omega_x$ if there exists an upper bound $Z(x)$ for its performance function.

According to Lemma 1, we have

5. Numerical analysis

Two simulation examples are provided to evaluate the performance of the HQL algorithm. The code is run in Matlab 2018a on an Intel Core i7 3.2 GHz processor.

        (i) Example 1

First, the HQL algorithm is applied to a linear switching system with two modes:

To begin with, for each subsystem, 500 transitions are randomly sampled from the state and action spaces. During the iteration process, a batch of 300 samples is randomly selected from the data buffer to train the networks. The maximum iteration number is 100, and the training process is completed once the convergence criterion with tolerance $10^{-6}$ is satisfied for all $v \in \{1, 2\}$. The evolution of the critic network weights is shown in Fig. 1 and Fig. 2, which verifies the convergence proof. It is shown that the elements converge after five iteration steps. The training process takes 0.6199 s.
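The stopping rule described above (at most 100 iterations, per-mode tolerance $10^{-6}$) can be sketched as follows. Using the change of the critic weights as the convergence quantity is an assumption, since the exact criterion is not reproduced in the text above.

```python
import numpy as np

def train_until_converged(update, W_init, max_iters=100, tol=1e-6):
    """Iterate HQL updates until the per-mode weight change drops below tol
    for every mode (assumed convergence quantity) or max_iters is reached.

    update: one HQL iteration, mapping {v: W_v} -> {v: W_v}
    W_init: dict of initial critic weights per mode
    """
    W = W_init
    for i in range(1, max_iters + 1):
        W_new = update(W)
        if all(np.linalg.norm(W_new[v] - W[v]) <= tol for v in W):
            return W_new, i
        W = W_new
    return W, max_iters
```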

Fig. 1  Evolution process of critic network weight $W_{c,1}$

Fig. 2  Evolution process of critic network weight $W_{c,2}$

Let the initial state be $x_0 = [5, -5]^{\mathrm{T}}$. The trajectories of the system states and the hybrid control input under the trained policy are given in Fig. 3-Fig. 5, respectively. The states converge to the origin after six time steps.

Fig. 3  Trajectory of system state with $x_0 = [5, -5]^{\mathrm{T}}$

Fig. 4  Trajectory of continuous parameter with $x_0 = [5, -5]^{\mathrm{T}}$

Fig. 5  Trajectory of discrete action with $x_0 = [5, -5]^{\mathrm{T}}$

        (ii) Example 2

Next, a non-linear scalar system [34] is selected:

The domains of interest are selected as $\Omega_x = \{x \in \mathbb{R} : |x| \le 3\}$ and $\Omega_u = \{u \in \mathbb{R} : |u| \le 5\}$. In addition, the system is discretized by using the Euler method with $\Delta t = 0.005$ s. The cost function is defined as $U(x, v, u) = x^2 + u^2$. The activation functions of the LIP NNs are

To begin with, for each subsystem, 300 transitions are randomly sampled from the state and action spaces. During the iteration process, a batch of 200 samples is randomly selected from the data buffer to train the networks. The maximum iteration number is 100, and the training process is completed once the convergence criterion is satisfied for all $v \in \{1, 2\}$.
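The Euler discretization and the domains of interest used in this example can be sketched as follows. The continuous-time subsystem dynamics g1 and g2 are placeholders, since the scalar subsystems from [34] are not reproduced here; only the step size, domains, and cost follow the description above.

```python
import numpy as np

DT = 0.005  # Euler discretization step, as stated in the text

# Placeholder continuous-time scalar subsystems; the actual dynamics
# from [34] are not reproduced here.
def g1(x, u):
    return -x + u

def g2(x, u):
    return x - x**3 + u

MODES = {1: g1, 2: g2}

def euler_step(x, v, u):
    """Discretize x_dot = g_v(x, u) with a forward Euler step of size DT."""
    return x + DT * MODES[v](x, u)

def cost(x, v, u):
    """Stage cost U(x, v, u) = x^2 + u^2 for this example."""
    return x**2 + u**2

# Domains of interest: |x| <= 3, |u| <= 5 (used when sampling transitions).
X_MAX, U_MAX = 3.0, 5.0
```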

The evolution of the critic network weights is shown in Fig. 6 and Fig. 7, which verifies the convergence proof. It is shown that the elements converge after 150 iteration steps. The training process takes 1.7291 s.

Fig. 6  Evolution process of critic network weight $W_{c,1}$

Fig. 7  Evolution process of critic network weight $W_{c,2}$

Let the initial state be $x_0 = 3$ and apply the trained hybrid controller for 2.5 s. The trajectory of the system state is shown in Fig. 8, while Fig. 9 and Fig. 10 show the trajectories of the control input and the switching signal, respectively.

Fig. 8  Trajectory of system state with $x_0 = 3$

Fig. 9  Trajectory of continuous parameter with $x_0 = 3$

Fig. 10  Trajectory of discrete action with $x_0 = 3$

6. Conclusions

In this paper, a novel hybrid reinforcement learning algorithm with the AC structure is designed to find the co-designed optimal policy of controlled switching systems. The generated Q-functions iteratively converge to the optimal solution of the derived HJB equation without knowing or identifying the system dynamics. The effectiveness of the algorithm is verified by two numerical examples.
