Xin Chen · Fang Wang
Abstract In this paper, a stochastic linear quadratic optimal tracking scheme is proposed for unknown linear discrete-time (DT) systems based on an adaptive dynamic programming (ADP) algorithm. First, an augmented system composed of the original system and the command generator is constructed, and an augmented stochastic algebraic equation is then derived based on the augmented system. Next, to obtain the optimal control strategy, the stochastic case is converted into a deterministic one by system transformation, and an ADP algorithm is proposed together with a convergence analysis. To realize the ADP algorithm, three back propagation neural networks, including a model network, a critic network and an action network, are devised to approximate the unknown system model, the optimal value function and the optimal control strategy, respectively. Finally, the obtained optimal control strategy is applied to the original stochastic system, and two simulations are provided to demonstrate the effectiveness of the proposed algorithm.
Keywords Stochastic system · Optimal tracking control · Adaptive dynamic programming · Neural networks
As is well known, optimal tracking control (OTC) plays a significant role in the control field and is developing fast in both theory [1–4] and applications [5–7]. The aim of OTC is to design a controller that enables the output to track a reference trajectory by minimizing a predefined performance index. However, traditional OTC approaches, such as feedback linearization [1] and plant inversion [2], usually involve complex mathematical analysis and have difficulty controlling highly nonlinear plants. As for the linear quadratic tracking (LQT) problem, solutions can be obtained by solving an algebraic Riccati equation (ARE) for the feedback term and a noncausal difference equation for the feedforward term [8]. Nevertheless, it is worth pointing out that the methods mentioned above require a priori knowledge of the system dynamics. Therefore, it remains challenging to solve optimal tracking control problems with completely unknown system information.
The key point of OTC is to solve the nonlinear Hamilton–Jacobi–Bellman (HJB) equation, which is too complex to admit an analytical solution. Though dynamic programming (DP) is an effective method for solving the HJB equation, it is often computationally untenable due to the "curse of dimensionality" [9]. To approximate solutions of the HJB equation, adaptive dynamic programming (ADP) algorithms have been extensively employed and developed. Value iteration (VI) [10] and policy iteration (PI) [11] pave the way for the achievement of ADP algorithms. To handle unknown systems, researchers try to rebuild the model based on data-driven techniques [12]. By using input–output data, data-driven models, such as Markov models, neural network (NN) models and others, can replace system dynamics with an input–output mapping. For discrete-time (DT) systems, ADP algorithms have been proposed to deal with OTC problems relying on NN-based data-driven models [3,13]. As for continuous-time (CT) systems, a synchronous PI algorithm is applied to tackle OTC with unknown dynamics via rebuilding the system model [14]. However, model reconstruction methods [3,13–15] may be limited by modeling accuracy. To avoid this, a simultaneous policy iteration (SPI) algorithm is proposed to deal with the optimal control problem for partially unknown nonlinear systems [16], and the authors of [17] further extend the SPI algorithm to optimal control problems for completely unknown nonlinear systems based on the least squares method and the Monte Carlo integration technique. Also, the authors of [18] proposed a PI algorithm and a VI one to solve the LQT ARE online, depending only on measured input, output, and reference trajectory data. Besides, a Q-learning algorithm is proposed to obtain the optimal control by solving an augmented ARE, relying on neither the system dynamics nor the command generator dynamics [4].
Note that the aforementioned ADP-based schemes provide multiform approaches to the OTC problem; however, only noise-free cases are taken into consideration. In fact, the LQT problem exhibits an intrinsic nonlinear characteristic when the original system is subject to multiplicative noises, so standard tools for LQT cannot be applied directly. Although traditional adaptive control methods can guarantee good tracking performance for stochastic systems, the optimality of the system is usually ignored [19].
As we know, the stochastic linear quadratic (SLQ) optimal control problem is complicated due to the existence of multiplicative noises, but there is an equivalence between the feasibility of the SLQ optimal control problem and the solvability of the stochastic algebraic equation (SAE) [20]. Moreover, with the help of linear matrix inequalities [21], semidefinite programming [22], and the Lagrange multiplier theorem [23], solving the SLQ optimal control problem becomes easier. Nevertheless, the aforementioned schemes [20–23] work under the prerequisite that the system dynamics are completely known. To overcome the difficulty of an unknown model, the authors of [24] proposed an ADP algorithm to solve the SLQ optimal control problem based on three NN models. Moreover, the authors of [25] adopted a Q-learning algorithm to settle the SLQ optimal control problem for model-free DT systems, and the authors of [26] investigated a non-model-based ADP algorithm to address the optimal control problem for CT stochastic systems influenced by multiplicative noises.
To the best of our knowledge, there exist many ADP-based SLQ optimal control schemes, while SLQ optimal tracking control has received little attention. The SLQ optimal tracking control problem was investigated in [27,28]; however, only control-dependent noise was discussed, and the system dynamics had to be completely known in advance. When the model is unknown, there are substantial challenges in SLQ optimal tracking problems for stochastic systems with multiplicative noises. Besides, a non-stable command generator is taken into account in this paper, which makes the traditional mean-square concepts in terms of x_k in [24,25] no longer suitable, so the stability of the system cannot be guaranteed by them.
Facing the aforementioned difficulties, we propose an SLQ optimal tracking scheme for unknown models using the ADP algorithm. The main contributions can be summarized as follows:
(1) To solve the SLQ optimal tracking problem for unknown systems with multiplicative noises, an ADP algorithm is proposed in this paper, and a model-critic-action structure is introduced to obtain the optimal control strategy for stochastic systems whose dynamics are unknown.
(2) To ensure the stability of the system, mean-square concepts with respect to e_k are newly defined, and a discount factor is introduced into the cost function; then an augmented SAE is derived to obtain the optimal control based on the augmented system.
The rest of this paper is organized as follows. In Sect. 2, we give the problem formulation and conversion. In Sect. 3, we carry out the derivation and convergence proof of the VI ADP algorithm. In Sect. 4, we make use of back propagation neural networks (BPNN) to realize the ADP algorithm. In Sect. 5, two examples are given to illustrate the effectiveness of the proposed scheme. Finally, the conclusion is given in Sect. 6.
Consider the linear stochastic DT system described as follows:
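(a plausible reconstruction, with the output map G inferred from the later expression Q_1 = [G −I]^T Q [G −I] and the state- and control-dependent noise structure discussed in Sect. 5)

$$x_{k+1} = A x_k + B u_k + \left(C x_k + D u_k\right)\omega_k, \qquad y_k = G x_k, \tag{1}$$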
where x_k ∈ ℝ^n, u_k ∈ ℝ^m and y_k ∈ ℝ^p refer to the system state, control input and system output, respectively. The initial state of system (1) is x_0; A, C ∈ ℝ^{n×n} and B, D ∈ ℝ^{n×m} are given constant matrices. The one-dimensional stochastic disturbance sequence ω_k (k = 0, 1, 2, …, ω_0 = 0) is defined on the given probability space (Ω, F, P), which is a measure space with total measure equal to 1, that is, P(Ω) = 1. Moreover, Ω, F and P are the sample space, the set of events and the probability measure, respectively. The stochastic sequence is assumed to meet the following condition:
where F_k = σ{ω_k | k = 0, 1, 2, …} refers to the σ-algebra generated by ω_k; x_0 is independent of ω_k, k = 0, 1, 2, …, and E(·) denotes the mathematical expectation.
The tracking error is described by
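(presumably the output-reference difference)

$$e_k = y_k - r_k, \tag{2}$$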
where r_k is the reference trajectory.
Assumption 1 The reference trajectory for the SLQ optimal tracking problem is generated by the command generator
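(plausibly the linear autonomous form implied by the dynamics matrix F discussed below)

$$r_{k+1} = F r_k. \tag{3}$$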
A cost function is essential to measure the optimality in the SLQ optimal tracking problem. Therefore, the quadratic cost function to be optimized is denoted as
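(a plausible quadratic form over the tracking error and control)

$$J(e_0, u_k) = \mathbb{E}\left[\sum_{k=0}^{\infty}\left(e_k^{\mathrm T} Q e_k + u_k^{\mathrm T} R u_k\right)\right], \tag{4}$$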
where Q and R are a positive semidefinite symmetric matrix and a positive definite symmetric matrix, respectively.
The cost function (4) can usually be used only when F is Hurwitz. However, by adding a discount factor to (4), we can tackle the SLQ tracking control problem even for the case that the command generator dynamics F is not Hurwitz. Consider the discounted cost function as follows:
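(presumably (4) weighted by the discount factor)

$$J(e_0, u_k) = \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^{k}\left(e_k^{\mathrm T} Q e_k + u_k^{\mathrm T} R u_k\right)\right], \tag{5}$$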
where the discount factor satisfies 0 < γ ≤ 1. It is worth mentioning that γ = 1 can be used only when F in (3) is Hurwitz.
Considering that F is not Hurwitz in this paper, the mean-square definition in terms of x_k in [24] is no longer suitable. Thus we provide some new definitions.
Definition 1 u_k is said to be mean-square stabilizing at e_0 if there exists a linear feedback form of u_k such that, for every initial e_0, system (2) satisfies
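(in line with the mean square errors examined in the simulations, the defining condition plausibly reads)

$$\lim_{k\to\infty}\mathbb{E}\!\left(e_k^{\mathrm T} e_k\right) = 0.$$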
Definition 2 System (2) with a mean-square stabilizing feedback control is said to be mean-square stabilizable.
Definition 3 u_k is called admissible if it satisfies the following three conditions: first, it is an F_k-adapted and measurable stochastic process; second, it is mean-square stabilizing; third, it enables the cost function to reach its minimum value. All admissible controls are gathered in a set U_ad.
The goal of the SLQ optimal tracking control problem is to seek an admissible control which can not only minimize the cost function (5) but also stabilize system (2) for each initial state e_0, namely
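(a plausible statement of (6))

$$J\!\left(e_0, u_k^{*}\right) = \min_{u_k \in\, U_{ad}} J\!\left(e_0, u_k\right). \tag{6}$$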
To achieve the goal above, an augmented system including the system dynamics (1) and the reference trajectory dynamics (3) is first constructed as follows:
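(a standard stacking of state and reference; the exact block layout is an assumption)

$$X_{k+1} = \bar A X_k + \bar B u_k + \left(\bar C X_k + \bar D u_k\right)\omega_k, \qquad X_k = \begin{bmatrix} x_k \\ r_k \end{bmatrix}, \tag{7}$$

$$\bar A = \begin{bmatrix} A & 0 \\ 0 & F \end{bmatrix}, \quad \bar B = \begin{bmatrix} B \\ 0 \end{bmatrix}, \quad \bar C = \begin{bmatrix} C & 0 \\ 0 & 0 \end{bmatrix}, \quad \bar D = \begin{bmatrix} D \\ 0 \end{bmatrix}.$$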
Based on (7), the cost function (5) can be further written as
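(plausibly, using e_k = [G −I]X_k under the reconstruction above)

$$J(X_0, u_k) = \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^{k}\left(X_k^{\mathrm T} Q_1 X_k + u_k^{\mathrm T} R u_k\right)\right], \tag{8}$$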
where Q_1 = [G −I]^T Q [G −I].
Then, the optimal tracking control with a linear feedback form is given by
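(consistent with the admissible control u_k = KX_k appearing in Lemma 1)

$$u_k = K X_k, \tag{9}$$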
where the constant matrix K is regarded as a mean-square stabilizing control gain matrix if it satisfies Definition 1.
Therefore, the cost function (8) can be further transformed into the following form with respect to K, namely
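(plausibly, substituting (9) into (8))

$$J(X_0, K) = \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^{k} X_k^{\mathrm T}\left(Q_1 + K^{\mathrm T} R K\right)X_k\right]. \tag{10}$$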
Thus, the goal of the SLQ optimal tracking control problem (6) can be further expressed as
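(plausibly, a minimization over the gain)

$$V^{*}(X_0) = \min_{K} J(X_0, K). \tag{11}$$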
Definition 4 The SLQ optimal tracking control problem is considered well posed if
It is well known that there is an equivalence between the feasibility of the SLQ optimal control problem and the solvability of the SAE. Next, it is shown that the SLQ optimal tracking problem is well posed with the help of the augmented SAE. To this end, we first provide the following lemma.
Lemma 1 The SLQ optimal tracking control problem is well posed if there exists an admissible control u_k = KX_k ∈ U_ad, and the related value function is:
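(plausibly the quadratic form)

$$V(X_k) = X_k^{\mathrm T} P X_k,$$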
where the symmetric matrix P satisfies the following augmented SAE:
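(a plausible standard form, obtained by completing the square in the Bellman equation for V(X_k) = X_k^T P X_k, assuming unit-variance noise)

$$P = Q_1 + \gamma\left(\bar A^{\mathrm T} P \bar A + \bar C^{\mathrm T} P \bar C\right) - \gamma^{2}\left(\bar A^{\mathrm T} P \bar B + \bar C^{\mathrm T} P \bar D\right)\left(R + \gamma \bar B^{\mathrm T} P \bar B + \gamma \bar D^{\mathrm T} P \bar D\right)^{-1}\left(\bar B^{\mathrm T} P \bar A + \bar D^{\mathrm T} P \bar C\right). \tag{12}$$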
Then, the following assumptions are made to ensure the existence of admissible controls.
Assumption 2 The tracking error system (2) is mean-square stabilizable.
Assumption 3 The augmented system (7) is controllable.
It is well known that the ADP algorithm has achieved huge success in deterministic OTC designs [3,4,13–18], which inspires us to solve the SLQ optimal tracking problem by transforming the stochastic system into a deterministic one.
Accordingly, the cost function (10) is rewritten in a deterministic form
Remark 1 The deterministic system (20) is independent of the stochastic disturbance ω_k and is decided only by the initial state Z_0 and the control gain matrix K, which creates favorable conditions for the application of the ADP algorithm.
In this section, we propose a value iteration ADP algorithm to obtain the optimal control for the SLQ optimal tracking problem. We first provide the formula of the optimal control and the related SAE.
where P* satisfies the augmented SAE (12) and Z_k is the state of the deterministic system (19).
An essential condition for optimality is the first-order necessary condition. By taking the derivative of the optimal value function (22) with respect to K, we have the following HJB equation:
From Lemma 2, the SLQ optimal tracking problem can be effectively dealt with via the solution of the augmented SAE. The difficulty is that an analytical solution of the augmented SAE is usually hard to calculate, and doing so requires full knowledge of the system dynamics. It becomes impossible to solve the SAE when the system dynamics are totally unknown. To deal with this tricky SLQ optimal tracking problem with an unknown system, we provide a value iteration ADP scheme as follows.
Assume that the value function begins with the initial value V_0(·) = 0; then the initial control gain matrix K_0 can be calculated by
It is worth pointing out that i is the iteration index while k is the time index. Next, it is important to show the convergence proof of the proposed method, which iterates between (31) and (32).
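Although the exact forms of (31) and (32) are not restated here, a minimal numerical sketch of such a value iteration for the discounted stochastic LQ problem is given below. It assumes the augmented matrices of the reconstruction (7), unit-variance noise, and the standard update forms; the helper vi_adp is illustrative only, a model-based stand-in for the iteration (the NN realization in Sect. 4 removes the need for the matrices themselves).

import numpy as np

def vi_adp(Abar, Bbar, Cbar, Dbar, Q1, R, gamma, n_iter=500, tol=1e-10):
    # Value iteration sketch: V_i(Z) = Z^T P_i Z, starting from V_0(.) = 0.
    P = np.zeros((Abar.shape[0], Abar.shape[0]))
    K = np.zeros((Bbar.shape[1], Abar.shape[0]))
    for _ in range(n_iter):
        # Policy-improvement step (cf. (31)): minimize the one-step Bellman cost.
        S = R + gamma * (Bbar.T @ P @ Bbar + Dbar.T @ P @ Dbar)
        K = -gamma * np.linalg.solve(S, Bbar.T @ P @ Abar + Dbar.T @ P @ Cbar)
        # Value-update step (cf. (32)): expectation over the unit-variance noise.
        Acl, Ccl = Abar + Bbar @ K, Cbar + Dbar @ K
        P_new = Q1 + K.T @ R @ K + gamma * (Acl.T @ P @ Acl + Ccl.T @ P @ Ccl)
        if np.max(np.abs(P_new - P)) < tol:  # fixed point of the SAE reached
            P = P_new
            break
        P = P_new
    return P, K

Under Assumptions 2 and 3, the sketch mirrors the behavior established by Theorem 2: the value sequence grows monotonically toward the fixed point and the gain converges to a mean-square stabilizing one.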
Before proving the convergence, some lemmas are provided first.
Lemma 3 Let the value function sequence {V_i} be defined as in (32). Suppose that K is a mean-square stabilizing control gain matrix; then there exists a least upper bound such that 0 ≤ V_i(Z_k) ≤ V*(Z_k) ≤ Ω(Z_k), where the optimal value function V*(Z_k) is given in (22).
Considering both (40) and (41), we come to the conclusion that
Theorem 2 Assume that the sequences {K_i} and {V_i} are defined as in (31) and (32); then V_∞ = V* and K_∞ = K*, where K* is mean-square stabilizing.
Proof From the conclusion about the sequence {V_i} in Lemma 3, it follows that
According to the convergence proof, we know that during the value iteration based on the deterministic system Z_k, the proposed ADP algorithm leads to V_i → V* and K_i → K*. Since K* is mean-square stabilizing, for the stochastic system, the tracking error between the output and the reference signal is mean-square stable, that is, lim_{k→∞} E(e_k^T e_k) = 0.
We have proved that the value iteration ADP method converges to the optimal solution of the DT HJB equation. The proposed method can thus be realized by iterating between (31) and (32). In this section, we consider how to achieve the proposed scheme without knowing the system dynamics.
To achieve this, we apply three BPNNs: a model network for the unknown system dynamics, a critic network for value function approximation and an action network for control gain matrix approximation. We assume that each BPNN is made up of an input layer (IL), a hidden layer (HL) and an output layer (OL). Besides, the number of neurons in the HL is n, the weighting matrix between the IL and the HL is ψ, and ζ denotes the weighting matrix between the HL and the OL. The output of the BPNN is expressed as
where vec(x) denotes the vectorization of the input matrix x and ρ(·) ∈ ℝ^n represents the bounded activation function, which is denoted as
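(plausibly, assuming the common tanh choice for ρ and the standard two-layer forward pass)

$$\hat y = \zeta^{\mathrm T}\rho\!\left(\psi^{\mathrm T}\,\mathrm{vec}(x)\right), \qquad \rho(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}.$$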
To deal with the unknown system dynamics, a model network is first designed to identify the unknown system. Then, based on the model network, the critic network and the action network are employed to approximate the optimal value function and the control gain matrix. The whole structure diagram is shown in Fig. 1.
Fig. 1 Structure diagram of the iterative ADP algorithm
For the model network, we provide the initial state Z_k and the control gain matrix K as inputs; the output of the model network is
To achieve our purpose, the weighting matrices are updated using the gradient descent method:
where α_m denotes the learning rate and i is the iterative step in the updating process of the weighting matrices.
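As an illustration, a minimal one-hidden-layer BPNN trained by plain gradient descent could look as follows. The tanh activation, the squared prediction error as the loss, and all class and variable names are assumptions; the same class can serve as the model, critic, or action network with suitable input and output sizes.

import numpy as np

rng = np.random.default_rng(0)

def rho(z):
    # Bounded activation function; tanh is assumed here.
    return np.tanh(z)

class BPNN:
    # Three-layer network: psi maps IL -> HL, zeta maps HL -> OL.
    def __init__(self, n_in, n_hidden, n_out, alpha=0.05):
        self.psi = rng.uniform(-1.0, 1.0, (n_in, n_hidden))   # initialized in [-1, 1]
        self.zeta = rng.uniform(-1.0, 1.0, (n_hidden, n_out))
        self.alpha = alpha                                     # learning rate

    def forward(self, x):
        self.h = rho(self.psi.T @ x)   # hidden-layer output
        return self.zeta.T @ self.h

    def train_step(self, x, target):
        # One gradient-descent update on E = 0.5 * ||output - target||^2.
        e = self.forward(x) - target
        grad_zeta = np.outer(self.h, e)
        delta = (self.zeta @ e) * (1.0 - self.h ** 2)   # tanh derivative
        grad_psi = np.outer(x, delta)
        self.zeta -= self.alpha * grad_zeta
        self.psi -= self.alpha * grad_psi
        return 0.5 * float(e @ e)

For the first simulation example, the three networks would then be instantiated as BPNN(12, 8, 9), BPNN(9, 8, 1) and BPNN(9, 8, 3), matching the 12-8-9, 9-8-1 and 9-8-3 structures reported in Sect. 5.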
When the training of the model network succeeds, its weight matrices are kept fixed. Next, the critic network is designed for value function approximation based on the well-trained model network. Given the input state Z_k, the output of the critic network is
where α_c > 0 is the learning rate of the critic network.
The action network aims to obtain the control gain matrix K; it takes Z_k as input, and its output is given by
where α_a > 0 refers to the learning rate of the action network.
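Putting the three networks together, one outer iteration might be wired as in the sketch below. This is an assumption-laden illustration: the Bellman target follows the reconstructed value update, the networks reuse the BPNN class above, and the action target is formed from a finite-difference descent direction on the Bellman cost, substituted here for the paper's backpropagated gradient; adp_iteration, eps and step are all hypothetical names and settings.

import numpy as np

def adp_iteration(model_net, critic, action, Z_batch, Q1, R, gamma,
                  eps=1e-4, step=0.01):
    # One outer ADP iteration over a batch of states (all wiring assumed).
    m, n = R.shape[0], Q1.shape[0]

    def bellman_cost(Z, K):
        # Stage cost plus discounted critic estimate at the model's prediction.
        u = K @ Z
        Z_next = model_net.forward(np.concatenate([Z, K.ravel()]))
        return float(Z @ Q1 @ Z + u @ R @ u) + gamma * critic.forward(Z_next)[0]

    for Z in Z_batch:
        K = action.forward(Z).reshape(m, n)   # current gain estimate
        # Critic regression toward the Bellman target (value-update step).
        critic.train_step(Z, np.atleast_1d(bellman_cost(Z, K)))
        # Finite-difference surrogate for the policy-improvement step.
        grad = np.zeros_like(K)
        for idx in np.ndindex(*K.shape):
            Kp = K.copy()
            Kp[idx] += eps
            grad[idx] = (bellman_cost(Z, Kp) - bellman_cost(Z, K)) / eps
        action.train_step(Z, (K - step * grad).ravel())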
The gradient descent method is a powerful way of seeking a local minimum of a function and finally converges where the gradient is zero.
In this section, two simulation examples are performed to demonstrate the effectiveness of the proposed method.
Most existing research on tracking control for stochastic systems is limited to either state- or control-dependent multiplicative noises. In fact, it is more common that both of them exist in the SLQ optimal tracking problem. Next, considering the following linear DT stochastic system with both control- and state-dependent multiplicative noises, the one-dimensional-output optimal tracking control problem is studied:
Set Q = 10, R = 1 and γ = 0.8 for the cost function (5), while the initial state of the augmented system (19) is chosen as
The structures of the three BPNNs, including the model network, the critic network and the action network, are selected as 12-8-9, 9-8-1 and 9-8-3, respectively. Moreover, the initial weight matrices of the three BPNNs are all set to be random in [−1, 1]. To start, we set the learning rate α_m = 0.05 and train the model network for 500 iterative steps with 1000 sample data. Next, we perform the ADP algorithm based on the well-trained model network. The action network and the critic network are trained for 300 iterative steps, each with 500 inner training iterations, with the learning rates α_c and α_a both set to 0.01.
The trajectory of the value function is depicted in Fig. 2, which reveals that the value function is a nondecreasing sequence in the iteration process. Thus the conclusion of Lemma 4 is verified.
Fig. 2 Convergence of the value function during the learning process
In addition, Fig. 3 shows the curves of the control gain matrix acquired by the iterative ADP algorithm, in which the three components of the control gain matrix finally converge to fixed values. Furthermore, by defining ‖K − K*‖ = norm(K − K*), we compare the K obtained by the ADP algorithm with the optimal solution K* from the SAE (26). Figure 4 shows that ‖K − K*‖ finally approaches zero, which indicates that the ADP algorithm converges very closely to the optimal tracking controller and demonstrates its effectiveness.
Fig. 3 Curves of the control gain matrix
Fig. 4 Convergence of the control gain matrix K to the optimal K*
The obtained K above is then applied to the original system (58). Figure 5 shows that the mean square error turns into zero ultimately, which illustrates that system (2) is mean-square stabilizable and K is mean-square stabilizing. Mean-square stabilization is a statistical concept used to describe the stability of a stochastic system. Further, we describe the system output in the statistical sense based on the mathematical expectation. As shown in Fig. 6, the expectation of the system output E(y) can track the reference signal effectively, which further proves the effectiveness of the proposed ADP algorithm.
Fig. 5 Curve of the mean square errors
Fig. 6 Curves of the expectation of the output E(y) and the reference signal r
In this section, a more complex situation is considered, in which the two-dimensional-output optimal tracking control problem is studied. The linear DT stochastic system with both control- and state-dependent multiplicative noises is described by
The three networks, namely the model network, the critic network and the action network, are all established by BPNNs with structures of 20-8-16, 16-8-1 and 16-8-4, respectively. Moreover, the weight matrices of the three networks are initialized randomly in [−1, 1]. We first train the model network for 500 iterative steps with 1000 sample data using α_m = 0.01. Furthermore, to realize the ADP algorithm, the action network and the critic network are iterated for 200 steps, each with 1000 inner training iterations, using α_c = α_a = 0.001.
Based on the simulation results, it can be seen in Fig. 7 that the value function is monotonically nondecreasing, which further proves the correctness of Lemma 4.
Fig. 7 Convergence of the value function during the learning process
Besides, as shown in Fig. 8, the four components of the control gain matrix K calculated by the ADP algorithm finally converge. Then K and the optimal K*, obtained by the ADP algorithm and the analytical algorithm respectively, are compared in Fig. 9, where we can see that ‖K − K*‖ approaches zero as the time steps increase. Thus it can be concluded that K converges to K* and the ADP algorithm is feasible.
Fig. 8 Trajectory of the control gain matrix
Fig. 9 Convergence of the control gain matrix to the optimal K*
Then, the obtained control gain matrix K is applied to the stochastic system (60). Figure 10 shows that the mean square errors finally become zero, which illustrates that K is mean-square stabilizing. We then consider the system output in the statistical sense based on the mathematical expectation. From Figs. 11 and 12, it is clear that E(y1) and E(y2) achieve effective tracking of the reference signals r1 and r2, respectively; thus the ADP algorithm is valid.
Fig. 10 Curves of the mean square errors
Fig. 11 Curves of the expectation of output 1, E(y1), and the reference signal r1
Fig. 12 Curves of the expectation of output 2, E(y2), and the reference signal r2
This paper deals with the optimal tracking control problem for stochastic systems with unknown models. To obtain the optimal control strategy for this problem, a value iteration ADP algorithm is proposed. We first use BPNNs to rebuild the model via a data-driven technique. Then, based on the well-trained model, the cost function and the control gain matrix are driven close to their optimal values during the iterative process of the proposed method. Finally, two simulation examples are implemented to verify the effectiveness of the proposed algorithm.
Acknowledgements This work was supported by the National Natural Science Foundation of China (No. 61873248), the Hubei Provincial Natural Science Foundation of China (Nos. 2017CFA030, 2015CFA010), and the 111 Project (No. B17040).