
        Policy Iteration for Optimal Control of Discrete-Time Time-Varying Nonlinear Systems

IEEE/CAA Journal of Automatica Sinica, March 2023

        Guangyu Zhu, Xiaolu Li, Ranran Sun, Yiyuan Yang, and Peng Zhang

        Abstract—Aimed at infinite horizon optimal control problems of discrete-time time-varying nonlinear systems, a new iterative adaptive dynamic programming (ADP) algorithm, the discrete-time time-varying (DTTV) policy iteration algorithm, is developed in this paper. The iterative control law is designed to update the iterative value function, which approximates the optimal performance index function. The admissibility of the iterative control law is analyzed. The results show that the iterative value function is non-increasing and converges to the optimal solution of the Bellman equation. To implement the algorithm, neural networks are employed and a new implementation structure is established, which avoids solving the generalized Bellman equation in each iteration. Finally, the optimal control laws for torsional pendulum and inverted pendulum systems are obtained by using the DTTV policy iteration algorithm, where the mass and pendulum bar length are permitted to be time-varying parameters. The effectiveness of the developed method is illustrated by numerical results and comparisons.

        I. INTRODUCTION

        OPTIMAL control has been well developed for dynamical systems with time-invariant models in the control systems community [1]–[13]. As is well known, the optimal solution for a control system can be obtained by solving the Bellman equation. Generally speaking, it is very difficult to obtain the analytical solution of the Bellman equation, especially for nonlinear systems. In order to obtain approximate control solutions, several methods have been proposed, among which adaptive dynamic programming (ADP), proposed by Werbos [14], [15], has gained much attention from researchers, as it permits solving optimal control problems forward in time. To solve the Bellman equation, iterative methods are the main tools in ADP and have been increasingly emphasized [16]–[26].

        Policy iteration is one of the main iterative algorithms in ADP [27]–[29]. In 1996, Bertsekas and Tsitsiklis [30] proposed a policy iteration algorithm for nonlinear systems. In 2002, for continuous-time affine systems with a quadratic utility function, Murray et al. [31] first established the convergence of policy iteration. In 2005, for continuous-time nonlinear systems with control constraints, a bound-constrained policy iteration algorithm was proposed [32]. In [33], for polynomial nonlinear systems, policy iteration for H∞ optimal control was investigated via sum-of-squares programming. In [34], in order to obtain optimal output tracking control, a policy iteration with multi-step policy evaluation was proposed. To obtain the optimal control law from system data for an unknown system model, Yan et al. [35] proposed a Q-learning algorithm based on policy iteration, where the error of the Q function was considered. For a class of discrete-time systems with multiple delays, a data-based ADP algorithm was proposed in [36]. For affine nonlinear systems, off-policy iteration-based ADP algorithms have been widely used for data-based optimal control [37]–[39]. In [40]–[42], the optimal cooperative control for multi-agent systems was obtained by using policy iteration. In the previous literature, most policy iteration ADP algorithms focused on the optimal control of time-invariant nonlinear systems [31]–[46]. However, it should be noted that the existing policy iteration methods are aimed at optimal control problems for time-invariant systems, where both the performance index functions and the optimal control laws are functions of the system states only. To the best of the authors' knowledge, there are no policy iteration methods for discrete-time time-varying (DTTV) systems in the existing literature, which is the starting point of this paper.

        In numerous industrial applications, however, the system dynamics may change over time. In the case of time-varying systems, the optimal performance index function and the optimal control law depend not only on the state but also on the time index, which makes traditional time-invariant policy iteration algorithms unavailable for systems with time-varying dynamics. Establishing a new iteration method to cope with the time-varying properties of such systems is the main theoretical difficulty and challenge of this paper. Aimed at discrete-time infinite horizon optimal control problems of time-varying nonlinear systems, a novel DTTV policy iteration algorithm is developed in this paper. The main contributions are given as follows:

        1) A novel DTTV policy iteration algorithm is presented in this paper.It is the first time that an ADP method is developed for DTTV systems.

        2) The properties of the presented algorithm, including monotonicity, convergence and optimality, are analyzed in detail.

        3) Simulation results show that the presented method can obtain the optimal control law and optimal performance index function, which verifies the correctness of the DTTV policy iteration method.

        First, for the DTTV policy iteration algorithm, the iteration rules for the iterative value function and the iterative control law are designed. Next, it is shown that every iterative time-varying control law is an admissible control law. Then, the fact that the value functions are monotonically non-increasing and converge to the optimal performance index function is proven. We establish new neural network structures to approximate the time-varying iterative control laws and value functions, and the developed DTTV algorithm is implemented, where the difficulty of solving the generalized time-varying Bellman equation is avoided. Finally, in order to solve the optimal control problems for the torsional pendulum and inverted pendulum systems, the presented DTTV algorithm is implemented, where the mass and length of the systems are permitted to be time-varying parameters. The effectiveness of the developed algorithm is verified by the numerical results.

        II. PROBLEM FORMULATIONS

        Therefore, we express the Bellman equation (4) as
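        In the notation used throughout this paper, the time-varying Bellman equation for a system with utility function U(xk, uk, k) can be written as

        V*(xk, k) = min_{uk} {U(xk, uk, k) + V*(xk+1, k+1)},  xk+1 = F(xk, uk, k),

        where V* denotes the optimal performance index function and F denotes the time-varying system function of (1); the symbols V* and F are notational assumptions of this sketch, consistent with the policy evaluation and improvement steps (10) and (11) below.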

        III. DISCRETE-TIME TIME-VARYING POLICY ITERATION

        The derivations of the DTTV policy iteration algorithm are discussed in this section. We first study the admissibility of the iterative control law, and then analyze the convergence of the iterative value functions to prove that they converge to the optimum.

        A. Derivations of the DTTV Policy Iteration Algorithm

        Algorithm 1 DTTV Policy Iteration Algorithm
        Require: An admissible control law v0(xk, k) chosen randomly; a computation precision σ.
        Ensure: vi(xk, k), Vi(xk, k).
        1: Let the iteration index i = 0. Construct V0(xk, k) which satisfies the time-varying Bellman equation (7).
        2: Let i = i + 1. Perform policy improvement
               vi(xk, k) = arg min_{uk} {U(xk, uk, k) + Vi−1(xk+1, k+1)}.   (10)
        3: Perform policy evaluation
               Vi(xk, k) = U(xk, vi(xk, k), k) + Vi(xk+1, k+1).   (11)
        4: If Vi−1(xk, k) − Vi(xk, k) > σ, go to Step 2.
        5: Return vi(xk, k), Vi(xk, k).
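        To make the iteration structure of Algorithm 1 concrete, the following is a minimal tabular sketch of the DTTV policy iteration loop in Python. It assumes a generic system function f(x, u, k), a utility function U(x, u, k), a finite state grid, a finite set of candidate controls, and truncation of the value function at a horizon Tf; all of these are illustrative assumptions, whereas the paper itself uses neural network approximators (Section IV) rather than tables.

        import numpy as np

        def dttv_policy_iteration(f, U, states, controls, Tf, sigma=1e-6, max_iter=50):
            """Tabular sketch of the DTTV policy iteration (Algorithm 1).

            states   : list of discrete state vectors used as a grid (assumption).
            controls : list of candidate scalar controls (assumption).
            Tf       : horizon after which V(x, k) is treated as zero.
            """
            n_s, n_k = len(states), Tf + 1

            def nearest(x):
                # Map a successor state back onto the grid (illustrative only).
                return int(np.argmin([np.linalg.norm(np.asarray(x) - np.asarray(s)) for s in states]))

            def evaluate(policy):
                # Policy evaluation (11): backward-in-time sweep of
                # V_i(x_k, k) = U(x_k, v_i(x_k, k), k) + V_i(x_{k+1}, k+1), with V_i(., k > Tf) = 0.
                V = np.zeros((n_s, n_k + 1))
                for k in range(n_k - 1, -1, -1):
                    for si, x in enumerate(states):
                        u = policy[si, k]
                        V[si, k] = U(x, u, k) + V[nearest(f(x, u, k)), k + 1]
                return V[:, :n_k]

            # Step 1: start from an (assumed) admissible initial policy and evaluate it.
            policy = np.zeros((n_s, n_k))
            V_prev = evaluate(policy)

            for _ in range(max_iter):
                # Step 2: policy improvement (10), v_i = argmin_u {U + V_{i-1}(x_{k+1}, k+1)}.
                for k in range(n_k):
                    for si, x in enumerate(states):
                        costs = [U(x, u, k) + (V_prev[nearest(f(x, u, k)), k + 1] if k + 1 < n_k else 0.0)
                                 for u in controls]
                        policy[si, k] = controls[int(np.argmin(costs))]
                # Step 3: policy evaluation of the improved policy.
                V = evaluate(policy)
                # Step 4: stop when the value function decreases by no more than sigma.
                if np.max(V_prev - V) <= sigma:
                    break
                V_prev = V
            return policy, V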

        Remark 1: It is worth pointing out the difference between the traditional time-invariant policy iteration algorithm in [46] and the developed algorithm (7)–(9). First, for the traditional time-invariant policy iteration algorithm in [46], the iterative control law and the iterative value function are both time-invariant functions, which depend only on the state xk. In contrast, the iterative control law and the iterative value function of the time-varying policy iteration algorithm are both time-varying functions, which means that the iterative control law vi(xk, k) and the iterative value function Vi(xk, k) are different for different k. Second, for the algorithm in [46], once the iterative control law vi(xk) is obtained, the iterative value function can be derived from the following equation:
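        This equation, later referred to as (12), presumably takes the standard time-invariant policy evaluation form (the symbol F for the time-invariant system function is assumed here):

        Vi(xk) = U(xk, vi(xk)) + Vi(xk+1),  xk+1 = F(xk, vi(xk)),

        in which the same unknown function Vi appears on both sides and can therefore be solved for directly, in contrast with the time-varying case discussed next.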

        In the time-varying policy iteration algorithm, for any i = 0, 1, ..., the iterative value functions Vi(xk+1, k+1) and Vi(xk, k) in (9) are different functions. This implies that we cannot obtain the iterative value function Vi(xk, k) by directly solving (7) and (9). Furthermore, for the time-varying policy iteration algorithm, the analysis method for the time-invariant algorithm in [46] cannot be directly used. Hence, new analysis and implementation methods for the time-varying policy iteration algorithm are developed in this paper. On the other hand, if the system and the utility function are both time-invariant, then the DTTV policy iteration algorithm reduces to the traditional policy iteration algorithm in [46]. Hence, the traditional policy iteration algorithm is a special case of the developed DTTV policy iteration algorithm, and the DTTV policy iteration algorithm possesses more application potential.

        Remark 2: With the development of computational capacity, there is a growing need for the control of complex systems. For complex systems, the system models are often nonlinear and time-varying, which brings great challenges for control and optimization. Furthermore, traditional policy iteration methods can only tackle optimization problems for systems whose iterative control law and iterative value function are time-invariant. Establishing a new policy iteration method to obtain the optimal iterative control law and the optimal iterative value function of DTTV systems is therefore also a challenge.

        B. Property Analysis

        Although the properties of the policy iteration algorithm for time-invariant nonlinear systems were analyzed in [46], that analysis cannot be directly applied to time-varying nonlinear systems. In this subsection, the properties of the DTTV policy iteration algorithm are analyzed. The optimality and convergence of the iterative value function are established, and the admissibility of the iterative control law is discussed. First, some lemmas are necessary for the discussion.

        Lemma 1: Let vi(xk, k) and Vi(xk, k) be obtained by (7)–(9) for i = 0, 1, .... If vi(xk, k), i = 0, 1, ..., is an admissible control law for system (1), then there exists an iterative value function Vi(xk, k) that satisfies (7) and (9).

        Proof: For the admissible control law vi(xk, k), when i = 0, 1, ..., we get

        Fig. 1. The structure diagram of the algorithm.

        After combining (33) and (38), the conclusion (28) can be drawn. ■

        IV. IMPLEMENTATION OF THE DTTV POLICY ITERATION

        A. The Critic Network

        The iterative value function Vi(xk, k), i = 0, 1, ..., is approximated by using a critic network. It should be noted that, for the approximation in the time-varying policy iteration algorithm, the neural network approximation of the time-invariant policy iteration algorithm [46] is invalid. First, for the traditional time-invariant policy iteration algorithm, the iterative value function in [46] is a time-invariant function of xk only, i.e., Vi(xk). The iterative value function for the DTTV algorithm is a function of not only xk but also k, i.e., Vi(xk, k). Hence, for the time-varying policy iteration algorithm, the neural network approximation of the traditional time-invariant policy iteration algorithm is unsuitable. Second, for time-invariant policy iteration algorithms, the iterative value function Vi(xk) can be obtained by solving (12), which is generally easy once an iterative control law vi(xk) is given. However, Vi(xk, k) cannot be obtained by directly solving (7) and (9) in the time-varying policy iteration algorithm when an iterative control law vi(xk, k) is given. Thus, in this subsection, the neural network approximation must be improved.

        For i = 0, 1, ..., the iterative control law vi(xk, k) is proven to be admissible, and the iterative value function Vi(xk, k) converges to zero as k → ∞. Thus, for any i = 0, 1, ..., the iterative value function Vi(xk, k) for 0 ≤ k < ∞ can be approximated by the function Vi(xk, k) for 0 ≤ k ≤ Tf, where Tf is a large positive integer and, for all k > Tf, we regard Vi(xk, k) = 0. Let M be a positive integer. Let kj, j = 0, 1, ..., M, be time instants which satisfy 0 ≤ kj ≤ Tf.
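        As an illustration of this structure, the following is a minimal Python sketch of a critic network that takes the pair (xk, k) as input, in the spirit of the 3–8–1 networks used in Section V. The use of PyTorch, the normalization of k by Tf, and all parameter names are assumptions for the sketch and are not details taken from the paper.

        import torch
        import torch.nn as nn

        class TimeVaryingCritic(nn.Module):
            """Critic approximating V_i(x_k, k): input is the 2-D state plus the (normalized) time index."""
            def __init__(self, state_dim=2, hidden=8, Tf=200):
                super().__init__()
                self.Tf = Tf
                # 3-8-1 structure: (x1, x2, k/Tf) -> 8 hidden units -> scalar value.
                self.net = nn.Sequential(
                    nn.Linear(state_dim + 1, hidden),
                    nn.Tanh(),
                    nn.Linear(hidden, 1),
                )

            def forward(self, x, k):
                # x: (batch, state_dim), k: (batch, 1) integer time indices.
                k_scaled = k.float() / self.Tf   # keep the time input in a bounded range
                return self.net(torch.cat([x, k_scaled], dim=1))

        # Example: evaluate the critic at a batch of sampled (state, time) pairs.
        critic = TimeVaryingCritic()
        x = torch.randn(16, 2)
        k = torch.randint(0, 200, (16, 1))
        value = critic(x, k)                     # shape (16, 1)

        Feeding the time index as an extra input is what distinguishes this critic from the time-invariant one of [46], which receives the state alone.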

        B. The Action Network

        V. NUMERICAL EXAMPLES

        In this section, the effectiveness of the developed algorithm is illustrated by two pendulum systems, which are the torsional pendulum and inverted pendulum systems, respectively.

        Example 1: First, the performance of the developed algorithm is considered for a torsional pendulum system [47] with modifications, where the mass of the pendulum is time-varying. The pendulum dynamics are given as follows:

        To implement the algorithm, neural networks are used. Three-layer BP neural networks are adopted for the action network and the critic network, with structures of 3–8–1 and 3–8–1, respectively. In each iteration, the action network and the critic network are trained for 2000 steps under the learning rate 0.01, so that the neural network training errors are less than 10−4. Let Tf = 10 s and N = 200. Implementing the algorithm for 20 iterations, the convergence of Vi(xk, k) at x0 is shown in Fig. 2. In Fig. 3, the control and state trajectories are shown. The results indicate that the iterative value function is non-increasing and converges to the optimum. The iterative control law is admissible in each iteration.
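        The training procedure described above can be sketched as follows for the critic network; this is an illustrative gradient-descent loop fitting the critic to policy evaluation targets, using the learning rate 0.01, 2000 training steps, and the 10−4 error threshold reported in the text. The optimizer choice, loss function, and target construction are assumptions of the sketch rather than the paper's own training rules, and the same loop would apply analogously to the action network.

        import torch
        import torch.nn as nn

        # 3-8-1 critic taking (x1, x2, k) as input (illustrative assumption).
        critic = nn.Sequential(nn.Linear(3, 8), nn.Tanh(), nn.Linear(8, 1))
        optimizer = torch.optim.SGD(critic.parameters(), lr=0.01)

        def train_critic(inputs, targets, steps=2000, tol=1e-4):
            """Fit V_i(x_k, k) to targets U(x_k, v_i(x_k, k), k) + V_i(x_{k+1}, k+1).

            inputs : tensor of shape (batch, 3) holding (x1, x2, k).
            targets: tensor of shape (batch, 1) holding the right-hand side of (11).
            """
            loss_val = float("inf")
            for _ in range(steps):
                optimizer.zero_grad()
                loss = torch.mean((critic(inputs) - targets) ** 2)
                loss.backward()
                loss_val = loss.item()
                optimizer.step()
                if loss_val < tol:   # stop once the training error falls below 10^-4
                    break
            return loss_val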

        Fig. 2. Convergence plots of the iterative value functions.

        In order to show the effectiveness of the developed method, the numerical optimal open-loop solution of the continuous-time time-varying optimal control problem is obtained by direct collocation methods [48]. In detail, the continuous-time time-varying optimal control problem is first discretized and formulated as a nonlinear programming problem using the publicly available Open Optimal Control Library [49]. Then the resulting nonlinear programming problem is solved by the interior point optimizer [50]. In Fig. 4, the optimal open-loop control and state trajectories are shown. From Figs. 3 and 4, it is shown that the trajectories obtained by the direct collocation methods are almost the same as the ones obtained by the developed method. Thus, the optimality of the developed algorithm is verified. However, the optimal control obtained by the direct collocation methods is an open-loop control law and is applicable to a single initial state, while the optimal control obtained by the developed method is a closed-loop optimal control law and is applicable to all initial states.

        Example 2: The discretized inverted pendulum system [51] with modifications is considered in the second example. The dynamics of the time-varying pendulum system are expressed as

        Fig. 3. Trajectories of states and controls. (a) x1; (b) x2; (c) u.

        Fig. 4. Trajectories of optimal states and controls. (a) x1; (b) x2; (c) u.

        With the sampling interval Δt = 0.1 s, the performance index function and the system function are discretized, which leads to

        where mk and ℓk denote the time-varying mass and length of the pendulum bar, respectively. Let g = 9.8 m/s2 be the gravitational acceleration, and let κ = 0.2 be the frictional factor. The same utility function as the one in Example 1 is chosen. To implement the algorithm, neural networks are used. Three-layer BP neural networks are adopted for the action network and the critic network, with structures of 3–8–1 and 3–8–1, respectively.
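        As a rough illustration only, a forward-Euler discretization of a single-link pendulum with time-varying mass mk and bar length ℓk might look like the following sketch in Python. The specific functional form of the dynamics, the state definition (x1 the angle, x2 the angular velocity), and the time-varying mass and length profiles are assumptions of this sketch and are not the discretized equations of the paper.

        import numpy as np

        DT = 0.1        # sampling interval (s), as stated in the text
        G = 9.8         # gravitational acceleration (m/s^2)
        KAPPA = 0.2     # frictional factor

        def m_k(k):
            # Hypothetical time-varying mass profile (for illustration only).
            return 0.5 + 0.1 * np.sin(0.05 * k)

        def l_k(k):
            # Hypothetical time-varying bar length profile (for illustration only).
            return 1.0 + 0.2 * np.cos(0.05 * k)

        def pendulum_step(x, u, k):
            """One forward-Euler step of an assumed single-link pendulum model."""
            x1, x2 = x
            m, l = m_k(k), l_k(k)
            x1_next = x1 + DT * x2
            x2_next = x2 + DT * (G / l * np.sin(x1) - KAPPA / (m * l**2) * x2 + u / (m * l**2))
            return np.array([x1_next, x2_next])

        # Example rollout from an initial state under zero control.
        x = np.array([0.5, 0.0])
        for k in range(5):
            x = pendulum_step(x, u=0.0, k=k)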

        Fig. 5. Numerical results for Case 1. (a) Iterative value function; (b) State x1; (c) State x2; (d) Control.

        Fig. 6. Numerical results for Case 2. (a) Iterative value function; (b) State x1; (c) State x2; (d) Control.

        Fig. 7. Numerical results for Case 3. (a) Iterative value function; (b) State x1; (c) State x2; (d) Control.

        Fig. 8. Numerical results for Case 4. (a) Iterative value function; (b) State x1; (c) State x2; (d) Control.

        Compared with the time-invariant system in Case 1, the convergence processes in Cases 2–4 for the time-varying systems are obviously different. The optimal performance index functions and optimal control laws for the time-varying systems are also different from the ones for the time-invariant system. On the other hand, for the different time-varying systems in Cases 2–4, the convergence processes also differ from each other. However, in all cases, using the DTTV policy iteration algorithm, the iterative value function is non-increasing and converges to the optimum. Furthermore, the iterative control law is admissible. Hence, the convergence, optimality, and admissibility properties of the developed DTTV policy iteration algorithm are verified.

        VI. CONCLUSIONS

        A novel iterative ADP algorithm, called the "discrete-time time-varying (DTTV) policy iteration" algorithm, is presented in this paper. For discrete-time time-varying nonlinear systems, the developed algorithm is used to solve optimal control problems with an infinite horizon. An admissible control law is used to initialize the DTTV policy iteration algorithm, and two iteration procedures are introduced. In the policy improvement step, the iterative control law is updated according to the iterative value function. Next, in the policy evaluation step, the iterative value function is updated via the improved iterative control law. It is shown that the iterative value function is non-increasing and converges to the optimum. The DTTV policy iteration algorithm is implemented by neural networks, and detailed training rules of the networks are presented. Then, the optimal control laws for torsional pendulum and inverted pendulum systems are obtained by using the DTTV policy iteration ADP algorithm, where the mass and pendulum bar length are permitted to be time-varying parameters. Finally, the performance of the presented method is illustrated by the numerical results and comparisons. Interesting future work is to extend the time-varying policy iteration method to optimal control problems with finite-horizon performance index functions and to optimal control problems of periodic systems.
