

        Minimax Q-learning design for H∞ control of linear discrete-time systems*

        2022-03-25 10:50:20

        Xinxing LI, Lele XI, Wenzhong ZHA, Zhihong PENG

        1Information Science Academy, China Electronics Technology Group Corporation, Beijing 100086, China

        2School of Automation, Beijing Institute of Technology, Beijing 100081, China

        3Peng Cheng Laboratory, Shenzhen 518052, China

        Abstract: The H∞ control method is an effective approach for attenuating the effect of disturbances on practical systems, but it is difficult to obtain the H∞ controller due to the nonlinear Hamilton-Jacobi-Isaacs equation, even for linear systems. This study deals with the design of an H∞ controller for linear discrete-time systems. To solve the related game algebraic Riccati equation (GARE), a novel model-free minimax Q-learning method is developed, on the basis of an offline policy iteration algorithm, which is shown to be Newton's method for solving the GARE. The proposed minimax Q-learning method, which employs off-policy reinforcement learning, learns the optimal control policies for the controller and the disturbance online, using only the state samples generated by the implemented behavior policies. Different from existing Q-learning methods, a novel gradient-based policy improvement scheme is proposed. We prove that the minimax Q-learning method converges to the saddle solution under initially admissible control policies and an appropriate positive learning rate, provided that certain persistence of excitation (PE) conditions are satisfied. In addition, the PE conditions can be easily met by choosing appropriate behavior policies containing certain excitation noises, without causing any excitation noise bias. In the simulation study, we apply the proposed minimax Q-learning method to design an H∞ load-frequency controller for an electrical power system generator that suffers from load disturbance, and the simulation results indicate that the obtained H∞ load-frequency controller has good disturbance rejection performance.

        Key words: H∞ control; Zero-sum dynamic game; Reinforcement learning; Adaptive dynamic programming; Minimax Q-learning; Policy iteration

        1 Introduction

        Reinforcement learning (RL) is an efficient machine learning technique for dealing with sequential decision-making problems in which an agent interacts with an external environment, such as Markov decision processes (Sutton and Barto, 1998). The core mechanism of RL is that an agent unceasingly modifies its action, based on the observed stimuli or reward received from the environment, via trial and error. Compared with the traditional dynamic programming (DP) technique for handling sequential decision-making problems, RL runs forward in time (i.e., online), overcomes the curse-of-dimensionality problem, and can find the optimal policy even in a dynamic environment, e.g., dynamic games. It has been shown that RL combines the advantages of optimal and adaptive control (Kiumarsi et al., 2018), which makes it a promising technique for solving optimal control problems and dynamic games. In the control field, RL is also referred to as adaptive dynamic programming (ADP). ADP approaches can be classified into several main schemes: heuristic dynamic programming (HDP), action-dependent HDP (ADHDP), dual heuristic dynamic programming (DHP), ADDHP, globalized DHP (GDHP), and ADGDHP (Prokhorov and Wunsch, 1997). During the last few years, many elegant ADP approaches have been proposed to solve optimal control problems (He and Zhong, 2018; Li HR et al., 2020) and dynamic games (Vamvoudakis et al., 2017; Zhu et al., 2017; Li XX et al., 2019; Valadbeigi et al., 2020).

        Due to the uncertainty caused by the environment, most practical systems suffer from external disturbances. To attenuate the effect of disturbances on the system performance, controllers that can offer robust performance and guarantee stabilization are needed. One of the most effective approaches is H∞ control theory, which concentrates on designing controllers to achieve disturbance attenuation in the L2-gain setting (Doyle et al., 1989; Başar and Bernhard, 1995). It is well known that obtaining an H∞ controller requires solving the nonlinear Hamilton-Jacobi-Isaacs (HJI) equation. However, obtaining the analytic solution of the HJI equation is impossible in general, so an approximate solution is obtained instead (Sakamoto and van der Schaft, 2008). Over the past few years, many ADP methods have been developed to solve continuous-time HJI equations. Luo et al. (2015) proposed a model-free policy iteration (PI) algorithm with one iteration loop for designing an H∞ controller for nonlinear continuous-time systems, by employing off-policy RL. Modares et al. (2015) developed an online off-policy ADP algorithm for the H∞ tracking control of continuous-time systems, to name a few.

        Compared with H∞ control of continuous-time systems, H∞ control of discrete-time systems is more challenging, because the discrete-time HJI equations do not have a closed-loop form (Başar and Bernhard, 1995). To solve discrete-time HJI equations, Mehraeen et al. (2013) proposed an offline PI algorithm with two iteration loops by using Taylor series. To obviate the need for knowledge of the system, Zhang et al. (2014) proposed an online PI algorithm by introducing a neural network (NN) identification scheme. Further, a completely model-free GDHP approach was presented without the need for the NN identifier (Zhong et al., 2018). By employing off-policy RL, Kiumarsi et al. (2017) proposed a model-free ADP method that can learn the H∞ controller for linear discrete-time systems online. Q-learning serves as another powerful tool for handling discrete-time H∞ control problems. The first Q-learning method with guaranteed convergence was proposed by Watkins and Dayan (1992) to solve Markov decision processes by employing the temporal-difference (TD) learning technique. Then, minimax Q-learning and Nash Q-learning were developed for zero-sum and nonzero-sum stochastic games with finite state and action spaces, respectively (Littman, 2001). Over the past few years, many efficient Q-learning approaches have been developed for optimal control (Wei QL et al., 2017; Luo et al., 2018; Wei YF et al., 2019; Yan et al., 2019) and H∞ control (Al-Tamimi et al., 2007; Rizvi and Lin, 2018; Valadbeigi et al., 2020). In Al-Tamimi et al. (2007), a value-iteration-based Q-learning algorithm with a convergence guarantee was presented to solve the discrete-time zero-sum game problem, but this algorithm suffers from the excitation noise bias problem, because the probing noises injected in the policy evaluation step cause excitation noise bias (Kiumarsi et al., 2017).
        On the basis of this state feedback Q-learning, output feedback Q-learning methods that overcome the excitation noise bias problem have been proposed (Rizvi and Lin, 2018; Valadbeigi et al., 2020). Generally speaking, most of the existing Q-learning methods for H∞ control of linear discrete-time systems are based on value iteration. Meanwhile, theoretical foundations for policy-iteration-based Q-learning are relatively lacking in the literature. Although the convergence analyses of minimax Q-learning and policy iteration for stochastic games were given in Littman (2001) and Hansen et al. (2003), respectively, these results do not hold for H∞ control of discrete-time systems with continuous state and action spaces.

        Inspired by off-policy RL and adaptive control, we develop a novel policy-iteration-based minimax Q-learning method for H∞ control of linear discrete-time systems, with guaranteed convergence. The main contributions of this study are summarized as follows:

        1. The proposed policy-iteration-based minimax Q-learning method, which employs an off-policy RL technique, learns the H∞ controller online using only the state samples generated by the behavior policies, without querying the system model or causing any excitation noise bias.

        2. Different from existing Q-learning methods (Al-Tamimi et al., 2007; Rizvi and Lin, 2018; Valadbeigi et al., 2020; Luo et al., 2021), we develop a novel policy improvement scheme by borrowing the idea of a stochastic gradient algorithm. The newly improved control policies can be obtained via online learning without the need to calculate the inverse of the value matrix after performing policy evaluation. Moreover, this policy improvement scheme applies to H∞ control of nonlinear discrete-time systems.

        3. Unlike TD-based minimax Q-learning for stochastic games (Littman, 2001), our minimax Q-learning method is based on policy iteration. In addition, we give a rigorous convergence analysis of offline policy iteration for H∞ control of linear discrete-time systems by proving its equivalence to Newton's method for solving the game algebraic Riccati equation (GARE), and on this basis, we prove that the proposed policy-iteration-based minimax Q-learning method converges to the exact saddle solution under an appropriate learning rate and certain persistence of excitation (PE) conditions.

        Notations: R^n denotes the n-dimensional Euclidean space. R^{n×m} is the set of real n×m matrices. ⊗ stands for the Kronecker product. vec(·) is the vectorization operator that stacks each column of a matrix into a one-column vector. For a vector x ∈ R^n, the Kronecker product quadratic polynomial basis vector of x is defined as σ(x) = [x_1^2, ..., x_1 x_n, x_2^2, ..., x_2 x_n, ..., x_{n-1} x_n, x_n^2]^T. The Frobenius norm of a matrix A ∈ R^{n×m} is defined as ‖A‖ = (tr(A^T A))^{1/2}, where tr(·) represents the trace of a matrix. For a real symmetric matrix E ∈ R^{n×n}, λ_min(E), λ_max(E), and ρ(E) denote the minimum eigenvalue, maximum eigenvalue, and spectral radius of E, respectively.
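As a quick numerical check of this notation, the sketch below builds the basis vector σ(x) for a 3-vector, the column-stacking vec(·), and the Frobenius norm in NumPy; the helper name `quad_basis` is ours, not the paper's:

```python
import numpy as np

# Illustrative check of the notation (helper names are ours, not the paper's).
def quad_basis(z):
    """Kronecker product quadratic basis: all distinct monomials z_i z_j, i <= j."""
    z = np.asarray(z, dtype=float)
    n = z.size
    return np.array([z[i] * z[j] for i in range(n) for j in range(i, n)])

sigma = quad_basis([1.0, 2.0, 3.0])     # [x1^2, x1x2, x1x3, x2^2, x2x3, x3^2]

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
vec_A = A.reshape(-1, order="F")        # vec(A) stacks columns: [1, 3, 2, 4]
fro = np.sqrt(np.trace(A.T @ A))        # Frobenius norm, (tr(A^T A))^(1/2)
```

For x ∈ R^n the basis has n(n+1)/2 entries, which is the number of free entries of a symmetric value matrix.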

        2 Problem statement

        In this section, we give the formulation of the worst-case controller design problem, that is, the H∞ optimal control of linear discrete-time systems. Consider the following linear discrete-time system with two types of inputs:

        x_{k+1} = A x_k + B u_k + D w_k,   (1)

        where x_k ∈ R^n is the system state vector, u_k ∈ R^{m1} is the control input vector, and w_k ∈ R^{m2} is an external disturbance input vector belonging to the square-summable space L2[0, ∞), i.e., Σ_{k=0}^{∞} w_k^T w_k < ∞ (thus, w_k has finite energy). The plant matrix A ∈ R^{n×n} and the input matrices B ∈ R^{n×m1} and D ∈ R^{n×m2} are assumed to be unknown.
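For concreteness, system (1) can be simulated in a few lines; the matrices below are illustrative placeholders, not a model from the paper (which treats A, B, D as unknown):

```python
import numpy as np

# Illustrative instance of system (1): x_{k+1} = A x_k + B u_k + D w_k.
# A, B, D are placeholder matrices chosen here for demonstration only.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
D = np.array([[0.1],
              [0.0]])

def step(x, u, w):
    """One step of the linear discrete-time dynamics."""
    return A @ x + B @ u + D @ w

x = np.array([1.0, -1.0])
for k in range(50):
    u = np.zeros(1)                      # open loop: no control
    w = np.array([np.exp(-0.1 * k)])     # square-summable (finite-energy) disturbance
    x = step(x, u, w)
```

Because this placeholder A is stable and the disturbance is square-summable, the state decays toward the origin.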

        The aim of H∞ control is to find the optimal control policy u* such that system (1) is asymptotically stable with w_k = 0 and the following disturbance attenuation condition

        Σ_{k=0}^{∞} (x_k^T S x_k + u_k^T R u_k) ≤ γ^2 Σ_{k=0}^{∞} w_k^T w_k   (2)

        is satisfied, where S and R are user-defined positive definite matrices, and γ > 0 is a prescribed constant disturbance attenuation level. To make sure that the problem is solvable, we make controllability and observability assumptions on (A, B) and (A, S^{1/2}), respectively.

        According to Başar and Bernhard (1995), the H∞ optimal control problem can be equivalently translated into a two-player zero-sum linear quadratic dynamic game, i.e., the following minimax optimization problem:

        J(u*, w*) = min_u max_w J(u, w) = min_u max_w Σ_{k=0}^{∞} (x_k^T S x_k + u_k^T R u_k - γ^2 w_k^T w_k),   (3)

        subject to the system dynamics (1). The pair (u*, w*) is a feedback saddle solution if the saddle-point inequality

        J(u*, w) ≤ J(u*, w*) ≤ J(u, w*)   (4)

        is satisfied for arbitrary admissible control policies u and w. From inequality (4), we know that no player will deviate from (u*, w*), because a unilateral change of strategy will cause a loss of revenue for the deviating player.

        According to Bellman's principle of optimality, the feedback saddle solution (u*, w*) should satisfy the following Bellman optimality equation:

        V*(x_k) = min_{u_k} max_{w_k} {x_k^T S x_k + u_k^T R u_k - γ^2 w_k^T w_k + V*(x_{k+1})},   (5)

        where x_{k+1} = A x_k + B u_k* + D w_k*. From Başar and Bernhard (1995), we can represent the value function V*(x_k) as a quadratic form of the state, i.e., V*(x_k) = x_k^T P x_k, where P is the positive semi-definite value matrix. Substituting V*(x_k) = x_k^T P x_k into Eq. (5) gives the feedback gains corresponding to the saddle solution:

        [K1*; K2*] = ζ(P) [B^T P A; D^T P A],   (6)

        where

        ζ(P) = [R + B^T P B, B^T P D; D^T P B, D^T P D - γ^2 I]^{-1};   (7)

        hence, u_k* = -K1* x_k and w_k* = -K2* x_k. Substituting u_k* and w_k* into Eq. (5) then yields the compact form of the GARE:

        P = S + A^T P A - [A^T P B, A^T P D] ζ(P) [B^T P A; D^T P A].   (8)

        To guarantee a unique feedback saddle solution, the following inequalities

        R + B^T P B > 0,   (9)
        γ^2 I - D^T P D > 0   (10)

        should be satisfied (Başar and Bernhard, 1995). Furthermore, the disturbance attenuation level γ should be selected such that γ ≥ γ* > 0 is satisfied, where γ* > 0 is the infimum of the achievable attenuation levels.

        From Eqs. (6)-(8), we know that obtaining K1* and K2* requires solving the GARE, which is a nonlinear matrix equation. Moreover, Eqs. (6)-(8) depend on full knowledge of A, B, and D, which are assumed to be unknown in this study. In the following sections, we will develop a minimax Q-learning algorithm to learn K1* and K2* online without querying the system matrices A, B, and D.

        3 Offline policy iteration for zero-sum linear quadratic dynamic games

        Before deriving the online minimax Q-learning algorithm, we first introduce the model-based offline PI algorithm deduced from Algorithm 1 in Kiumarsi et al. (2017). The offline PI algorithm lays the foundation for the subsequent minimax Q-learning algorithm. The offline PI algorithm, employing a successive approximation technique, indirectly solves the nonlinear GARE (8) by constructing a sequence of linear matrix equations. The detailed procedure is given in Algorithm 1.

        Algorithm 1  Model-based offline policy iteration algorithm
        1: Start with a set of initially stabilizing feedback gains (K1^1, K2^1)  // Initialization
        2: For the given stabilizing feedback gains (K1^l, K2^l), solve for the corresponding value matrix P^{l+1} via the following matrix equation:  // Policy evaluation
           P^{l+1} = S + (K1^l)^T R K1^l - γ^2 (K2^l)^T K2^l + (A - B K1^l - D K2^l)^T P^{l+1} (A - B K1^l - D K2^l)   (11)
        3: Update the control policy and disturbance policy using the following equation:  // Policy improvement
           [K1^{l+1}; K2^{l+1}] = ζ(P^{l+1}) [B^T P^{l+1} A; D^T P^{l+1} A],   (12)
           where ζ(P^{l+1}) = [R + B^T P^{l+1} B, B^T P^{l+1} D; D^T P^{l+1} B, D^T P^{l+1} D - γ^2 I]^{-1}
        4: Stop if ‖K_i^{l+1} - K_i^l‖ ≤ ε (i = 1, 2), where ε is a threshold; otherwise, set l = l + 1 and go to step 2

        The policy evaluation step (Eq. (11)) is used to evaluate the performance of the given control policy u^l = -K1^l x and the disturbance policy w^l = -K2^l x. After policy evaluation, a new control policy and disturbance policy are obtained via the certainty equivalence principle; that is, the obtained value matrix P^{l+1} is regarded as the optimal value matrix P. In Theorem 1, we will prove that (K1^l, K2^l) converges to (K1*, K2*) under any initially stabilizing feedback gains (K1^1, K2^1).
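When the model (A, B, D) is available, Algorithm 1 can be sketched directly in NumPy. The plant below is an illustrative placeholder, not the paper's power system, and a fixed iteration count stands in for the threshold test in step 4:

```python
import numpy as np

def offline_pi(A, B, D, S, R, gamma, K1, K2, n_iter=50):
    """Sketch of Algorithm 1: offline policy iteration for the zero-sum LQ game."""
    n, m1 = B.shape
    m2 = D.shape[1]
    for _ in range(n_iter):
        # Policy evaluation (Eq. (11)): a Lyapunov equation in P, solved by
        # vectorization: (I - Ac^T ⊗ Ac^T) vec(P) = vec(Q) for symmetric P.
        Ac = A - B @ K1 - D @ K2
        Q = S + K1.T @ R @ K1 - gamma**2 * (K2.T @ K2)
        P = np.linalg.solve(np.eye(n * n) - np.kron(Ac.T, Ac.T),
                            Q.reshape(-1)).reshape(n, n)
        # Policy improvement (Eq. (12)): [K1; K2] = ζ(P) [B^T P A; D^T P A].
        zeta_inv = np.block([[R + B.T @ P @ B, B.T @ P @ D],
                             [D.T @ P @ B, D.T @ P @ D - gamma**2 * np.eye(m2)]])
        K = np.linalg.solve(zeta_inv, np.vstack([B.T @ P @ A, D.T @ P @ A]))
        K1, K2 = K[:m1, :], K[m1:, :]
    return P, K1, K2

# Illustrative stable plant (placeholder matrices, not from the paper).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
D = np.array([[0.1], [0.0]])
P, K1, K2 = offline_pi(A, B, D, np.eye(2), np.eye(1), gamma=5.0,
                       K1=np.zeros((1, 2)), K2=np.zeros((1, 2)))
```

Solving Eq. (12) via `np.linalg.solve` avoids forming ζ(P^{l+1}) explicitly; the inverse is only implied.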

        Before giving the convergence proof, we define two useful mappings. Consider the space R^{n×n} composed of all n×n real matrices. We can easily verify that R^{n×n} forms a Banach space under the Frobenius norm. We now define a mapping F: R^{n×n} → R^{n×n}:

        F(P^l) = S + A^T P^l A - Θ(P^l) ζ(P^l) Θ^T(P^l) - P^l,   (13)

        where Θ(P^l) = [A^T P^l B, A^T P^l D] and ζ(P^l) = [R + B^T P^l B, B^T P^l D; D^T P^l B, D^T P^l D - γ^2 I]^{-1}.

        From the definition of F, we know that P is the zero-point of the mapping F. Now we define a new mapping based on F. The new mapping T: R^{n×n} → R^{n×n} is given as follows:

        T(P^l) = P^l - (F'_{P^l})^{-1} F(P^l),   (14)

        where F'_{P^l} is the Fréchet derivative of F taken with respect to P^l. Clearly, Eq. (14) is exactly Newton's method for obtaining the zero-point of F, or equivalently, the fixed-point of Eq. (8). Directly calculating the Fréchet derivative is often intractable, so we calculate the Gateaux derivative instead.

        Definition 1 (Gateaux derivative)  Let Ξ: U(V) ⊆ X → Y be a mapping from Banach space X to Banach space Y, where U(V) is a neighborhood of V. The mapping Ξ is Gateaux differentiable at V if and only if there exists a bounded linear operator G: X → Y such that Ξ(V + sW) - Ξ(V) = sG(W) + o(s), s → 0, for all W with ‖W‖ = 1 and all real numbers s in some neighborhood of zero, where lim_{s→0} o(s)/s = 0. The linear operator G is called the Gateaux derivative of Ξ at V; thus, G is calculated as

        G(W) = lim_{s→0} (Ξ(V + sW) - Ξ(V)) / s.   (15)

        Note that the Fréchet derivative at V equals the Gateaux derivative G if the Gateaux derivative G exists in some neighborhood of V and G is continuous at V. Now we turn to calculating the Fréchet derivative of F at P^l according to the following lemma:

        Lemma 1  Let F be the mapping defined in Eq. (13). Then the Fréchet derivative of F at P^l is given by

        F'_{P^l}(M) = A^T M A - M - Θ(M) ζ(P^l) Θ^T(P^l) - Θ(P^l) ζ(P^l) Θ^T(M) + Θ(P^l) ζ(P^l) Λ(M) ζ(P^l) Θ^T(P^l),   (16)

        where Θ(M) = [A^T M^T B, A^T M^T D] and Λ(M) = [B^T M B, B^T M D; D^T M B, D^T M D].

        Proof  First, we calculate the Gateaux derivative G at P^l. Note that (I + X)^{-1} = I - X + X^2 - X^3 + ··· holds for any X ∈ R^{n×n} if ρ(X) < 1 is satisfied. Select s such that s < ρ^{-1}(ζ(P^l) Λ(M)) is met. We then obtain

        where Δ1(s) is the higher-order term in s; in other words, lim_{s→0} Δ1(s)/s = 0_{n×n}. Combining Eqs. (13) and (17), we know that the Gateaux derivative G equals the right-hand side of Eq. (16), according to the definition of the Gateaux derivative in Eq. (15). Clearly, G is continuous with respect to M, because G is a linear function of M and the system matrices are constant. Therefore, the Fréchet derivative of F at P^l equals the Gateaux derivative G. The proof is completed.

        Employing the result of Lemma 1, we can now prove that Algorithm 1 is equivalent to Newton's method for finding the solution of the GARE (8) in the Banach space R^{n×n}.

        Theorem 1  Let T be the mapping defined in Eq. (14). Then the iteration between Eqs. (11) and (12) is equivalent to the following Newton method:

        P^{l+1} = T(P^l) = P^l - (F'_{P^l})^{-1} F(P^l).   (18)

        Proof  To prove Eq. (18), we just need to prove the equivalent form F'_{P^l}(P^{l+1} - P^l) = -F(P^l). From Eqs. (14) and (17), we obtain

        with Λ(P^l) defined as follows:

        Substituting Eq. (12) into Eq. (11) gives

        where Θ(P^{l+1}) = [A^T P^{l+1} B, A^T P^{l+1} D]. Combining Eqs. (19) and (20) then results in

        The proof is completed.

        According to Theorem 1, we conclude that P^l and (K1^l, K2^l) will converge to P and (K1*, K2*), respectively, as the iteration number l tends to infinity.

        Though Algorithm 1 provides a feasible scheme for solving zero-sum linear quadratic dynamic games by operating on reduced-order linear matrix equations, Eqs. (11) and (12) still depend on the system model, which makes Algorithm 1 sensitive to drift in the system dynamics and inaccuracy in system modeling.

        4 Online minimax Q-learning method based on off-policy reinforcement learning

        To develop an intelligent algorithm that can learn the saddle solution online without querying the system model, in this section we establish an online minimax Q-learning method by borrowing ideas from off-policy RL and adaptive control. We construct the minimax Q-learning method on the basis of Algorithm 1.

        4.1 Derivation of the online minimax Q-learning algorithm

        Let u^l = -K1^l x and w^l = -K2^l x be the given admissible policies at the l-th iteration in Algorithm 1.

        We define the following Q-function corresponding to u^l and w^l:

        Q^{l+1}(x_k, u_k, w_k) = x_k^T S x_k + u_k^T R u_k - γ^2 w_k^T w_k + x_{k+1}^T P^{l+1} x_{k+1},   (22)

        where u and w are the behavior policies adopted at time k. Thus, the state at time k+1 is determined by x_{k+1} = A x_k + B u_k + D w_k. From time k+1 on, one follows the target policies u^l and w^l. According to the definition of the Q-function, we know that Q^{l+1}(x_k, u_k, w_k) involves two types of policies, namely, the behavior policies u and w applied to system (1), and the target policies u^l and w^l, which are expected to converge to the saddle solution. In particular, Q^{l+1}(x_{k+1}, u^l_{k+1}, w^l_{k+1}) = x_{k+1}^T P^{l+1} x_{k+1}; therefore, Eq. (22) can be rewritten as the following Bellman equation:

        Q^{l+1}(x_k, u_k, w_k) = x_k^T S x_k + u_k^T R u_k - γ^2 w_k^T w_k + Q^{l+1}(x_{k+1}, u^l_{k+1}, w^l_{k+1}).   (23)

        We can now use Eq. (23) to calculate Q^{l+1} instead of solving Eq. (11) directly for P^{l+1}; clearly, Eq. (23) requires no information about A, B, and D. From Eq. (22), we can represent Q^{l+1} as a quadratic form of the state and inputs; equivalently, Q^{l+1} can be expressed as Q^{l+1}(x_k, u_k, w_k) = W_{c,l+1}^T σ_k, where σ_k is the Kronecker product quadratic polynomial basis vector corresponding to [x_k^T, u_k^T, w_k^T]^T, i.e., σ_k = σ([x_k^T, u_k^T, w_k^T]^T). Hence, Eq. (23) can be rewritten as follows:

        W_{c,l+1}^T σ_k = x_k^T S x_k + u_k^T R u_k - γ^2 w_k^T w_k + W_{c,l+1}^T σ_{k+1,l},   (24)

        where σ_{k+1,l} is the Kronecker product quadratic polynomial basis vector of [x_{k+1}^T, (u^l_{k+1})^T, (w^l_{k+1})^T]^T, that is, σ_{k+1,l} = σ([x_{k+1}^T, (u^l_{k+1})^T, (w^l_{k+1})^T]^T). Next, we aim to obtain the true weight W_{c,l+1} online in real time, using only the data samples generated by the behavior policies. This is essentially a prediction problem in RL and can be solved by TD learning techniques; from the perspective of adaptive control, it becomes an online parameter identification problem. Let Ŵ_{c,l+1}(i) be the estimate of W_{c,l+1} at time k, with i ≤ k. Replacing W_{c,l+1} with Ŵ_{c,l+1}(i) in Eq. (24) gives the estimation error:

        Recursive least squares (RLS) (Ioannou and Fidan, 2006) can now be used to estimate W_{c,l+1} online in real time:

        where l is a positive integer. The PE condition requires that the system state be persistently exciting for a long enough period of time.
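The recursion used here is the standard normalized RLS form from adaptive-control texts such as Ioannou and Fidan (2006); the sketch below illustrates that form on a synthetic identification problem and is not a verbatim transcription of Eqs. (26a) and (26b):

```python
import numpy as np

# Standard normalized recursive-least-squares recursion (illustrative sketch,
# not a verbatim copy of the paper's Eqs. (26a)-(26b)).
def rls_update(W_hat, Gamma, phi, target):
    """One RLS step: phi is the regressor vector, target the scalar measurement."""
    denom = 1.0 + phi @ Gamma @ phi
    e = target - W_hat @ phi                     # a priori estimation error
    W_next = W_hat + (Gamma @ phi) * (e / denom)
    Gamma_next = Gamma - np.outer(Gamma @ phi, phi @ Gamma) / denom
    return W_next, Gamma_next

# Identify a known weight vector from exciting (random) regressors.
rng = np.random.default_rng(0)
W_true = np.array([1.0, -2.0, 0.5])
W_hat = np.zeros(3)
Gamma = 100.0 * np.eye(3)                        # large initial covariance
for _ in range(2000):
    phi = rng.standard_normal(3)                 # persistently exciting regressors
    W_hat, Gamma = rls_update(W_hat, Gamma, phi, W_true @ phi)
```

In Algorithm 2, the regressor role is played by the basis difference built from σ_k and σ_{k+1,l}, and the target by the one-step utility.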

        To meet the PE condition above, we can inject some exploration noise into the behavior policies u and w. Note that the injected exploration noise will not cause any excitation noise bias here, whereas the excitation noise bias problem cannot be eliminated in on-policy methods (Kiumarsi et al., 2017). For the sake of explanation, we should confirm the fact that the Q-function Q^{l+1} is essentially a mapping from the state-input space to R; thus, we can use the behavior policies and the state samples generated by them to identify Q^{l+1} at each policy evaluation step. The exploration noises can be selected as harmonic signals containing sufficiently many frequencies, or as random noises. Because there exist no systematic methods for choosing exploration noises, one can choose them by trial and error.
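For example, a behavior policy with harmonic exploration noise might be sketched as follows; the harmonic count, amplitude, and frequency range are assumptions one would tune by trial and error, as the text suggests:

```python
import numpy as np

# Behavior policy = target policy + sum-of-sinusoids exploration noise.
# Harmonic count (10), amplitude (0.5), and frequency range are assumptions.
rng = np.random.default_rng(1)
omegas = rng.integers(-50, 51, size=10)   # random integer frequencies

def behavior_u(K1, x, k, amp=0.5):
    """Target policy -K1 x plus bounded harmonic exploration noise at time k."""
    noise = amp * np.sum(np.sin(omegas * k))
    return -K1 @ x + noise
```

The noise is bounded by amp times the harmonic count, so it excites the regressors without destabilizing the closed loop for small amplitudes.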

        After obtaining the Q-function Q^{l+1}, we carry out the policy improvement step by solving the following minimax optimization problem:

        where … . Denote … and … . Obviously, Eq. (27) can be reformulated as

        where Φ_a(k) = (1 + x_k^T x_k)^2 is the normalization term, and β is a small positive learning rate. During the learning process, the state samples are generated by the behavior policies u′ and w′ (that is, x_{k+1} = A x_k + B u′_k + D w′_k), and the tuning index j is increased along with the time index k. From Eqs. (29a) and (29b), we know that the controller and the disturbance perform gradient descent and gradient ascent, respectively. In the following theorem, we will prove that … converges to … exponentially, if … is persistently exciting and β is small enough.
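As an illustration of this gradient descent-ascent scheme, the sketch below tunes (K1, K2) on a hand-made quadratic Q-function Q(x, u, w) = z^T H z, z = [x^T, u^T, w^T]^T, using the normalization (1 + x^T x)^2 from the text; the matrix H, step size, and iteration count are our illustrative assumptions, and the update is a sketch in the spirit of Eqs. (29a) and (29b), not a verbatim copy:

```python
import numpy as np

def improve(H, K1, K2, x, beta, n, m1):
    """One normalized gradient descent (in K1) / ascent (in K2) step on Q."""
    u = -K1 @ x
    w = -K2 @ x
    z = np.concatenate([x, u, w])
    g = 2.0 * (H @ z)                         # gradient of z^T H z w.r.t. z
    gu, gw = g[n:n + m1], g[n + m1:]          # ∂Q/∂u and ∂Q/∂w
    phi = (1.0 + x @ x) ** 2                  # normalization term (1 + x^T x)^2
    # Since u = -K1 x, ∂Q/∂K1 = -gu x^T: descend in K1, ascend in K2.
    K1 = K1 + (beta / phi) * np.outer(gu, x)
    K2 = K2 - (beta / phi) * np.outer(gw, x)
    return K1, K2

# Hand-made H: Q strongly convex in u and strongly concave in w (n=2, m1=m2=1).
H = np.array([[1.0, 0.0, 0.5, 0.3],
              [0.0, 1.0, 0.2, 0.1],
              [0.5, 0.2, 2.0, 0.1],
              [0.3, 0.1, 0.1, -3.0]])
K1, K2 = np.zeros((1, 2)), np.zeros((1, 2))
rng = np.random.default_rng(0)
for _ in range(20000):
    x = rng.standard_normal(2)                # persistently exciting states
    K1, K2 = improve(H, K1, K2, x, beta=0.05, n=2, m1=1)
```

At stationarity gu = gw = 0, which gives [K1; K2] = [H_uu, H_uw; H_wu, H_ww]^{-1} [H_ux; H_wx]; for a small enough β the iteration converges to this saddle gain without ever inverting the value matrix inside the loop.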

        We now give the complete online minimax Q-learning algorithm.

        Compared with Algorithm 1, Algorithm 2 is completely model-free, and thus robust to drift in the system dynamics and inaccuracy in system modeling. In addition, both policy evaluation and policy improvement are carried out in an online adaptive way by using the state samples generated by the behavior policies. Note that this novel online policy improvement scheme provides a potential choice for other problems, e.g., optimal control and nonzero-sum games of discrete-time systems.

        Algorithm 2  Online minimax Q-learning algorithm
        1: Start with a set of initially stabilizing feedback gains (K1^1, K2^1)  // Initialization
        2: For the given stabilizing feedback gains (K̄1^l, K̄2^l), run Eqs. (26a) and (26b) until Ŵ_{c,l+1}(i+1) converges to W_{c,l+1}  // Policy evaluation
        3: Using the obtained W_{c,l+1}, run Eqs. (29a) and (29b) simultaneously until (K̂_{1,l+1}(j), K̂_{2,l+1}(j)) converges to (K̄1^{l+1}, K̄2^{l+1})  // Policy improvement
        4: Stop if ‖K_i^{l+1} - K_i^l‖ ≤ ε (i = 1, 2), where ε is a threshold; otherwise, set l = l + 1 and go to step 2

        4.2 Convergence analysis of the proposed online minimax Q-learning algorithm

        In the following, we give the convergence analysis of the proposed minimax Q-learning method. Before deriving the main theorem, we first provide two lemmas that will be used in the convergence analysis. The first lemma is taken from Ioannou and Fidan (2006) and is given as follows:

        Lemma 2 (Ioannou and Fidan, 2006)  Consider a time-varying linear discrete-time system y_{k+1} = C(k) y_k. Suppose that there exists a positive definite symmetric constant matrix M such that

        for some matrix sequence {N(k)} and all k. If (C(k), N(k)) is also uniformly completely observable (UCO), i.e., there exist constants α > 0, γ > 0, and l > 0 such that for all k,

        where Φ(k+i, k) = C(k+i-1) C(k+i-2) ··· C(k+1) C(k) is the transition matrix of the linear system, then y_k will converge to the origin exponentially.

        Before stating the next lemma, we introduce one useful property of the Kronecker product regarding matrix eigenvalues. Suppose that A and B are square matrices of sizes n and m, respectively. Let λ1, λ2, ..., λn be the eigenvalues of A and μ1, μ2, ..., μm be those of B. Then the eigenvalues of A ⊗ B are λ_i μ_j (i = 1, 2, ..., n; j = 1, 2, ..., m).
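This eigenvalue property is easy to verify numerically; the matrices below are arbitrary illustrative choices:

```python
import numpy as np

# Check: the eigenvalues of A ⊗ B are exactly the products λ_i(A) μ_j(B).
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])   # eigenvalues 2, 3
B = np.array([[1.0, 0.0],
              [0.0, 4.0]])   # eigenvalues 1, 4

eig_kron = np.sort(np.linalg.eigvals(np.kron(A, B)).real)
eig_prod = np.sort([lam * mu
                    for lam in np.linalg.eigvals(A).real
                    for mu in np.linalg.eigvals(B).real])
# Both spectra are {2, 3, 8, 12}
```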

        Lemma 3  Consider a time-varying linear discrete-time system given by z_{k+1} = (I - 2η(θ_k θ_k^T) ⊗ E) z_k, where {θ_k} is a sequence of bounded column vectors, η is a positive constant, and E is a positive definite matrix. Let θ_k be persistently exciting and η be small enough. Then z_k will converge to the origin exponentially.

        Proof  Let W(k) = I - 2η(θ_k θ_k^T) ⊗ E. Employing the Kronecker product, W(k) can be rewritten as …, with …. We first prove that θ̄_k is also persistently exciting. As θ_k is persistently exciting, there exist α1 > 0, α2 > 0, and l > 0 such that α1 I ≤ … ≤ α2 I. Considering that …, we have …. Using the property of the Kronecker product on the matrix eigenvalues, we obtain λ_min(E) α1 I ≤ …. Let M = I and …. Then we have

        Next, we prove that (W(k), N(k)) is UCO. Consider the following system:

        Clearly, system (31) is equivalent to the following system:

        with output feedback

        So, we can prove that system (32) is UCO instead. Because θ_k is bounded, … is also bounded; thus, there exists a > 0 such that … ≤ a is satisfied for all k. Let η ≤ 1/(2a). We have 1 ≤ 2 - 2η… ≤ 2; therefore,

        We are now ready to state and prove the following theorem:

        Theorem 2  Let … and … be persistently exciting and let the learning rate β satisfy

        Proof  First, we prove that … will converge to …, if … is persistently exciting and the learning rate β satisfies inequality (34). From Eq. (28), we know that … reaches a saddle point at …, observing that Q^{l+1} is convex in K1 and concave in K2. The first-order necessary condition implies

        From the definition of …, Eqs. (35a) and (35b) can be rewritten as follows:

        where

        Define the following estimation errors: … and …. Combining Eqs. (36a) and (36b) with Eqs. (37a) and (37b) gives the following error dynamics:

        where

        According to the properties of the Kronecker product, we know that both … and … are also persistently exciting if …(k) is persistently exciting. Define the following matrices:

        Let the learning process start at k0, i.e., j = k - k0 + 1. Let

        Using the result from Lemma 3, we know that the following time-varying system

        is exponentially stable if the learning rate β is selected such that β ≤ min(1/(2a1), 1/(2a2)) is satisfied, where …. Therefore, there exist γ(k0) > 0 and λ ∈ (0, 1) such that …, where … is the state transition matrix of system (40). Clearly, …(k), B, D, and P^{l+1} are bounded; thus, there exists a positive constant θ such that ‖Δ(k)‖ ≤ βθ is satisfied for all k. We now rewrite Eq. (39) as

        Then Z(j) can be determined as

        Taking the Frobenius norm on both sides of Eq. (42), we have

        Let T(j) = Z(j) λ^{-(j+1)}. Inequality (43) can be rewritten as

        Employing the Gronwall inequality, inequality (44) gives

        Taking logarithms on both sides of inequality (45) yields

        Note that ln(1 + x) ≤ x holds for any x ≥ 0, and from inequality (46), we can further obtain

        Substituting T(j) = Z(j) λ^{-(j+1)} into inequality (47) yields

        Let … be the transition matrix of Eq. (39). Clearly,

        where e_k (k = 1, 2, ..., n(m1+m2)) is the k-th column of I_{n(m1+m2)}. According to inequality (48), we obtain

        Let the learning rate β be selected to satisfy inequality (34). Then we have …, which means that system (39) is also exponentially stable. Therefore, … will converge to … exponentially. If, further, … is persistently exciting, Ŵ_{c,l+1}(i) will converge to W_{c,l+1}. Using the result of Theorem 1, we know that … will converge to the saddle feedback gains (-K1*, -K2*), if … and … are persistently exciting and the learning rate β satisfies inequality (34). The proof is completed.

        5 Simulation study

        In this section, we use Algorithm 2 to design an H∞ load-frequency controller for an electrical power system generator that suffers from load disturbance.

        Consider the following fourth-order discrete-time electrical power system:

        where x_k = [x_{k1}, x_{k2}, x_{k3}, x_{k4}]^T (x_{k1} denotes the incremental frequency deviation, x_{k2} the incremental change in generator output, x_{k3} the incremental change in governor position, and x_{k4} the incremental change in integral control), and w_k is the load disturbance. The initial state is set to x_0 = [4, 3, -1.5, 2.5]^T. The system matrices are given as follows:

        that is, the disturbance attenuation level is set to γ = 3.

        Employing Algorithm 1, we obtain the optimal feedback gains for the controller and the disturbance:

        Now we apply the minimax Q-learning method developed in Section 4 to solve for the H∞ load-frequency controller. Note that the system matrices A, B, and D are not needed to design the controller; they are used only to simulate the system. The initial admissible feedback gains are selected as K1^1 = K2^1 = [0, 0, 0, 0]. The learning rate is selected as β = 0.1. The threshold used to stop the algorithm is set to ε = 10^{-4}. The state samples used for RLS tuning at each policy evaluation step are generated by the behavior policies … and …. The state samples used for gradient tuning at each policy improvement step are generated by the behavior policies …, where ω_i is an integer randomly generated from [-50, 50]. In fact, one can choose from a variety of behavior policies, as long as the behavior policies are such that both … and … are persistently exciting. At each policy evaluation step, we carry out 5000 tuning steps, while 3000 tuning steps are carried out at each policy improvement step. After eight iterations, convergence of Algorithm 2 is attained, and the convergent values are given as follows:

        Fig. 1 Evolution of the controller feedback gain in the policy-iteration-based minimax Q-learning method (References to color refer to the online version of this figure)

        Fig. 2 Evolution of the disturbance feedback gain in the policy-iteration-based minimax Q-learning method (References to color refer to the online version of this figure)

        Fig. 3 State evolution of system (51) by implementing u_k = (K̄1^8)^T x_k under disturbance w_k = 5 exp(-0.16k)

        Fig. 4 State evolution of system (51) by implementing u_k = (…)^T x_k under the worst-case disturbance w_k = (…)^T x_k

        Fig. 6 Evolution of the disturbance feedback gain K2^l in the value-iteration-based Q-learning method, where K2^l = [K2,1^l, K2,2^l, K2,3^l, K2,4^l] (References to color refer to the online version of this figure)

        For comparison, we apply the value-iteration-based Q-learning method (Al-Tamimi et al., 2007; Rizvi and Lin, 2018; Valadbeigi et al., 2020) to solve for the H∞ load-frequency controller. The initial value matrix is chosen as H = I_{6×6}, and the initial feedback gains for the controller and the disturbance are selected as K1^1 = [0, 0, 0, 0] and K2^1 = [0, 0, 0, 0], respectively. After 50 iterations, convergence of the value-iteration-based Q-learning method is attained, with the convergent values given as follows:

        Clearly, (K1^50, K2^50) is also close to the saddle feedback gains (-K1*, -K2*). Figs. 5 and 6 show the convergence of K1^l and K2^l, respectively. It is observed that both the policy-iteration-based minimax Q-learning method and the value-iteration-based Q-learning method converge to the saddle solution. Obviously, compared with the value-iteration-based Q-learning method, the policy-iteration-based minimax Q-learning method takes far fewer iterations to converge.

        6 Conclusions

        The H∞ control problem for linear discrete-time systems has been investigated in this paper. A policy-iteration-based minimax Q-learning method has been developed to learn the H∞ controller online by using the state samples generated by the behavior policies, without querying the system model. By employing a normalized gradient method, a novel policy improvement scheme has been proposed. A rigorous convergence analysis of the proposed minimax Q-learning method has been established under certain persistence of excitation conditions and learning rate constraints. In addition, the excitation noise bias problem has been overcome. The simulation results demonstrated the good disturbance rejection capacity of the obtained H∞ controller. In future work, we will explore Q-learning approaches for H∞ control of nonlinear discrete-time systems.

        Fig. 5 Evolution of the controller feedback gain K1^l in the value-iteration-based Q-learning method, where K1^l = [K1,1^l, K1,2^l, K1,3^l, K1,4^l] (References to color refer to the online version of this figure)

        Contributors

        Xinxing LI and Lele XI designed the research, conducted the investigation, and drafted the paper. Wenzhong ZHA and Zhihong PENG supervised the research, helped organize the paper, and revised and finalized the paper.

        Compliance with ethics guidelines

        Xinxing LI, Lele XI, Wenzhong ZHA, and Zhihong PENG declare that they have no conflict of interest.
