

        Reinforcement learning based parameter optimization of active disturbance rejection control for autonomous underwater vehicle


SONG Wanping1, CHEN Zengqiang1,2,*, SUN Mingwei1, and SUN Qinglin1

        1.College of Artificial Intelligence, Nankai University, Tianjin 300350, China;2.Key Laboratory of Intelligent Robotics of Tianjin, Nankai University, Tianjin 300350, China

Abstract: This paper proposes a linear active disturbance rejection control (LADRC) method based on the Q-learning algorithm of reinforcement learning (RL) to control the six-degree-of-freedom motion of an autonomous underwater vehicle (AUV). The number of controllers is increased to realize AUV motion decoupling. At the same time, in order to avoid an oversized algorithm, a simplified Q-learning algorithm is constructed in combination with the controlled content to realize the parameter adaptation of the LADRC controller. Finally, through simulation experiments on the fixed-parameter controllers and the controller based on the Q-learning algorithm, the rationality of the simplified algorithm, the effectiveness of the parameter adaptation, and the unique advantages of the LADRC controller are verified.

        Keywords: autonomous underwater vehicle (AUV), reinforcement learning (RL), Q-learning, linear active disturbance rejection control (LADRC), motion decoupling, parameter optimization.

1. Introduction

Underwater robots play a very important role in the development of marine resources and the protection of marine rights and interests. Autonomous underwater vehicles (AUVs) have been widely used in marine research and national security [1-4]. In recent years, AUVs have been successfully applied to complex underwater tasks such as seabed imaging and seabed mapping [5,6]. The motion of an AUV has six degrees of freedom. Because each operation on the AUV affects its various degrees of freedom to different extents, the AUV is strongly coupled and strongly nonlinear. In addition, the complex dynamics of the underwater environment makes it more difficult to control AUVs [7]. Therefore, it is of practical significance to control the movement of AUVs in accordance with the required performance.

Many control methods from classical control theory, modern control theory, and intelligent control theory have been applied to the motion control of AUVs, for example, proportional-integral-derivative (PID) control, sliding mode control, fuzzy control, adaptive control, and many combinations of these methods [8-11]. PID control is a feedback control based on error signals and is currently one of the main AUV control methods. However, when PID control is applied to a system with strong nonlinearity and strong coupling, the dynamic performance of the system is poor and the overshoot is large. Han proposed the active disturbance rejection control (ADRC) method in the 1990s [12,13]. On this basis, Gao proposed a linear active disturbance rejection control (LADRC) method [14], which greatly reduced the number of parameters of the ADRC controller and made the whole system easy to tune and apply. ADRC has been applied to the control problems of fighter aircraft high-angle-of-attack tracking [15], ship course control [16], and the power system [17], demonstrating its superior control performance.

A good controller should have a certain degree of adaptive ability in resisting disturbances while maintaining good control performance [18,19], so the selection of controller parameters has been the focus of many experts and scholars. At present, many algorithms have been used to tune controller parameters; for example, an adaptive controller combined with a fuzzy control algorithm can realize parameter self-adjustment [10,20,21]. However, the establishment of fuzzy rules depends on professional experience and the model, so the application scope of fuzzy control is limited. Reinforcement learning has been widely used in artificial intelligence and machine learning [22,23]. Control algorithms based on reinforcement learning can optimize control strategies by interacting with unknown environments. The temporal-difference (TD) method in reinforcement learning is a model-independent reinforcement learning algorithm. The algorithm updates strategies by updating the value function, and the new state and immediate reward generated after the execution of the strategy are used to update the value function again. The TD method includes the on-policy Sarsa algorithm and the off-policy Q-learning algorithm [24]. The effectiveness of the Q-learning algorithm has been verified in many fields [16,17,25].

At present, in most control method research, the AUV model is decoupled in advance [23,26], which reduces the authenticity of the controlled model. Therefore, in this paper the LADRC controller based on the Q-learning algorithm is used to control the six-degree-of-freedom AUV model, and the related structure of the controller and the Q-learning algorithm are designed. Through Matlab simulation experiments, the control effect of the new controller is compared with that of the PID and LADRC controllers with fixed parameters. The results show that the LADRC controller based on the Q-learning algorithm can achieve a better control effect.

        The main contributions of this paper are summarized as follows:

(i) The LADRC controller is adopted to stabilize the AUV system.

(ii) Controllers are added for motion decoupling: two LADRC controllers are used to control AUV motion in the yaw and pitch planes.

(iii) The Q-learning algorithm is applied to realize parameter self-adaptation of the LADRC controller.

(iv) The state division and reward design of the Q-learning algorithm are constructed for the controlled content. The scale of the algorithm is simplified and the effectiveness of the algorithm is guaranteed.

2. Motion and modeling of AUV

        2.1 Coordinate frames and rigid body dynamics equation

The six-degree-of-freedom motion equations of an AUV can be described using the earth-fixed coordinate frame and the body-fixed coordinate frame shown in Fig.1, both of which are right-handed. The origin of the body-fixed coordinate frame is located at the AUV's center of buoyancy.

        Fig.1 Coordinate frames and motion parameters

The motion of an AUV can be described by the following vectors:

where η describes the position and orientation of the AUV in the earth-fixed coordinate frame, υ describes the linear and angular velocities of the AUV, and τ describes the total forces and moments acting on the AUV in the body-fixed coordinate frame. The meanings of the symbols are summarized in Table 1.
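In the notation that is standard for underwater vehicles (the individual symbol meanings are the ones listed in Table 1), these vectors are commonly written as

\[
\eta=[x,\;y,\;z,\;\phi,\;\theta,\;\psi]^{\mathrm{T}},\qquad
\upsilon=[u,\;v,\;w,\;p,\;q,\;r]^{\mathrm{T}},\qquad
\tau=[X,\;Y,\;Z,\;K,\;M,\;N]^{\mathrm{T}}.
\]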

        Table 1 Symbols and their meanings

        The coordinate transformation of the translational velocity between earth-fixed and body-fixed coordinate frames can be expressed as

        where

        The coordinate transformation of the rotational velocity between two coordinate systems can be expressed as

        where
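The explicit transformation matrices are the standard Euler-angle ones. Assuming the usual partitions η₁ = [x, y, z]ᵀ, η₂ = [φ, θ, ψ]ᵀ, υ₁ = [u, v, w]ᵀ, υ₂ = [p, q, r]ᵀ, and denoting the two transformation matrices by J₁ and J₂ (this notation is assumed here rather than taken from the original), the two kinematic relations are usually written as

\[
\dot{\eta}_1 = J_1(\eta_2)\,\upsilon_1,\quad
J_1(\eta_2)=
\begin{bmatrix}
\cos\psi\cos\theta & \cos\psi\sin\theta\sin\phi-\sin\psi\cos\phi & \cos\psi\sin\theta\cos\phi+\sin\psi\sin\phi\\
\sin\psi\cos\theta & \sin\psi\sin\theta\sin\phi+\cos\psi\cos\phi & \sin\psi\sin\theta\cos\phi-\cos\psi\sin\phi\\
-\sin\theta & \cos\theta\sin\phi & \cos\theta\cos\phi
\end{bmatrix},
\]
\[
\dot{\eta}_2 = J_2(\eta_2)\,\upsilon_2,\quad
J_2(\eta_2)=
\begin{bmatrix}
1 & \sin\phi\tan\theta & \cos\phi\tan\theta\\
0 & \cos\phi & -\sin\phi\\
0 & \sin\phi/\cos\theta & \cos\phi/\cos\theta
\end{bmatrix}.
\]

Note that J₂ becomes singular at θ = ±90°, a standard limitation of the Euler-angle description.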

        The positions of the AUV centers of gravity and buoyancy are defined in the body-fixed coordinate frame as follows:

According to the theory of rigid body dynamics, the motion equations of a six-degree-of-freedom rigid body expressed in the body-fixed coordinate frame are as follows:

where m is the AUV's mass and Ix, Iy, Iz are the moments of inertia of the AUV about the three coordinate axes.

2.2 Force and motion equation

At present, many submarine motion equations are in use around the world, but they differ only in mathematical description and mathematical processing method. The AUV model referred to in this paper is the remote environmental monitoring units (REMUS) autonomous underwater vehicle [27]. The total forces and moments acting on the AUV can be expressed as follows:

where XHS, YHS, ZHS, KHS, MHS, NHS are the hydrostatic forces and moments; Xu|u|, Yv|v|, Yr|r|, Zw|w|, Zq|q|, Kp|p|, Mw|w|, Mq|q|, Nv|v|, Nr|r| are the hydrodynamic damping coefficients; Yuv, Yuuδr, Zuw, Zuuδs, Muw, Muuδs, Nuv, Nuuδr are the lift and lift-moment coefficients of the body and control fins; Xprop, Kprop are the propeller thrust and torque; δs and δr are the AUV's pitch fin angle and rudder angle. The remaining coefficients are added-mass coefficients.

Substituting (4) into the right-hand side of (3) and rearranging so that only the acceleration terms remain on the left-hand side, the nonlinear equations of motion are obtained as

        2.3 AUV system model

The attitude of the REMUS vehicle is controlled by horizontal and vertical fins. The horizontal fins control the pitch fin angle δs so that the vehicle can perform pitching motion, and the vertical fins control the rudder angle δr to control the heading motion of the vehicle. In addition, this paper assumes that the propeller speed is constant at 1 500 rpm and that the REMUS vehicle maintains a speed of 1.51 m/s [27].

As can be seen from Fig.2, taking depth control as an example, the depth set value is used as the controller input to obtain the appropriate control quantity. The inputs of the AUV motion control are the fin angle δs, the rudder angle δr, and the propeller thrust Xprop.

        Fig.2 AUV system work flowchart

3. LADRC controller

LADRC does not rely on an accurate mathematical model. It treats the various uncertain factors in the controlled object as a total disturbance and uses a linear extended state observer (LESO) to estimate and eliminate this disturbance, thereby suppressing its influence [13]. The LADRC controller for an nth-order system is shown in Fig.3.

        Fig.3 LADRC basic control structure

where f is the total disturbance. Set the state variables x1 = y and x2 = f; then x2 is the extended state including the disturbance. Equation (6) is transformed into the description of the extended state space.
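For a first-order plant written as ẏ = f + b₀u, this extended state space takes the standard first-order LADRC form

\[
\begin{cases}
\dot{x}_1 = x_2 + b_0 u,\\
\dot{x}_2 = \dot{f},\\
y = x_1,
\end{cases}
\]

so that the observer introduced next can estimate both the output and the total disturbance.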

Construct an extended state observer for (7) to estimate the extended state x2 [28] as

With a well-tuned LESO, we can obtain the estimate of the second state in (7). If the controller adopts the following form:

then (6) will be simplified as an integrator without dynamic uncertainty

Then a simple P control can be employed as

where ωc is the controller bandwidth, and r is the given value of the system.

Finally, (8), (9), and (11) are combined into a LADRC controller for first-order systems.
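As an illustration of how the LESO (8), the disturbance-compensating control law (9), and the P control (11) fit together, the following minimal Python sketch implements one first-order LADRC loop using the common bandwidth parameterization β1 = 2ωo, β2 = ωo². The toy plant, disturbance, time step, and parameter values are illustrative assumptions, not those of the REMUS simulation.

import math

def ladrc_first_order_step(r, y, u_prev, z1, z2, wo, wc, b0, dt):
    """One step of a first-order LADRC: LESO update followed by the control law.

    z1 estimates the output y and z2 estimates the total disturbance f.
    The observer gains use the common bandwidth parameterization
    beta1 = 2*wo, beta2 = wo**2, so the LESO is tuned by wo alone.
    """
    beta1, beta2 = 2.0 * wo, wo ** 2
    e = y - z1
    # Linear extended state observer, discretized with forward Euler
    z1_next = z1 + dt * (z2 + beta1 * e + b0 * u_prev)
    z2_next = z2 + dt * (beta2 * e)
    # Simple P control on the estimated output, then disturbance compensation
    u0 = wc * (r - z1_next)
    u = (u0 - z2_next) / b0
    return u, z1_next, z2_next

# Toy closed-loop run on an assumed first-order plant y_dot = f + b0*u.
b0, dt = 2.0, 0.01
y, u, z1, z2 = 0.0, 0.0, 0.0, 0.0
for k in range(2000):
    t = k * dt
    f = 0.5 * math.sin(0.5 * t)                 # assumed external disturbance
    u, z1, z2 = ladrc_first_order_step(r=1.0, y=y, u_prev=u, z1=z1, z2=z2,
                                       wo=20.0, wc=5.0, b0=b0, dt=dt)
    y = y + dt * (f + b0 * u)                   # integrate the toy plant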


4. LADRC controller based on reinforcement learning

In recent years, reinforcement learning has attracted extensive attention. For a sequential decision-making process with the Markov property, the strategy is constantly updated and optimized through the interaction between the agent and the environment, finally maximizing the value.

        4.1 Q-learning

Given the five elements of reinforcement learning [24]: the action set A, the state set S, the reward R, the attenuation factor γ, and the exploration rate ε, solve for the optimal action-value function q* and the optimal strategy π*. The Q-learning algorithm has two strategies:

(i) Greedy strategy

        Q-learning uses the greedy strategy to update the value function as follows:
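In its standard form, this greedy update of the action-value function is

\[
Q(s,a)\leftarrow Q(s,a)+\alpha\Big[R+\gamma\max_{a'}Q(s',a')-Q(s,a)\Big],
\]

where α is the learning rate and s′ is the state reached after executing action a in state s.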

(ii) ε-greedy strategy

The ε-greedy strategy is adopted to select new actions. By setting a value ε, the action that currently has the greatest action value is greedily selected with probability 1 − ε, while an action is selected at random from all m optional actions with probability ε.

Q-learning uses this strategy to encourage exploration in action selection, so that as many actions as possible can be accessed. The steps of the Q-learning algorithm are as follows:

Step 1  Algorithm initialization: state set S, action set A, learning rate α, attenuation factor γ, exploration rate ε.

Step 2  Initialize state s ∈ S.

Step 3  Use the ε-greedy strategy to select action a in the current state.

Step 4  Perform action a in current state s to get new state s′ and reward R.

Step 5  Update the value function.

Step 6  Learning ends when the termination condition is reached; otherwise, return to Step 3.
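A minimal tabular sketch of this loop in Python is given below; the environment interface (env.reset() and env.step()) and the hyperparameter values are illustrative assumptions rather than the AUV simulator used in this paper.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=1500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior strategy."""
    Q = np.zeros((n_states, n_actions))            # Step 1: initialize the Q-table
    for _ in range(episodes):
        s = env.reset()                            # Step 2: initialize the state
        done = False
        while not done:
            # Step 3: epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            # Step 4: execute the action, observe the new state and reward
            s_next, r, done = env.step(a)
            # Step 5: greedy (off-policy) value-function update
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next                             # Step 6: continue until termination
    return Q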

        The complete algorithm flowchart is shown in Fig.4.

        Fig.4 Q-Learning algorithm flowchart

        4.2 Q-learning algorithm design

In this subsection, based on the established six-degree-of-freedom AUV model, the Q-learning algorithm is combined with LADRC to design the adaptive LADRC controller.

In order to implement parameter self-adaptation using the Q-learning algorithm, the dynamic parameter adjustment process is treated as the action selection process in the Q-learning algorithm. Therefore, reasonable state division and the design of the reward function R become important. There are two main considerations in the controller design:

(i) The coupling between AUV heave motion and yaw motion.

(ii) As the number of state types to be divided and the number of control parameters increase, the dimension of the state set S increases and the Q-table becomes larger, which leads to an increased amount of computation in the learning process.

To address the first problem, an AUV yaw controller is added in this paper so that the AUV can maintain course stability during the sinking process.

There are two main solutions to the second problem. The first is to reduce the number of control parameters. According to (8), (9), and (11), the parameters to be adjusted by the LADRC controller are ωc, ωo, and b0. It is worth mentioning that in the simulation experiment the parameter b0 can be approximated from the model. For an AUV system without time delay, b0 can take an approximation of its actual value while the LESO still works normally [29,30]. Therefore, the parameters to be adjusted in the adaptive LADRC controller are simplified to ωc and ωo, and b0 is fixed according to model calculation and experience. Finally, the structure of the LADRC controller based on the Q-learning algorithm is shown in Fig.5, where the depth controller and the yaw controller each have their own pair of parameters ωc and ωo.

        Fig.5 AUV control system based on Q-Learning

The second is to design the state division method. In order to avoid doubling the dimension of the controlled state set S caused by the dual controllers, this paper constructs a Q-learning state division method based on the main controlled states, together with a reward design method that is not limited to the errors of the divided states. Taking the AUV sinking depth and the attitude angle θ as the main controlled states, the division of the states is shown in Table 2, with a total of 25 states. The depth error e is defined as e = depth − state, where "depth" is the set value of 10 m and "state" is the real-time depth of the AUV. The main function of this division is reflected in the "Initialize state s" and "get new state s′" processes on the left side of Fig.4. The yaw motion of the AUV is taken as the secondary controlled state. Its state error, together with the main controlled states, participates in the reward design in the value function. This process is embodied in the "calculate reward R" step on the left of Fig.4.

        Table 2 Division of the states
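As a concrete illustration of this design, the sketch below discretizes the two main controlled variables into 5 × 5 = 25 states and builds a reward from the main-state errors plus the yaw error. The bin boundaries and the reward weights are hypothetical placeholders rather than the values used in Table 2 and in the paper's reward function.

import numpy as np

# Hypothetical bin edges for the depth error e = depth - state and the
# pitch angle theta (rad); the actual boundaries of Table 2 are not used here.
E_BINS = np.array([-0.5, -0.1, 0.1, 0.5])          # splits e into 5 intervals
THETA_BINS = np.array([-0.2, -0.05, 0.05, 0.2])    # splits theta into 5 intervals

def state_index(e, theta):
    """Map the two main controlled variables to one of the 5 x 5 = 25 states."""
    return int(np.digitize(e, E_BINS)) * 5 + int(np.digitize(theta, THETA_BINS))

def reward(e, theta, psi_err, w=(1.0, 0.5, 0.5)):
    """Reward built from the main-state errors plus the (undivided) yaw error.

    The yaw error psi_err only enters the reward, not the state division,
    mirroring the design described above; the weights w are assumptions.
    """
    return -(w[0] * abs(e) + w[1] * abs(theta) + w[2] * abs(psi_err))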

Then, the four-dimensional parameter space formed by ωc and ωo of the depth controller and the corresponding pair of the yaw controller is established. The parameter selection range here is

        in total 2 401 parameter combinations are available.

After the above state and parameter division, the Q-table size of the LADRC controller based on the Q-learning algorithm (Q-LADRC) is 2 401×25. Similarly, since there are six parameters to be adjusted for the dual PID controller, the Q-table size of the PID controller based on the Q-learning algorithm (Q-PID) is 117 649×25.
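Since 2 401 = 7⁴ (and 117 649 = 7⁶), each tunable parameter apparently takes seven candidate values. The following sketch shows how such an action set and Q-table could be assembled; the candidate values themselves are placeholders, not the ranges used in the paper.

from itertools import product
import numpy as np

# Placeholder candidate values: seven options per parameter (values assumed).
WC_DEPTH = WO_DEPTH = WC_YAW = WO_YAW = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0]

# Each action is one combination (wc_depth, wo_depth, wc_yaw, wo_yaw).
ACTIONS = list(product(WC_DEPTH, WO_DEPTH, WC_YAW, WO_YAW))
assert len(ACTIONS) == 7 ** 4 == 2401

# Q-table for the Q-LADRC controller: 2 401 actions x 25 states.
Q_TABLE = np.zeros((len(ACTIONS), 25))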

5. Simulation results analysis

Q-learning is one of the most widely used reinforcement learning methods. Therefore, the Q-learning algorithm is first compared with another reinforcement learning algorithm in this section, which demonstrates its advantages in some aspects. In order to verify that the controller parameter changes caused by AUV state changes improve the control performance during sinking and under external disturbance, this section simulates the AUV's sinking motion and adds a set-value disturbance to verify the disturbance rejection performance of the controller based on the Q-learning algorithm. In addition, the LADRC method is compared with the PID method to verify the superiority of the LADRC method in some aspects.

        5.1 Comparison between Q-learning algorithm and Sarsa algorithm

In addition to the off-policy Q-learning algorithm, the temporal-difference method in reinforcement learning also includes the on-policy Sarsa algorithm. The Sarsa algorithm adopts the ε-greedy strategy in both the value function update and the action selection. For the comparative experiments, the structural design and simplification of the Sarsa algorithm are the same as those of the Q-learning algorithm in Subsection 4.2, and are not repeated here.
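For reference, the two temporal-difference updates differ only in how the next action enters the target: Sarsa bootstraps on the action a′ actually selected by the ε-greedy behavior strategy, whereas Q-learning bootstraps on the greedy maximum,

\[
\text{Sarsa:}\quad Q(s,a)\leftarrow Q(s,a)+\alpha\big[R+\gamma Q(s',a')-Q(s,a)\big],
\]
\[
\text{Q-learning:}\quad Q(s,a)\leftarrow Q(s,a)+\alpha\big[R+\gamma\max_{a'}Q(s',a')-Q(s,a)\big].
\]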

Because the Sarsa algorithm is relatively conservative in updating the value function, its convergence speed is slower. Fig.6 shows the length of each episode for the Q-LADRC controller and the Sarsa-based LADRC controller (S-LADRC) over 1 500 episodes.

        Fig.6 Length of each episode

Due to the randomness of the AUV state at the beginning of each training run and the complexity of the AUV motion, individual episodes may take a long time. Excluding these influencing factors, it can be seen from Fig.6 that, because the Sarsa algorithm updates the value function with the partly random ε-greedy strategy, most of its episodes are longer in the later stage of convergence.

Increasing the number of episodes, that is, the amount of training, enables S-LADRC to achieve a control effect similar to that of Q-LADRC. As shown in Fig.7, when the number of training episodes of the S-LADRC controller reaches 4 500, it has a depth control effect similar to that of the Q-LADRC controller after 1 500 episodes of training.

        Fig.7 Depth control effects of two controllers with different training times

The Q-learning algorithm tends to maximize the Q value, while the Sarsa algorithm can avoid errors to a certain extent. Sarsa has a slower convergence speed, but it can improve the training effect by increasing the number of training episodes.

In the AUV simulation experiments, it is found that the rapidity and smoothness of the AUV sinking motion cannot be satisfied at the same time. Therefore, a Q-LADRC controller that makes the AUV motion smoother after Q-learning training is adopted in the subsequent simulation experiments.

5.2 Fixed-parameter controllers and the controller based on the Q-learning algorithm

Fig.8(a) and Fig.8(b) compare the control effects of the fixed-parameter PID and LADRC controllers and the Q-LADRC controller. Fig.8(c) shows the parameter changes caused by the state changes of the AUV when using the Q-LADRC controller.

Fig.8 Fixed-parameter PID and LADRC controllers and the Q-LADRC controller

It can be seen from Fig.8 that the LADRC method is effective in AUV motion control. The PID controller can quickly generate the control quantity to meet the requirements of the system, but while it satisfies the speed requirement, its control effect in the final steady state of the system is deficient. Compared with the PID control method, LADRC gives the AUV higher motion stability.

In addition, the data show that the final depth and yaw errors using the Q-LADRC controller are both less than 10^-3 m. It can be seen from the data and figures that adjusting the parameters in real time according to the state has a positive impact on the control effect. At the same time, compared with the fixed-parameter controllers, the Q-LADRC controller gives the AUV smaller yaw movement during sinking. Therefore, although the yaw motion state of the AUV is not divided in the learning process of the Q-table, the Q-learning algorithm can update the action-value function according to the reward that includes the yaw error, and thus successfully find the parameters of the yaw Q-LADRC controller. This proves that the state division method and reward design of the constructed Q-learning algorithm are reasonable and effective.

The Q-PID controller can achieve a control effect similar to that of the Q-LADRC controller in AUV sinking motion control, which is not repeated here. However, the contradiction between its rapidity and stability still exists when it comes to disturbance rejection.

        5.3 Change of set value

In order to compare the control effects of the controllers in the face of abrupt state changes, set-value changes, and other uncertain factors, based on the AUV sinking control in the previous section, the depth setting is changed from 10 m to 11 m between 80 s and 120 s. The simulation studies the control effects of the fixed-parameter LADRC, Q-LADRC, and Q-PID controllers.

When the AUV resists the set-value disturbance, proper parameter adjustment can reduce the overshoot and oscillation of the AUV while keeping the control quantity in a reasonable range, as shown in Fig.9(a), Fig.9(b), and Fig.9(c). The fin angle of the AUV using the Q-PID controller changes quickly and responds quickly to the system. However, the disadvantages of the PID controller remain: it causes system oscillation and severe overshoot due to the excessive initial control force, and it takes a long time for the AUV to stabilize around the set value. Fig.9(d) shows the parameter changes of the Q-LADRC controller while the AUV follows the set value.

Fig.9 Fixed-parameter LADRC, Q-PID, and Q-LADRC controllers

6. Conclusions

It is an important research topic to design a control method that gives the AUV excellent motion performance. In this paper, while the LADRC controllers are used to decouple the AUV motion, the adaptive adjustment of the controller parameters is realized by combining them with a reinforcement learning algorithm. Simplifying part of the structure of the reinforcement learning algorithm avoids the "curse of dimensionality" to a certain extent. In a system whose state changes continuously, constant adjustment of the controller parameters is beneficial to the final stability of the system. At the same time, simulation experiments verify the effectiveness of the constructed Q-LADRC controller in AUV motion control. Although the value function update of the Q-learning algorithm is relatively risky, the algorithm has a faster convergence speed and a lower time cost. Compared with the fixed-parameter controller, the AUV using the Q-LADRC controller has lower overshoot and better motion performance in disturbance rejection. By comparing the control effects of the PID and LADRC controllers, it is found that for slowly changing control objects such as AUVs, when higher control accuracy and stability are required, the LADRC method has higher applicability than the PID method.
