

        Deep reinforcement learning and its application in autonomous fitting optimization for attack areas of UCAVs


LI Yue, QIU Xiaohui, LIU Xiaodong, and XIA Qunli

1. School of Aerospace Engineering, Beijing Institute of Technology, Beijing 100081, China; 2. Science and Technology on Electro-Optic Control Laboratory, Luoyang 471000, China; 3. Beijing Aerospace Automatic Control Research Institute, Beijing 100854, China

Abstract: The ever-changing battle field environment requires the use of robust and adaptive technologies integrated into a reliable platform. Unmanned combat aerial vehicles (UCAVs) aim to integrate such advanced technologies while increasing the tactical capabilities of combat aircraft. As a research object, the common UCAV uses a neural network fitting strategy to obtain values of attack areas. However, this simple strategy cannot cope with complex environmental changes or autonomously optimize decision-making. To solve this problem, this paper proposes a new deep deterministic policy gradient (DDPG) strategy based on deep reinforcement learning for the attack area fitting of UCAVs in the future battle field. Simulation results show that the autonomy and environmental adaptability of UCAVs in the future battle field are improved by the new DDPG algorithm and that the training process converges quickly. With the well-trained deep network, the optimal values of attack areas can be obtained in real time during the whole flight.

Key words: attack area, neural network, deep deterministic policy gradient (DDPG), unmanned combat aerial vehicle (UCAV).

1. Introduction

In the future battle field environment, preserving air superiority is very important, and unmanned combat aerial vehicles (UCAVs) joining air combat will realize truly zero casualties for pilots [1–4]. UCAVs will be widely used in future air combat. Combined with advanced technologies, future UCAVs will be able to operate in highly contested airspace in pre-emptive and reactive roles such as suppression and destruction of enemy air defences, penetrating surveillance, and strike of high-value targets [5,6]. Furthermore, future capabilities are likely to extend the operational speed and maneuverability requirements to include supersonic regimes [7,8]. In the environment of high altitude and high speed, improving the accuracy of UCAVs' attack area fitting and the autonomous decision-making ability to adapt to environmental changes becomes the key to the future development of UCAVs [9–12].

The value of the attack area is a key parameter that affects the operational performance of UCAVs [13]. The value of the attack area refers to the range within which the missile can hit and damage the target with a certain probability under certain environmental parameters, including the height and velocity of the missile and the target, the ballistic inclination angle, the entry angle and the off-axis angle. The range includes the farthest distance and the nearest distance, which are called the far and near bounds in this paper. In the air combat environment, the attack area value of a UCAV is obtained by a neural network acquired by learning a large amount of data off-line. UCAVs today usually use the back propagation (BP) neural network fitting strategy to obtain high-accuracy values of the attack area [14,15]. However, in future air combat the environment will change rapidly. The adaptability of the neural network and the accuracy of the fitting values decrease in the complex battle field environment because of the fixed network structure and training data. Because the data are obtained from a fixed environment, the neural network obtained from off-line learning is only applicable to that fixed environment [16]. When the attack area is fitted in an environment different from the one from which the data were obtained, the fitting accuracy will inevitably decrease. The range of the real attack area will also change, and the error between the value obtained by the neural network and the real value will change too. Improving the adaptability of neural network fitting for the attack area of UCAVs therefore becomes the key to enhancing future air combat capability. However, it is impossible for the aircraft to judge the air combat environment and update the fitting values of the attack area autonomously with the common neural network strategy [17–19].

The simple BP neural network strategy cannot cope with complex environmental changes or autonomously optimize decision-making. To solve these problems, the following measures can be taken. First, collect large amounts of simulation data on the attack area and obtain a network that can be used online by means of deep learning. Second, change the air combat environment, get the true value of the attack area in the new environment, and use the true value to correct the simulated value of the online network. Third, make the UCAV independently correct the attack area value by means of reinforcement learning. As an unmanned combat unit, the UCAV needs to independently identify the environment in future complex air combat and use the correct optimization method to solve the problem of neural network inadaptability. This process involves data fitting and autonomous decision making. Deep learning (DL) is widely used in UCAV data processing [20–23]. Reinforcement learning (RL) is widely used in UCAV autonomous decision making [24–26].

The strategy of deep RL is introduced to solve the problem of how UCAVs make an autonomous decision on adapting the network when the air combat environment changes. In this paper, we solve the problem of real-time autonomous fitting optimization of UCAVs' attack area in the future battle field environment with the deep deterministic policy gradient (DDPG) algorithm, which is a kind of deep RL algorithm. According to different continuous environment models, we can design the DDPG algorithm framework based on the Python + TensorFlow platform. The whole paper is divided into five sections. The first section summarizes the role of UCAVs and of artificial intelligence technology in the UCAV field. The second section introduces the technological progress of DL, RL and the deep Q network (DQN). In Section 3, the BP neural network, DQN and DDPG algorithms are modeled and illustrated. In Section 4, simulation parameters are set and the algorithms mentioned in Section 3 are simulated and verified. Section 5 gives the conclusions about the simulation.

2. DL, RL, DQN and DDPG

DL is well known for its superhuman proficiency in air combat applications where it is necessary to employ advanced computer intelligence that learns from data and makes intelligent decisions aimed at increasing profitability and sustainability. In the future air combat environment, precise attack area value fitting for UCAVs still needs neural networks, and collecting and processing data is still important for the aircraft. Because attack area value fitting needs large amounts of data and a complicated network structure, the traditional single-layer neural network is hard pressed to deal with the complex data structure. Using a deep neural network is the key to solving this problem. Machine learning algorithms based on deep networks are called DL algorithms. Currently, many fields, such as image and voice recognition and information recognition, belong to the field of DL.

RL is another kind of artificial intelligence, and it is what UCAVs need to improve their autonomous decision-making ability. RL is a branch of machine learning in which machines gradually learn control behaviors via self-exploration of the environment [27]. RL employs an actor that iteratively interacts with the environment and modifies its control actions to maximize the rewards received from the environment. The main advantage of RL algorithms is that they learn to optimize control policies by exploration of the environment, independent of the linearity or multivariability of the system [28]. The learned policy is obtained from numeric data of the reaction system, and it does not require parameter tuning or real-time optimization, which makes RL easily adaptable to different control tasks once the framework is established. RL refers to a class of learning algorithms where a scalar evaluation of the performance of the network is available from the interaction of the network with the environment. RL aims at maximizing the expected evaluation by adjusting the parameters of the network.

In order to solve the problem of UCAV autonomous optimization of the far boundary of the attack area, we need to use both DL and RL. The DQN algorithm realized the combination of RL and DL for the first time and has made remarkable achievements in practical applications. In particular, the DQN algorithm introduced the technology of the objective function, the objective network and experience playback in a pioneering way, which lays a solid foundation for the further development of deep RL. However, the DQN algorithm also has some limitations in practical application. For example, the DQN algorithm cannot deal with continuous motion control problems, which greatly limits its application range. Considering the shortcomings of the DQN algorithm, researchers have successively proposed more powerful deep RL algorithms, such as the DDPG algorithm for dealing with continuous motion control problems.

Lillicrap further proposed the DDPG algorithm, which combines the DL strategy with the deterministic policy gradient algorithm, in 2016 [29]. This new algorithm has many advantages. For example, the experience playback mechanism is introduced to solve the problem of data correlation, and the algorithm uses a dual network structure, so that the learning process is more stable and the convergence is faster.

In order to solve the autonomous decision-making and attack area fitting strategy problems of UCAVs under complex environment changes in future air combat, we adopt the DDPG algorithm, which can deal with continuous action problems efficiently. In order to build the DDPG algorithm model for programming, we define the mathematical concepts involved in the autonomous decision-making problem of UCAVs. Since the principles of calculating the far and near boundaries of the attack area are the same, we take the far boundary fitting as an example. Choosing the UCAV as the agent, we define 10 typical air combat environments, which are recorded as E1, ..., E10.

The fitting values of the attack area in the different air combat environments are recorded as l(s)1, ..., l(s)10.

The range is from 0 m to 200 m. The real values of the attack area obtained by program simulation are recorded as L(s)1, ..., L(s)10.

To decrease the error between fitting values and real values, the agent chooses a suitable coefficient autonomously and optimizes the fitting values calculated by the neural network. We define these coefficients as actions, which are recorded as a1, ..., a10.

The range is from 0.8 to 1.2. The cost function is defined in terms of the error between the fitting values and the real values.
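To make these definitions concrete, the sketch below shows one way the states, actions and feedback could be wired into a simple environment loop. It is only an illustrative Python sketch: the class name, the assumption that the reward is the negative absolute error between the corrected fitting value and the real value, and the cycling through the 10 environments are assumptions of ours, since the cost function equation is not reproduced in this text.

```python
import numpy as np

class AttackAreaEnv:
    """Illustrative environment for correcting attack-area fitting values.

    Assumptions (not from the paper): the reward is the negative absolute
    error between the corrected fitting value and the real value, and the
    state simply cycles through the 10 typical air combat environments.
    """

    def __init__(self, fitting_values, real_values):
        # l(s)1, ..., l(s)10 from the off-line network; L(s)1, ..., L(s)10 from simulation
        self.fitting_values = np.asarray(fitting_values, dtype=float)
        self.real_values = np.asarray(real_values, dtype=float)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # The action is the correction coefficient a in [0.8, 1.2].
        a = float(np.clip(action, 0.8, 1.2))
        corrected = a * self.fitting_values[self.state]
        error = abs(corrected - self.real_values[self.state])
        reward = -error                     # smaller error -> larger reward
        self.state = (self.state + 1) % len(self.fitting_values)
        return self.state, reward, error
```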

In the DDPG algorithm, the actor network and the critic network are updated at the same time. The actor network is used to update strategies, which refer to the different coefficients adopted by the UCAV. The critic network is used to provide gradient information and approximate the state-action value function, which refers to the cost function.

        The expected reward function can be expressed as

The target function of the DDPG algorithm is defined as the discount expectation of the cumulative reward:

J(μ) = Eμ[r0 + γr1 + γ^2 r2 + ... + γ^n rn],

where J(μ) is the cumulative reward, μ is the independent variable of the actor network, r0, ..., rn are expected rewards, γ is the expected reward coefficient, and Eμ expresses the mathematical expectation.

        The action that maximizes the value of the target function is defined as the optimal action:

The process of training the value network is to find the optimal parameters in the value network to minimize the loss function:

L = E[(y − Q(si, ai; θQ))^2],

where L is the loss function, y is the real value function and Q(si, ai; θQ) is the optimal value function. θQ expresses the independent variable in the critic network.

In conclusion, the goal of the DDPG algorithm is to maximize the value of the target function and minimize the value of the loss function.

3. Algorithm framework and parameter setting

Firstly, UCAVs should be able to fit the values of the attack area. Through program simulation of the flight process in a fixed environment, a large amount of data close to the real situation is obtained in the off-line state. These data can reflect the flight characteristics of UCAVs in the fixed environment. Secondly, we can obtain the far and near bounds of the attack area under different initial conditions by dichotomy or the golden section method, as sketched below. Thirdly, we process the data into learnable data, which can be divided into two parts, the input part and the output part. After that, we design a neural network structure conforming to the rules of input and output, and put the data into the neural network for learning. Finally, after repeated training, the neural network tends to be stable and convergent, and the neural network used for real-time fitting of the attack area can be obtained.
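For the dichotomy mentioned above, a minimal bisection sketch is given below. The predicate hits(r), the metre-based search interval and the tolerance are hypothetical placeholders; the real implementation depends on the trajectory simulation program.

```python
def far_boundary_by_bisection(hits, lo=0.0, hi=200_000.0, tol=1.0):
    """Find the farthest launch range (in metres) at which the missile still hits.

    `hits(r)` is a hypothetical predicate backed by the trajectory simulation:
    it returns True when a launch at range r hits the target under the current
    environment parameters. Monotonicity (hit near, miss far) is assumed.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if hits(mid):
            lo = mid   # still hits: the far boundary lies at or beyond mid
        else:
            hi = mid   # misses: the far boundary lies below mid
    return lo
```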

Fig. 1 shows the designed neural network structure. The neural network includes the input layer, two hidden layers and the output layer.

Fig. 1 Designed neural network structure

The input layer has seven input parameters, which correspond to the height and velocity of the missile and the target, the ballistic inclination angle, the entry angle and the off-axis angle, respectively. Due to the large amount of data and the complex input parameters, in order to improve the learning effect of the BP neural network, it is necessary to design no fewer than two hidden layers and an appropriate number of neurons in each hidden layer. The output layer represents the far boundary of the attack area for a particular set of environment parameters. In this paper, we use the 7×5×5×1 neural network structure to process the data.
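A minimal sketch of this 7×5×5×1 structure in TensorFlow/Keras is shown below. Only the layer sizes come from the text; the sigmoid hidden activations and linear output are assumed choices typical of a BP fitting network, not details given in the paper.

```python
import tensorflow as tf

def build_fitting_network():
    # 7 inputs: missile/target heights and velocities, ballistic inclination,
    # entry angle and off-axis angle; two hidden layers of 5 neurons each;
    # 1 output (far boundary of the attack area).
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(7,)),
        tf.keras.layers.Dense(5, activation="sigmoid"),
        tf.keras.layers.Dense(5, activation="sigmoid"),
        tf.keras.layers.Dense(1),
    ])
```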

Fig. 2 shows the network structure of attack area fitting. Input factors of the neural network include the height, speed, elevation angle and off-axis angle of the UCAV, and the height, speed and entry angle of the target. Output factors are the far boundary values and near boundary values. By simulating the real trajectory of the UCAV, we can obtain a large number of accurate far and near boundary values with different input factors. We record the data and use them to train the neural network. When the air environment changes, the error between fitting values and real values may increase. To decrease the error, UCAVs should adopt a correction coefficient for different air combat environments.

Fig. 2 Attack area fitting network structure

In a conventional aircraft, the pilot can manually modify the value of the attack area obtained by neural network fitting through sensing the change of the environment. The specific range of modification is determined by off-line experimental tests and experience. The neural network trained only in fixed environments can thus be used in a variety of different environments by means of numerical correction. The environmental perception and decision-making of pilots become the key to improving environmental adaptation. As an unmanned combat unit, the UCAV cannot rely on a pilot to make decisions. We need to train UCAVs in the off-line state in order to improve their autonomous decision-making ability. In combination with environment tags and the modification strategy, we further process the data used for attack area fitting and put the processed data into the two networks that are needed for the DDPG algorithm.

Fig. 3 shows the structure of the DDPG algorithm. The specific algorithm steps are as follows:

Step 1 Randomly initialize the weights of the critic network and the actor network: θQ and θμ.

Step 2 Initialize the target critic network Q′ and the target actor network μ′, and set their weight parameters equal to those of the online networks: θQ′ = θQ, θμ′ = θμ.

Step 3 Initialize the experience playback pool R.

Step 4 Start a new round of RL and randomly initialize the process to search actions.

Step 5 Get the initial state value s0.

Step 6 Start a new time step of learning.

Step 7 Calculate the action of the current time step according to the current strategy with noise Nt: at = μ(st; θμ) + Nt.

Step 8 Perform the action, and record the reward and new state: at, rt and st+1.

Step 9 Store the conversion experience data (st, at, rt, st+1) in the experience pool.

Step 10 Randomly sample a small batch of converted experience samples (si, ai, ri, si+1) from the experience pool, and set yi = ri + γQ′(si+1, μ′(si+1; θμ′); θQ′).

Step 11 Minimize the loss function and update the critic network: L = (1/N) Σi (yi − Q(si, ai; θQ))^2.

Step 12 Update the actor network with the gradient strategy algorithm: ∇θμ J ≈ (1/N) Σi ∇a Q(si, a; θQ)|a=μ(si) ∇θμ μ(si; θμ).

Step 13 Update the target networks: θQ′ = τθQ + (1 − τ)θQ′, θμ′ = τθμ + (1 − τ)θμ′, where τ is the weight parameter.

Step 14 Repeat Steps 6–13.

Step 15 When the round of learning is over, repeat Steps 4–14.
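Steps 10–13 form the core of each learning iteration. The TensorFlow sketch below illustrates that core under assumptions of ours: the actor and critic are Keras models, the replay batch is a tuple of tensors, and the discount factor and soft-update weight take illustrative values. It is a sketch of the standard DDPG update, not the paper's code.

```python
import tensorflow as tf

GAMMA, TAU = 0.9, 0.01   # illustrative discount factor and soft-update weight

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch):
    """One DDPG update on a sampled mini-batch (s_i, a_i, r_i, s_{i+1})."""
    states, actions, rewards, next_states = batch

    # Step 10: y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    y = rewards + GAMMA * target_critic([next_states, target_actor(next_states)])

    # Step 11: minimize the critic loss (y_i - Q(s_i, a_i))^2
    with tf.GradientTape() as tape:
        critic_loss = tf.reduce_mean(tf.square(y - critic([states, actions])))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

    # Step 12: update the actor along the policy gradient (i.e. maximize Q)
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([states, actor(states)]))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))

    # Step 13: soft update of the target networks with weight parameter tau
    for target, online in ((target_actor, actor), (target_critic, critic)):
        for t_var, o_var in zip(target.variables, online.variables):
            t_var.assign(TAU * o_var + (1.0 - TAU) * t_var)
```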

The block diagram in Fig. 3 is the framework of the DDPG algorithm. The DDPG algorithm is a kind of RL algorithm combined with DL. The computational fitting process of the algorithm belongs to the category of DL, while its framework optimization process belongs to the category of RL. RL begins with a process that does not update the framework parameters; during this process, the empirical data pool is filled. After that, the agent (UCAV) selects actions according to the experience pool, and the environment changes with the action and feeds back the reward value.

Fig. 3 DDPG algorithm structure

The actions mentioned in the algorithm refer to the optimization of the far-bound value of the attack area obtained by DL. The optimization method is to multiply the far-bound value of the attack area by a coefficient ranging from 0.8 to 1.2. The environment mentioned in the algorithm refers to the new far-bound value after the UCAV selects the action (optimizing the far-bound value of the attack area). The feedback value mentioned in the algorithm refers to the error between the new far-bound value and the real far-bound value.

4. Simulation validation

First, we train the commonly used BP neural network. There are seven input parameters in total, and several specific values are selected within a certain range for each. The selection is shown in Table 1.

According to the numerical selection scheme in Table 1, we can obtain a total of 7×3×5×7×3×13×9 = 257 985 combinations of environmental parameters. We select 250 000 parameter combinations randomly, and the corresponding far boundary values of the attack area can be obtained by simulation. These data form the training set, and the rest of the combinations with their far boundary values form the test set.
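For illustration, the full parameter grid and the random train/test split described above could be generated as follows; the placeholder value lists stand in for the actual candidate values in Table 1, which is not reproduced here.

```python
import itertools
import random

# Numbers of candidate values per input parameter (7x3x5x7x3x13x9 = 257 985).
counts = [7, 3, 5, 7, 3, 13, 9]

# Placeholder grids; the real candidate values come from Table 1.
grids = [list(range(n)) for n in counts]

combinations = list(itertools.product(*grids))
assert len(combinations) == 257985

random.shuffle(combinations)
train_combinations = combinations[:250000]   # far boundaries obtained by simulation
test_combinations = combinations[250000:]    # remaining 7 985 combinations
```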

Training parameters include the number of iterations, the learning rate and the accuracy target, which we set to 100, 0.1 and 0.004, respectively. By training the BP neural network, we get a network that can be used online to fit the far boundary value of the attack area.
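Continuing the hypothetical Keras sketch from Section 3, the quoted training parameters could be applied as follows. Interpreting the accuracy target of 0.004 as a mean-squared-error threshold for stopping, and using plain SGD, are assumptions; x_train and y_train denote the normalized parameter combinations and the simulated far boundary values.

```python
import tensorflow as tf

class StopAtTarget(tf.keras.callbacks.Callback):
    """Stop training once the loss falls below the accuracy target."""
    def __init__(self, target):
        super().__init__()
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("loss", float("inf")) < self.target:
            self.model.stop_training = True

model = build_fitting_network()              # 7x5x5x1 sketch from Section 3
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1), loss="mse")
# model.fit(x_train, y_train, epochs=100, callbacks=[StopAtTarget(0.004)])
```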

Table 1 Parameter value selection

Consider the influence of natural environment parameters (temperature, altitude, etc.) on the above online network. If the flight environment of the aircraft is the same as the network training environment, a high-precision far boundary value can be obtained by the BP neural network. If the flight environment of the aircraft differs from the network training environment in, e.g., temperature, altitude and other parameters, then the far boundary value of the attack area obtained by the BP network will contain an error. UCAVs can correct this error with the DDPG algorithm.

We first build the deep RL network framework according to the DDPG algorithm with Python and TensorFlow. The algorithm contains two networks: the critic network and the actor network. The critic network contains two hidden layers, each containing 400 neurons, and uses the rectified linear unit (ReLU) activation function. The actor network also contains two hidden layers, each containing 400 neurons, and uses the tanh activation function. Then, we record real values of the UCAV's attack area by simulation in 10 kinds of typical air combat environments and store the data in the memory experience pool. Next, we put the values of the attack area calculated by the neural network trained in a fixed environment into the algorithm. These data are initial values that wait to be modified, and the appropriate policy is selected to modify them. We build the actor network and the critic network, and then set the learning steps. Each round of study includes 200 steps, so that the agent can make decisions autonomously 200 times in one round of study.
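The two networks described above might be built as follows. The layer widths and activations (two 400-neuron hidden layers, ReLU for the critic, tanh for the actor) come from the text; the one-dimensional state and action and the scaling of the actor output into [0.8, 1.2] are our own assumptions.

```python
import tensorflow as tf

STATE_DIM, ACTION_DIM = 1, 1   # assumed: environment state index, correction coefficient

def build_actor():
    # Two hidden layers of 400 neurons with tanh activation; the output is
    # scaled from [-1, 1] into the coefficient range [0.8, 1.2] (assumption).
    inputs = tf.keras.layers.Input(shape=(STATE_DIM,))
    x = tf.keras.layers.Dense(400, activation="tanh")(inputs)
    x = tf.keras.layers.Dense(400, activation="tanh")(x)
    raw = tf.keras.layers.Dense(ACTION_DIM, activation="tanh")(x)
    action = tf.keras.layers.Lambda(lambda t: 1.0 + 0.2 * t)(raw)
    return tf.keras.Model(inputs, action)

def build_critic():
    # Two hidden layers of 400 neurons with ReLU activation; the critic maps
    # a (state, action) pair to a single Q-value.
    state_in = tf.keras.layers.Input(shape=(STATE_DIM,))
    action_in = tf.keras.layers.Input(shape=(ACTION_DIM,))
    x = tf.keras.layers.Concatenate()([state_in, action_in])
    x = tf.keras.layers.Dense(400, activation="relu")(x)
    x = tf.keras.layers.Dense(400, activation="relu")(x)
    q_value = tf.keras.layers.Dense(1)(x)
    return tf.keras.Model([state_in, action_in], q_value)
```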

Record the feedback used to update the value network and the strategy network. There are 2 000 learning rounds, and the first 150 rounds are used to supplement the experience memory pool. When the pool is filled, we begin to train and update the network until the desired effect is achieved.

The specific parameters of the deep RL are shown in Table 2, and the detailed information of the 10 kinds of environmental states is shown in Table 3.

When the error between the real value and the fitting value in a step is less than 1 000 m, we call it a useful step. In each round of learning, we record how many steps the round needs to accumulate 50 useful steps. We can get the simulation result shown in Fig. 4 after 2 000 rounds of learning. In the first 400 rounds of learning, the results are stable at around 200 steps. After these 400 rounds of learning, the result begins to decline and quickly stabilizes within 80 steps.

Table 2 Simulation parameters

Table 3 Information of environmental states

We train UCAVs so that they acquire the decision-making ability to determine what policy should be adopted in different environments. In each round of learning, UCAVs gradually correct the attack area values obtained by the neural network through the learned ability. The faster UCAVs reach the goal of accumulating 50 useful steps per round, the stronger the ability that UCAVs have learned. Fig. 4 shows that, with the help of the DDPG algorithm, after about 400 rounds of learning UCAVs' correction ability is greatly improved. Trained UCAVs can reach the predetermined goal within about 80 steps of operation. Considering that the single-step time is very short, the training effect can meet the needs of engineering. After about 400 rounds of learning, UCAVs' correction ability is stable at about 80 steps, which indicates that UCAVs' learning ability is stable. Trained agents have been preliminarily equipped with artificial intelligence.

Fig. 4 Steps for targets in 2 000 rounds of learning

Fig. 5 reveals how the environmental states change with time. Ten typical air combat environments are arranged in a time sequence to simulate a complete change process of air combat environments. In Fig. 6, we compare the far boundary obtained by the common BP neural network with the far boundaries obtained by the new DDPG algorithm and the DQN algorithm. In Fig. 7, the errors of the far boundary are compared. Fig. 6 and Fig. 7 show that the BP neural network algorithm cannot adapt to the changes in the environment. The DQN and DDPG algorithms can guide the UCAV to modify the attack area value, but DQN can only deal with discrete actions. In the same environment, the correction of DQN is not as accurate as that of DDPG. The simulation results in the figures also demonstrate that the error of the far boundary value of the attack area corrected by the DDPG algorithm is smaller.

Fig. 5 Environmental states changing with time

Fig. 6 Far boundary obtained by algorithms

Fig. 7 Far boundary errors changing with time

In Fig. 8, we compare the near boundary obtained by the common BP neural network with the near boundaries obtained by the new DDPG algorithm and the DQN algorithm. In Fig. 9, the errors of the near boundary are compared. Fig. 8 and Fig. 9 indicate that the BP neural network algorithm cannot adapt to the changes in the environment when fitting the near boundary. In the same environment, the correction of DQN is not as accurate as that of DDPG. The simulation results in the figures also indicate that the error of the near boundary value of the attack area corrected by the DDPG algorithm is smaller.

Fig. 8 Near boundary obtained by algorithms

Fig. 9 Near boundary errors changing with time

These simulation results show that the DDPG algorithm used for autonomous decision-making of UCAVs in the field of attack area calculation in complex air combat environments is more effective than the DQN algorithm and the BP neural network.

5. Conclusions

The DDPG algorithm proposed in this paper can well meet the requirement that UCAVs autonomously optimize the boundary values of the attack area in complex air combat environments. This algorithm can guide an agent (UCAV) to acquire the ability of autonomous decision-making through off-line learning, and it reflects the extensive application of artificial intelligence technology in the field of unmanned aerial vehicles. The DDPG algorithm has an autonomous learning ability that the traditional BP neural network algorithm does not have. Compared with other deep RL algorithms, such as DQN, the DDPG algorithm can choose continuous actions to solve continuous problems, which is necessary for the rapidly changing air combat environment. In this paper, the DDPG algorithm is applied to the optimization of the far boundary of the UCAV attack area. Through simulation and comparison, we conclude that the DDPG algorithm is effective for solving the related problems. At the same time, the DDPG application framework proposed in this paper can provide reference and help for other artificial intelligence problems in the field of unmanned aerial vehicles.
