

        A deep reinforcement learning method for multi-stage equipment development planning in uncertain environments


        LIU Peng,XIA Boyuan,YANG Zhiwei,LI Jichao,and TAN Yuejin

        College of Systems Engineering,National University of Defense Technology,Changsha 410073,China

Abstract: Equipment development planning (EDP) is usually a long-term process often performed in an environment with high uncertainty. Traditional multi-stage dynamic programming cannot cope with this kind of uncertainty involving unpredictable situations. To deal with this problem, a multi-stage EDP model based on a deep reinforcement learning (DRL) algorithm is proposed to respond quickly to any environmental changes within a reasonable range. Firstly, the basic problem of multi-stage EDP is described, and a mathematical planning model is constructed. Then, for two kinds of uncertainties (future capability requirements and the amount of investment in each stage), a corresponding DRL framework is designed to define the environment, state, action, and reward function for multi-stage EDP. After that, the dueling deep Q-network (Dueling DQN) algorithm is used to solve the multi-stage EDP to generate an approximately optimal multi-stage equipment development scheme. Finally, a case of ten kinds of equipment in 100 randomly generated environments is used to test the feasibility and effectiveness of the proposed models. The results show that the algorithm can respond instantaneously in any state of the multi-stage EDP environment and, unlike traditional algorithms, does not need to re-optimize the problem for any change in the environment. In addition, the algorithm can flexibly adjust at subsequent planning stages in the event of a change to the equipment capability requirements to adapt to the new requirements.

Keywords: equipment development planning (EDP), multi-stage, reinforcement learning, uncertainty, dueling deep Q-network (Dueling DQN).

1. Introduction

Equipment development planning (EDP) is an important part of national defense procurement and serves as a professional system that supports the entire decision-making process for a country's military equipment, from equipment research & development (R&D) to final decommissioning. It thus plays a vital role in the planning of national defense, the construction of future national defense forces, and the ability to deal with possible opponents in the future. Due to the limitations of existing technology and the cost constraints in the construction and development of weapon equipment, it is difficult for weapon equipment to reach its planned capability immediately. Therefore, it is necessary to adopt “gradual procurement” and “spiral development” modes so that the capability requirements can be met in stages [1,2]. In the meantime, as the military power of other countries increases, the complexity of the military tasks undertaken by the army increases rapidly, resulting in a continuous evolution of combat capability requirements. Therefore, the continuous development of weapon equipment is required to meet the evolving capability requirements. In addition, there is a close relationship between the capability levels at the various stages of the development process of weapon equipment. In other words, the investment of resources (such as cost and time) at any stage of weapon equipment development can affect capability development at all subsequent stages. Hence, equipment needs to be developed in stages to gradually meet capability requirements that are constantly evolving, and the planned capabilities of an equipment portfolio are closely linked between the various stages. These factors make it necessary to consider the planning and development of an equipment portfolio from a multi-stage perspective.

The goal of weapon equipment development is to improve the combat capability or combat effectiveness of the weapon equipment system. Therefore, the uncertainty in the development of weapon equipment is different from that in general commercial issues: it stems from many unpredictable changes in the global pattern of threat scenarios, adjustments of the national security strategy, and key strategic directions of homeland defense. Currently, scenario- or probability-based planning is often used to deal with such high uncertainties [3,4], which result in unexpected changes in capability requirements and investment amounts at each stage of the equipment development process. The existence of the above-mentioned uncertainties makes it necessary to dynamically optimize the development plan during the entire planning period of weapon equipment to meet the corresponding capability requirements in any possible future scenario.

Traditional multi-stage development planning facing future uncertainties is generally classified into two categories: (i) probability-based multi-stage uncertain decision-making, such as the Markov decision process [5] and expected value planning [6], which requires assigning a certain probability to scenarios that may appear in the future and generally provides only one optimal solution; (ii) stochastic programming, which is an active branch of mathematical programming and has been applied to optimization problems that involve uncertain data [7-9]. Stochastic programming requires almost no assumptions about the stochastic processes. Instead, it constructs a scenario tree to model the stochasticity of uncertain parameters and generates a set of feasible planning schemes, one for each possible scenario. Both methods expect the uncertain information to be understood to a certain degree, either through historical data or from expert experience. However, military opponents often use unexpected tactics to achieve their desired goals, making it very difficult to predict future scenarios. Although strategic departments have tried their best to expand the scope of future forecasts, they still fail to keep up with the rapid changes in today's defense environment. The exact prediction of such highly complex and uncertain defense environments is impossible. Once the actual situation does not agree with the anticipated scenarios, adverse impacts and even serious consequences will follow.

Reinforcement learning (RL) is a dynamic programming method that can learn the best action policy from the interaction between the agent and changing environments. With the addition of deep learning, deep RL (DRL) can deal with environmental inputs that are not included in the training set, because the deep neural network has a strong prediction capability. The trained agent can respond quickly to any environment without depending on prior probability knowledge [10]. Thus, DRL can deal with situations that traditional dynamic programming methods cannot handle.

The DRL idea stems from the combination of deep learning and traditional RL methods, which had fallen into a trough in the early 21st century because of the curse of dimensionality. In 2013, DeepMind first published a paper on using DRL to play Atari games, leading to a second boom in the development and application of RL [11]. In 2016, the AlphaGo series developed by DeepMind shocked the world by defeating top human Go players, attracting a lot of attention. Since then, DRL has been widely applied to various dynamic programming problems, such as automatic robot control, personalized e-commerce services, and game playing [12].

Recently, scholars have been trying to use DRL methods to solve traditional optimization problems. Motivated by these efforts, this paper uses DRL methods to solve the multi-stage EDP problem in highly uncertain environments. As future environments are difficult to fully identify, EDP is in nature a multi-stage programming problem set in highly uncertain environments. In this context, an advanced DRL algorithm, the dueling deep Q-network (Dueling DQN), is used for the first time to deal with the high uncertainty in EDP problems. The contributions of this paper are summarized as follows:

(i) A mathematical planning model is constructed to describe the problem of multi-stage EDP. The established model takes full consideration of the limited budget and multiple investments in a long-term planning process.

(ii) Considering two uncertainties (future capability requirements and the amount of investment in each stage), a corresponding Dueling DQN framework is established with five factors: state, action, reward, state transition, and basic information.

(iii) A Dueling DQN-based algorithm flow is proposed to generate an approximately optimal multi-stage equipment development scheme.

This paper is organized as follows: Section 2 analyzes the current application of stochastic programming algorithms and multi-stage programming to national defense in uncertain environments, as well as the application of DRL algorithms to multi-stage decision-making. Section 3 describes the multi-stage EDP problem. Section 4 proposes a Dueling DQN algorithm to solve the multi-stage EDP problem. Section 5 uses a test case to demonstrate and analyze the model performance. Section 6 presents the concluding remarks.

2. Relevant research

        2.1 Stochastic programming algorithms for uncertain environments

Stochastic programming can be traced back to the uncertain linear programming model proposed by Dantzig in the 1950s [13]. Since then, classic application cases of stochastic programming have continuously emerged, such as Eppen et al. using stochastic programming to study the production planning of General Motors in the United States [14]. However, it did not receive much attention at that time due to its huge computational load. With the rapid increase in computing speed and the development of intelligent optimization algorithms over the past ten years, the computational complexity of stochastic programming has been greatly reduced. Therefore, stochastic programming has been increasingly studied; its theoretical methods have become a major research topic in the field of operational optimization and have been successfully applied in many fields, such as project planning, product planning, and asset portfolio optimization [15-19].

When using stochastic programming techniques to solve practical problems, the focus is generally on scenario tree generation methods and the solution of stochastic programming models.

The mainstream research on scenario trees is generally based on sufficient historical data [7,20] and a known probability distribution of the stochastic parameters [21,22]. When high model accuracy is not required, Monte Carlo discretization based on a large amount of random sampling is generally used for the model construction [23].

Due to the uncertainty in scenario tree-based stochastic programming, the dimension of the decision variables and the scale of the solution space increase exponentially as the branches of the scenario tree increase at each stage, which greatly increases the computational complexity of the problem and makes it a non-deterministic polynomial (NP)-hard problem. To obtain the global optimal solution of a stochastic programming model, most studies in this field have explored rigorous mathematical programming methods to solve stochastic programming models that meet certain strict constraints or have specific mathematical structures [24-32]. Scholars have also explored strategies that combine pure mathematical programming methods and heuristic algorithms to solve stochastic programming models [33-37]. However, in practical engineering applications, as the scale of stochastic programming problems becomes increasingly large and the mathematical structure of the problem becomes more and more complex or too difficult to define exactly, an increasing amount of research effort is being devoted to exploring how hyper-heuristic algorithms/intelligent optimization algorithms can be used to obtain a satisfactory solution within an acceptable time limit. The heuristic algorithms used so far in the literature include genetic algorithms based on integer stochastic programming [38], ant colony algorithms [39], particle swarm optimization algorithms [40,41], tabu search algorithms [42], and differential evolution algorithms [43].

        2.2 Application research of multi-stage programming in national defense

Multi-stage programming is widely used in investment portfolio optimization [44,45], drug R&D [46], biofuels [47], and rescue logistics [48]. In recent years, the research and application of multi-stage programming in national defense have also received more attention. For example, Chan et al. [49] conducted planning over a series of years so that the initial and subsequent acquisition decisions could be distinguished and treated differently. Whitacre et al. [50] investigated a static planning period of ten years. Golany et al. [51] explored a multi-stage approach for the resource allocation problem. Xiong et al. [52,53] addressed multi-period scheduling problems. Rempel et al. [54] considered a 20-year planning period and another 20-year budget period. Wang et al. [55] considered a 15-year planning period divided into three five-year cycles. Moallemi et al. [56] considered submarine acquisition and crew allocation in different epochs (each composed of several weeks) over a three-year duration. Xia et al. [57] studied a multi-stage weapon planning problem.

Multi-stage combinatorial optimization is essentially dynamic programming. However, most studies on multi-stage planning problems divide the planning duration into several periods that are considered separately. This strategy is focused on a static background [49-52]. To date, few studies have been conducted on how time-varying problems can be modeled.

A more comprehensive study will involve introducing a dynamic or time-varying factor, which considers a problem that changes over time. Brown et al. [58] believed that pairwise interactions might occur in one or more periods in the future. Tsaganea [59] used dynamic system theory to model missile defense problems, where the state of a dynamic system directly depends on the previous state. Baker et al. [60] considered dynamic multi-agent simulation to evaluate a scheduling scheme for candidate fleets. Xin et al. [61] proposed a model to solve dynamic weapon-target assignment problems that considered the enemy's attack strategy as a dynamic factor as well as the uncertainty of the attack success. In some studies, the cost or value associated with a project is time-dependent [62,63]. Similarly, Zhang et al. [64] pointed out that joint development of synergistic projects would lead to a synergistic reduction in costs. Xiong et al. [53] determined available budgets for each period according to a discount coefficient (depending on the number of months in the past). Shafi et al. [65] addressed the uncertainty of capability requirements, available budget, and strategic risk across different stages in capability-based planning.

There has been no obvious progress in the mathematics of multi-stage optimization problems; dynamic programming remains the most used and effective method for solving them. Dynamic programming for such problems is a very complex combinatorial optimization task. The main difficulty is that the number of decision variables of the system is several times that of single-stage programming, so large-scale systems have many decision variables and constraints, which increases the difficulty and time taken to find a solution. Moreover, because multi-stage planning considers not only the transition of planning schemes between multiple planning stages but also the constraints on decision variable values at each stage, the model has high computational complexity and difficulty. In addition, when the demand is predictable and known in advance, mathematical programming can give an optimal solution. However, when the demand is random and unpredictable, even robust optimization and stochastic programming may fail to find a reliable solution in practice.

        2.3 Application of DRL algorithms in multi-stage decision-making

Since model-free DRL has the advantage of allowing decision-making without relying on the transition probability distribution or knowing the demand probability distribution [66], more and more studies have used this technique in recent years. Several DRL-based methods have been proposed to solve resource allocation and scheduling problems, such as cluster resource management [67], equipment layout optimization [68], satellite power allocation [69], and railway scheduling [70]. In video games, DRL is often used to build a model that allows the computer to learn game rules to a high level independently of human help [11,71-75]. At present, many large companies have opened corresponding test platforms so that the public can test the performance of the companies' DRL algorithms. Due to the incomplete observability and the instability of the state space, the application of DRL to robot control is still at a preliminary stage [76,77]. DRL algorithms for continuous states and actions were proposed in [78] and have been successfully applied to robot control. Gu et al. [79] proposed an asynchronous DRL method, which allows real physical robots to learn complex manipulation skills from scratch through the training of nonlinear deep neural network policies, with benefits over shallow representations for complex tasks. As a kind of self-learning intelligent control algorithm, DRL is very suitable for solving the control problems of complex nonlinear vehicle systems, endowing it with broad application potential for intelligent driving [80-84]. DRL algorithms can effectively overcome the limitations of poor generalization ability and high model complexity faced by traditional algorithms, and have good adaptability, intelligence, and generality. On the one hand, deep learning algorithms can effectively reduce the dimensionality of input data through unsupervised feature learning; on the other hand, adaptive dynamic programming can solve the control problem of continuous state-action systems. The two advantages combine to achieve good results in path planning [85,86] and visual control [87].

3. Description of multi-stage EDP problem

Equipment development is essentially a balance between capital investment and equipment capabilities. When funds are sufficient, decision-makers invest in the development of all equipment to maximize the equipment capability. However, in practice, decision-makers must give up developing some equipment because the budget is often limited. Furthermore, the funds required for equipment development are often not invested all at once, but in stages over a long-term planning process. Therefore, decision-makers need to decide which equipment to invest in at each stage to achieve the maximal capability in the end. The multi-stage EDP problem is formulated in the following.

        3.1 Symbolic definition

Firstly, necessary variables are defined symbolically, as shown in Table 1.

        Table 1 Symbolic definition

        Some important variables are further elaborated as follows:

        3.2 Evaluation of multi-stage EDP scheme

Capability is an important metric for evaluating a multi-stage EDP scheme. Unlike single-stage investment portfolios, multi-stage programming must evaluate the equipment portfolio over the whole planning process.

Assume that X, a multi-stage EDP scheme, is effective, that is, it meets the funding constraints. One needs to consider the following at the end of a specific stage (see the code sketch after this list):

(i) The number of years that the equipment has been developed, as described by

(ii) Whether the equipment has been successfully developed, as described by

where s_i is the development state of equipment w_i. s_i = 1 means that w_i has been successfully developed, and s_i = 0 the opposite. If w_i is expected to take lα_i years to develop, then w_i is regarded as successfully developed only if it has been developed for the expected lα_i years or more.

(iii) Capabilities after the current stage, as described by

where only the capabilities of successfully developed equipment are considered, and the other capabilities are set to zero. For each capability, the maximum value over all successfully developed equipment is taken. In other words, if multiple successfully developed pieces of equipment provide the same capability, the maximum value is taken as the final value of that capability.

(iv) Overall evaluation of the equipment capability, as described by

where Q is the overall capability performance considering the capability requirements.
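To make the evaluation logic of (1)-(4) concrete, the following is a minimal Python sketch of how a scheme could be scored. The function and variable names, the assumption of one-year stages (as in the case study), and the simple requirement-based scoring rule for Q are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np

def evaluate_scheme(X, dev_years, capability, requirement):
    """Hedged sketch of the evaluation in Section 3.2 (names are assumptions).

    X           -- (stages x m) 0/1 matrix; X[t, i] = 1 if equipment i is invested in at stage t
    dev_years   -- length-m vector of expected development years (l_alpha_i)
    capability  -- (m x k) matrix; capability[i, j] = expected level of capability j for equipment i
    requirement -- length-k vector of final capability requirements
    """
    X = np.asarray(X)
    dev_years = np.asarray(dev_years)
    capability = np.asarray(capability, dtype=float)
    requirement = np.asarray(requirement, dtype=float)

    # (i) years each equipment has been developed (one stage = one year, as in the case study)
    years_developed = X.sum(axis=0)

    # (ii) development state s_i: 1 once the expected development length is reached
    s = (years_developed >= dev_years).astype(float)

    # (iii) realized capabilities: only successfully developed equipment counts,
    #       and each capability takes the maximum over that equipment
    realized = (capability * s[:, None]).max(axis=0)

    # (iv) overall evaluation Q: an assumed scoring rule that rewards capabilities
    #      meeting their requirements and penalizes those that do not
    met = realized >= requirement
    Q = realized[met].sum() - float((~met).sum())
    return s, realized, Q
```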

        3.3 Multi-stage EDP model

As indicated above, the future environments for EDP are highly uncertain. Two main uncertain factors are considered in this paper: the amount of investment at each stage and the final capability requirement. Therefore, for different B and R_β, the same development scheme X leads to different results. Hence, the overall evaluation of equipment capabilities mainly depends on one decision variable (the equipment development scheme) and two uncertain factors (the investment amount at each stage and the final capability requirement). Accordingly, the multi-stage EDP model is as follows:

where (5) defines the objective function as maximizing the overall capability performance. Equations (6)-(9) explain the corresponding variables of the mathematical model. Equation (10) constrains the equipment portfolio developed in each stage so that its cost cannot exceed the investment of this stage. In (10), one stage equals ψ years. (e_i/lα_i)·min(lα_i, ψ) is the cost spent in one stage by equipment w_i if it is selected to be developed in this stage, where e_i/lα_i is the cost of one year. If lα_i > ψ, the cost of this stage is (e_i/lα_i)·ψ; otherwise, if lα_i ≤ ψ, the cost of the stage is e_i.
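As a worked illustration of the stage-cost term (e_i/lα_i)·min(lα_i, ψ) in constraint (10), the sketch below computes the cost of one stage's selected portfolio and checks it against the stage budget. The function and parameter names are illustrative assumptions.

```python
def stage_cost(selected, cost, dev_years, psi):
    """Cost of one psi-year stage for the selected equipment, per constraint (10).

    selected  -- iterable of 0/1 investment flags, one per equipment
    cost      -- total development cost e_i of each equipment
    dev_years -- expected development length l_alpha_i of each equipment
    psi       -- number of years in one stage
    """
    total = 0.0
    for pick, e_i, l_i in zip(selected, cost, dev_years):
        if pick:
            # yearly cost (e_i / l_i) times the years spent in this stage, capped at l_i
            total += (e_i / l_i) * min(l_i, psi)
    return total

def within_budget(selected, cost, dev_years, psi, budget):
    """Constraint (10): the portfolio of one stage cannot exceed that stage's investment."""
    return stage_cost(selected, cost, dev_years, psi) <= budget
```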

4. DRL model for the multi-stage EDP

        4.1 Multi-stage EDP environment

As shown in (11), the multi-stage EDP environment mainly includes five factors: state, action, reward, state transition, and basic information. A state is what an RL agent can observe from the environment. The action set is the decision space of a state. A reward is the response when an RL agent takes a certain action in a certain state. The state transition denotes how the current state changes when the agent chooses an action. The basic information includes the total amount of equipment to be developed, the cost of the equipment to be developed, the expected development years of the equipment to be developed, the number of concerned capabilities, the capabilities of the equipment to be developed, the final capability requirement, and the investment budget for each stage.

Next, the definitions of the state, action, reward, and state transition (next state) are introduced respectively.

        4.1.1 State

State information is defined as what the RL agent can observe in the process of equipment development, as follows:

where b_stage represents the amount of investment at stage = 1, 2, ···, t, Lβ_stage represents the number of years taken to develop the equipment, and S_stage represents whether the equipment has been developed successfully. Given the state at the current stage, a learning process will be implemented to decide what actions need to be taken at the next stage to ultimately achieve the optimal performance, that is, the maximal Q in (5).
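The state in (12) can, for example, be assembled as a flat vector, as sketched below. The ordering, the simple [0, 1] scaling, and the inclusion of the stage index and the capability requirement (the five elements listed in Section 4.2.2) are assumptions of this sketch; the paper's actual encoding (a 61-node input layer, normalized per Table 2) may differ.

```python
import numpy as np

def build_state(stage, budget_stage, years_developed, success_flags, requirement,
                max_stage, max_budget, max_years, max_level):
    """Hedged sketch of the state vector: current stage, stage budget b_stage,
    development years (L_beta_stage), success flags (S_stage), and the
    capability requirement, each scaled to [0, 1] (assumed normalization)."""
    return np.concatenate([
        [stage / max_stage],                       # which stage the agent is in
        [budget_stage / max_budget],               # investment available at this stage
        np.asarray(years_developed) / max_years,   # years each equipment has been developed
        np.asarray(success_flags, dtype=float),    # 1 if successfully developed, else 0
        np.asarray(requirement) / max_level,       # final capability requirement
    ])
```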

        4.1.2 Action

At stage = 1, 2, ···, t, the action is defined by

        4.1.3 Next state

Let the equipment state at a development stage = 1, 2, ···, t-1 be state_stage. After action_stage is implemented, the state at the next stage is state_stage+1, then

In (14), there are two main situations for updating the current state. One situation is that the cost of the action at the current stage exceeds the investment budget. This situation would not happen in a real planning problem because it is strictly constrained. In RL, however, this kind of strict constraint should not be imposed on the RL agent because it would limit the agent's exploration capability. Therefore, we expect the RL agent to learn that choosing an equipment portfolio that exceeds the investment budget is not encouraged. To realize this, the selected equipment portfolio will not be developed if its cost exceeds the investment budget. At the same time, a penalty will be imposed on the agent by giving it a negative reward, as the last line of Algorithm 1 shows.

In the other situation, the cost is within the budget, and the equipment development state update is determined by the action (the equipment invested in at the current stage). In this situation, the development years of the selected equipment increase by one. Once the expected development length is reached, the equipment is regarded as successfully developed. Besides, the remaining budget is not transferred to the next stage, so the RL agent can learn to maximize the use of the investment at each stage.
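A minimal sketch of the transition logic just described is given below, assuming one-year stages as in the case study; the function names and the penalty magnitude are assumptions rather than the paper's exact definitions in (14) and Algorithm 1.

```python
def step(years_developed, success_flags, action, cost, dev_years, budget):
    """Hedged sketch of the state transition, assuming one-year stages.

    action -- list of 0/1 investment decisions for the current stage
    Returns updated development years, updated success flags, and an instant reward.
    """
    new_years = list(years_developed)
    new_flags = list(success_flags)

    # yearly cost e_i / l_alpha_i of each selected equipment (one-year stage)
    spend = sum(cost[i] / dev_years[i] for i, pick in enumerate(action) if pick)
    if spend > budget:
        # Over-budget portfolios are not developed; the agent only receives a negative
        # instant reward so it learns to respect the budget constraint (10).
        return new_years, new_flags, -1.0             # assumed penalty magnitude

    for i, pick in enumerate(action):
        if pick and not new_flags[i]:
            new_years[i] += 1                          # one more year of development
            if new_years[i] >= dev_years[i]:
                new_flags[i] = 1                       # expected length reached: success
    # Any unspent budget is not carried over to the next stage.
    return new_years, new_flags, 0.0                   # assumed neutral instant reward otherwise
```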

        4.1.4 Reward

In RL, the reward is a very important component. The purpose of the reward is to help the RL agent evolve and learn in the desired direction. To teach the RL agent to make a correct decision at each stage and to take a long-term view, two rewards are designed: (i) an instant reward for each action; (ii) an episode reward for the final performance. The instant reward is designed as shown in Algorithm 1.

The episode reward Q is calculated based on the overall evaluation of equipment capabilities mentioned in Section 3. Since the episode reward is the result of all actions in the entire round, it needs to be redistributed. The redistribution is performed through reward discounts as follows:

where 0 < γ < 1 is the discount factor and d_r_stage is the discounted episode reward at the stage. It is assumed that the earlier the stage, the lower its contribution to the episode reward, and the greater the discount. The final reward of each action is the sum of the corresponding instant reward and the discounted episode reward.
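One plausible reading of this discounting scheme is sketched below: the episode reward Q is discounted more heavily for earlier stages, and each stage's final reward is its instant reward plus its discounted share. The exponent used here is an assumption consistent with the description above, not necessarily the paper's exact formula.

```python
def final_rewards(instant_rewards, episode_reward, gamma=0.9):
    """Hedged sketch: redistribute the episode reward Q over the stages.

    instant_rewards -- list of per-stage instant rewards r_1 ... r_T
    episode_reward  -- overall capability evaluation Q of the whole episode
    gamma           -- discount factor, 0 < gamma < 1 (0.9 is an assumed default)
    """
    T = len(instant_rewards)
    rewards = []
    for stage in range(1, T + 1):
        # earlier stages receive a smaller (more heavily discounted) share of Q
        d_r = (gamma ** (T - stage)) * episode_reward
        rewards.append(instant_rewards[stage - 1] + d_r)
    return rewards
```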

        4.2 Dueling DQN-based EDP problem solving

        4.2.1 Algorithm framework

Combining the features of Dueling DQN and EDP, we design the RL framework to give a clear map of the whole learning process. As shown in Fig. 1, the algorithm steps are as follows:

        Fig.1 DQN-based multi-stage EDP algorithm framework

Step 1 Putting the current observation into the prediction network.

Step 2 For the current observation, using the prediction network to predict the values of all actions in the action list.

Step 3 Using the ε-greedy policy to choose an action (choosing the action with the maximal predicted value with probability ε, and choosing a random action with probability 1-ε).

Step 4 Using the selected action to interact with the multi-stage EDP environment.

Step 5 Generating the next observation and the reward.

Step 6 Putting the current observation, action, next observation, and reward into the memory bank.

Step 7 Taking the next observation as the current observation.

The above steps are the processes of the RL agent interacting with the multi-stage EDP environment. Based on the trajectory data stored in the memory bank, the neural network can be trained in the following steps.

Step 8 Sampling certain records from the memory bank as the training set of the prediction network.

Step 9 Putting the sampled data into the prediction and target networks.

Step 10 Using the prediction network to predict the values of the sampled actions.

Step 11 Using the target network to evaluate the values of the sampled actions as a reference for the prediction accuracy of the prediction network.

Step 12 Using the errors between the prediction and target networks on the same actions to train the prediction network.

Step 13 Replacing the target network with the prediction network every x steps.

Although the Dueling DQN framework follows the work published in [88], we present it here in a clear and detailed manner that is more accessible to researchers new to the RL field and to those trying to use RL to solve issues in their respective areas.
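The skeleton below maps Steps 1-13 to code. It is a schematic sketch with assumed interfaces (the env, predict_q, train_on_batch, and sync_target callables are placeholders), not the paper's implementation.

```python
import random
from collections import deque

def dueling_dqn_loop(env, predict_q, train_on_batch, sync_target,
                     n_actions, episodes=1000, eps=0.8, batch_size=32,
                     memory_size=10000, sync_every=100):
    """Schematic skeleton of Steps 1-13 (assumed interfaces).

    env            -- object with reset() -> obs and step(action) -> (next_obs, reward, done)
    predict_q      -- callable: obs -> array of action values (prediction network)
    train_on_batch -- callable taking a list of (obs, action, next_obs, reward, done) transitions
    sync_target    -- callable copying prediction-network weights into the target network
    eps            -- probability of taking the greedy action (the paper's epsilon-greedy value)
    """
    memory = deque(maxlen=memory_size)                 # Step 6: memory bank
    step_count = 0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            q_values = predict_q(obs)                  # Steps 1-2: predict all action values
            if random.random() < eps:                  # Step 3: epsilon-greedy selection
                action = int(max(range(n_actions), key=lambda a: q_values[a]))
            else:
                action = random.randrange(n_actions)
            next_obs, reward, done = env.step(action)  # Steps 4-5: interact with the EDP environment
            memory.append((obs, action, next_obs, reward, done))  # Step 6
            obs = next_obs                             # Step 7
            step_count += 1
            if len(memory) >= batch_size:
                batch = random.sample(list(memory), batch_size)   # Step 8: sample a training batch
                train_on_batch(batch)                  # Steps 9-12: fit prediction net against target net
            if step_count % sync_every == 0:
                sync_target()                          # Step 13: replace the target network every x steps
    return memory
```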

        4.2.2 Algorithm design

        (i) State input and action output

The state input vector contains five elements: the current development state of the equipment, the number of years taken to develop the equipment, the current stage, the investment amount at the current stage, and the capability requirement. All elements are normalized using methods chosen according to their types, as shown in Table 2.

        Table 2 State input types and normalization methods

There are two choices for each equipment at each stage: investment and non-investment. Therefore, there are 2^m investment portfolios from which to choose as the action output for m equipment.
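For example, the 2^m portfolios can be enumerated by interpreting each action index as an m-bit investment mask; this particular indexing scheme is an assumption of the sketch.

```python
def action_to_portfolio(action_index, m):
    """Decode an action index in [0, 2**m) into an m-element 0/1 investment vector."""
    return [(action_index >> i) & 1 for i in range(m)]

# With m = 10 equipment there are 2**10 = 1024 candidate portfolios,
# matching the 1 024-node output layer used in Section 5 (one Q value per portfolio).
portfolios = [action_to_portfolio(a, 10) for a in range(2 ** 10)]
```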

        (ii) Network structure design

Neural networks in common use at present mainly fall into three types: feed-forward neural networks (FNN), convolutional neural networks (CNN), and recurrent neural networks (RNN). The FNN is appropriate for processing tensor data. The CNN is suitable for images with pixel data. The RNN is suitable for processing sequential data. Considering that the state input in this paper is a set of tensor data, the FNN is used as the predictor in the Dueling DQN algorithm.

In the Dueling DQN, the neural network structure is improved based on the DQN. The original deep neural network is expanded with an additional layer, as shown in Fig. 2. The hidden layer is connected to two separate parts, value and advantage. Then, the two parts are combined and fully connected to the output layer. In addition, as in the DQN, the prediction network and the target network share the same network structure, and the prediction network replaces the parameters of the target network at fixed intervals.

        Fig.2 Neural network structure
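A minimal PyTorch sketch of the structure in Fig. 2 is given below, using the layer sizes reported in Section 5 (61 input nodes, 600 hidden nodes, 1 024 output nodes) and the standard dueling aggregation from [88]. The choice of PyTorch, the single shared hidden layer, and the ReLU activation are assumptions; the paper does not specify its implementation.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Shared hidden layer split into value and advantage streams, then recombined."""
    def __init__(self, state_dim=61, n_actions=1024, hidden=600):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state-value stream V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # advantage stream A(s, a)

    def forward(self, state):
        # state: tensor of shape (batch, state_dim)
        h = self.hidden(state)
        v = self.value(h)
        a = self.advantage(h)
        # Standard dueling aggregation from [88]: Q = V + (A - mean(A))
        return v + a - a.mean(dim=1, keepdim=True)

# The target network shares this structure; its parameters are periodically
# overwritten with the prediction network's parameters, e.g.:
# target_net.load_state_dict(prediction_net.state_dict())
```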

        4.2.3 Network training algorithm

The Dueling DQN-based DRL model for multi-stage EDP optimization follows the algorithm published in [88], as shown in Algorithm 2. It is the same as the DQN algorithm except for the network structure and the calculation method of the Q value.

In the training process, the network parameters are updated cyclically by minimizing the gap between the predicted rewards and the real rewards, as shown in (19). The Q value of the Dueling DQN is different from that of the nature DQN, as shown in (20).

where Q(s_j, a_j; θ, α, β) is the reward prediction by the prediction network. The value function part and the advantage function part share the same parameter θ and have their own parameters α and β, respectively. y_j is the real reward, and loss_i is the loss function of the prediction network. θ*, α*, and β* are the updated neural network parameters obtained by minimizing the loss function, that is, the gap between the predicted reward and the real reward.
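For reference, a hedged reconstruction of what (19) and (20) correspond to is given below, following the standard Dueling DQN formulation in [88] and the parameter naming above (value stream α, advantage stream β). The paper's exact expressions may differ in detail.

```latex
\begin{align}
\mathrm{loss}_i(\theta,\alpha,\beta) &=
  \mathbb{E}\Big[\big(y_j - Q(s_j, a_j; \theta, \alpha, \beta)\big)^2\Big],\\
Q(s,a;\theta,\alpha,\beta) &= V(s;\theta,\alpha)
  + \Big(A(s,a;\theta,\beta) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a';\theta,\beta)\Big).
\end{align}
```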

After sufficient training, the network will have a certain capability to respond to new states not included in the training dataset. This capability makes it possible for the RL agent to respond to any observation in the testing environment.

5. Case study

        5.1 Test case description

The case originates from a real multi-stage programming problem of an ongoing project. The data in the case have been simplified and desensitized to support academic research. The case includes ten types of equipment to be developed within ten years. Each year is a stage in which decision-making is conducted to decide which equipment should be developed. Ten types of capabilities are required for the equipment, and the extent to which the equipment successfully developed by the last stage meets the capability requirement is used as the metric for evaluating the development scheme, as shown in (4). Equipment parameters are fixed, including the equipment cost and the expected number of years of development (see Table 3). The expected levels of the ten types of capabilities at the end of the development stages are shown in Table 4. The capability requirements and the amount of investment at each stage are variables unknown in advance.

        Table 3 Equipment names and the cost and expected development period

        Table 4 Expected capability level of equipment

For the two uncertain factors, the value range of the required capability is [3,8], and the range of the amount of investment at each stage is [51,81]. Since we assume that there is no prior experience about the two uncertain factors, they are unpredictable and completely random.
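Since no prior distribution is assumed, test environments can be generated by sampling the two uncertain factors at random within the stated ranges, for example as below. The uniform sampling and integer granularity are assumptions of this sketch.

```python
import random

def random_environment(n_capabilities=10, n_stages=10,
                       req_range=(3, 8), invest_range=(51, 81)):
    """Draw one random multi-stage EDP environment (assumed uniform sampling)."""
    requirement = [random.randint(*req_range) for _ in range(n_capabilities)]
    investment = [random.randint(*invest_range) for _ in range(n_stages)]
    return requirement, investment

# e.g. 100 test environments, as used in Section 5.2.3
environments = [random_environment() for _ in range(100)]
```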

        5.2 Result

        5.2.1 Sensitivity analysis

First, we perform a sensitivity analysis on the ε-greedy value and the number of hidden nodes of the Dueling DQN algorithm. Four ε-greedy values and ten numbers of hidden nodes are set to compare the loss value and reward value, as shown in Fig. 3-Fig. 6.

        Fig.3 Loss value comparison on the ε-greedy

Fig. 3 and Fig. 4 show the loss and reward values averaged over every 100 iterations for the different ε-greedy values across 50 000 training iterations. The two figures indicate that the best ε-greedy value for the Dueling DQN is 0.8, as the black line shows. For the reward, when the realized capability cannot meet the required capability, a negative reward is imposed on the artificial intelligence (AI) as a penalty. On the contrary, a positive reward is given to the AI if the realized capability meets the required capability, and a higher realized capability generates a bigger reward.

        Fig.4 Reward value comparison on the ε-greedy

Fig. 5 and Fig. 6 show the loss and reward values averaged over every 1 000 iterations for the different numbers of hidden nodes across 50 000 training iterations. We average over 1 000 iterations rather than 100 because the difference caused by the number of hidden nodes is not as evident as that caused by the ε-greedy value. Among the ten values of the number of hidden nodes, 400 and 600 perform better than the others. Giving priority to the reward, we take 600 as the best number of hidden nodes. Therefore, the following tests are made under the conditions “ε-greedy = 0.8” and “number of hidden nodes = 600”.

Fig.5 Loss value comparison on the number of hidden nodes

Fig.6 Reward value comparison on the number of hidden nodes

        5.2.2 Result comparison

Next, to indicate the advantage of the Dueling DQN, it is compared with the nature DQN. According to the sensitivity analysis results, the ε-greedy value is set to 0.8 and the number of hidden nodes is set to 600 for both algorithms. Besides, the input layer has 61 nodes and the output layer has 1 024 nodes (one node per portfolio, since the ten types of equipment yield 2^10 = 1 024 candidate portfolios). After 50 000 training iterations, the loss and reward values averaged over every 100 iterations for the two algorithms are shown in Fig. 7 and Fig. 8.

        Fig.7 Loss value comparison

        Fig.8 Capability evaluation during the learning process

The results show that the loss values of both the DQN and the Dueling DQN converge to a stable level. However, the Dueling DQN converges faster than the DQN.

According to Fig. 8, both the DQN and the Dueling DQN can learn an effective scheme to meet the capability requirements and converge to a relatively stable reward level. However, the Dueling DQN has a higher average reward and performs better than the DQN.

        5.2.3 Result analysis

To verify the agent's performance after RL training, 100 environments are designed by randomly generating the final capability requirement and the planned investment. For each environment, the equipment development scheme is obtained using the trained RL agent. At the same time, to examine the stability of the RL agent, we use a heuristic optimization algorithm, differential evolution (DE), which has acknowledged advantages for large-scale optimization problems, as a benchmark against which the RL agent results are compared, as shown in Fig. 9.

        Fig.9 Capability evaluation in 100 test environments

In Fig. 9, the bars show the results of the RL agent and the DE algorithm in the 100 environments. The green lines indicate the reward of the RL agent. The red lines above the green lines are the gaps by which the DE exceeds the RL agent. The average value of the gaps is 6.21, accounting for 2.17% of the average value of the RL agent results. The standard deviation of the gaps is 2.912 0. The above analysis indicates that the RL agent has almost optimal performance and is relatively stable.

Although the training process is time-consuming for the RL agent, the trained agent can make decisions almost instantaneously for any state input. In contrast, the DE algorithm must perform a new optimization for any new state input because it does not have the capability of learning from historical experience.

Next, we make a detailed analysis of the best and worst capability evaluation values of the 100 environments.

        (i) The case with the optimal overall performance

The optimal reward is 40, obtained in the 48th test environment; the randomly generated expected investment amounts and capability requirements at the different stages are shown in Table 5.

        Table 5 Random environment for the 48th round of test

The equipment development scheme obtained by the RL agent is shown in Table 6, where the column s_i indicates the equipment state (0: not developed; 1: developing; 2: successfully developed) at stage i, and the column a_i indicates the agent's decision for stage i (black dots indicate that the equipment is to be invested in at that stage, while white dots indicate no investment in the equipment).

        Table 6 AI-suggested equipment development scheme in the 48th round of test

In the environment shown in Table 5, the above scheme successfully develops equipment 1, 2, 3, 4, 5, 6, 8, and 9. The final capabilities after the ten-stage development are 9, 8, 7, 7, 8, 7, 8, 8, 8, and 9, respectively, and all of them meet the corresponding capability requirements.

        (ii) The case with the worst overall performance

The worst reward is 15, obtained in the first test environment; the randomly generated expected investment amounts and capability requirements are shown in Table 7. Under this environment, the equipment development scheme generated by the RL agent is shown in Table 8.

        Table 7 Random environment for the first round of test

        Table 8 AI-suggested equipment development scheme in the 14th round of test

In the environment shown in Table 7, the above scheme successfully develops equipment 1, 2, 3, 4, 5, 8, and 9. The final capabilities are 9, 8, 7, 7, 8, 2, 8, 8, 3, and 9, respectively, where the sixth and ninth capabilities cannot meet the required capabilities.

        (iii) Flexibility test

This subsection analyzes and tests how the RL agent can make subsequent decisions based on the existing development scheme in response to a change in the capability requirements during the development process (see Table 9). A hypothetical scenario is that the RL agent has already developed some equipment for the old capability requirements during the first five stages, but the capability requirements change from the sixth stage. Thus, the RL agent needs to adjust to the new capability requirements from the sixth stage onward, as shown in Table 10. In addition, the investments in all stages are set to 60.

        Table 9 Changes in capability requirements

        Table 10 Equipment development scheme with adjustment

Table 10 shows that the RL agent quickly adapts after the capability requirements change in the sixth stage. Finally, it achieves the same performance as the optimal development scheme, indicating that the RL agent has a certain ability to deal with future changes.

6. Conclusions

Given the high uncertainty and multi-stage characteristics of EDP, this study designs a DRL-based multi-stage EDP model and defines the environment, state, action, state transition, and reward function in the corresponding DRL framework. In the meantime, the Dueling DQN algorithm is employed to solve the multi-stage EDP problem. Finally, the model performance is tested using a test case. Firstly, a sensitivity analysis is conducted to determine the best parameters of the algorithm. Secondly, the model is tested in 100 randomly generated test environments, and the results show that it can achieve almost the same performance as the baseline method. For the two typical uncertainties in multi-stage EDP, the proposed model allows data to be directly input to the neural network without any prior adjustment and generates a corresponding planning scheme. In contrast, other methods either need to re-optimize in response to environmental changes or provide a compromise scheme with the best average performance across all scenarios based on the scenario probability distribution; such a scheme may not be optimal in every scenario.

Although the proposed DRL-based multi-stage EDP model can quickly respond to any changes in the environment parameters, it has some limitations: (i) the overall planning scheme may not be optimal; (ii) as the number of equipment increases, the action space increases exponentially. As a result, traversing the action list greatly increases the time and computation costs, making it hard to apply the proposed method to large-scale problems. At present, there is no perfect solution to the first limitation; even common optimization algorithms cannot guarantee the optimality of the solution. For the second limitation, it is recommended that policy-based RL methods be adopted in future research to deal with large-scale or even continuous infinite action spaces, learning a probability density function over the large-scale action space and selecting the action with the largest probability as the optimal policy.
