Xiaoyu LIU, Chi XU, Haibin YU, Peng ZENG
1State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
2Key Laboratory of Networked Control Systems, Chinese Academy of Sciences, Shenyang 110016, China
3Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang 110169, China
4University of Chinese Academy of Sciences, Beijing 100049, China
Abstract: Edge artificial intelligence will empower traditionally simple industrial wireless networks (IWNs) to support complex and dynamic tasks by collaboratively exploiting the computation and communication resources of both machine-type devices (MTDs) and edge servers. In this paper, we propose a multi-agent deep reinforcement learning based resource allocation (MADRL-RA) algorithm for end-edge orchestrated IWNs to support computation-intensive and delay-sensitive applications. First, we present the system model of IWNs, wherein each MTD is regarded as a self-learning agent. Then, we apply the Markov decision process to formulate a minimum system overhead problem with joint optimization of delay and energy consumption. Next, we employ MADRL to overcome the explosive state space and learn an effective resource allocation policy with respect to computing decision, computation capacity, and transmission power. To break the time correlation of training data while accelerating the learning process of MADRL-RA, we design a weighted experience replay to store and sample experiences categorically. Furthermore, we propose a step-by-step ε-greedy method to balance exploitation and exploration. Finally, we verify the effectiveness of MADRL-RA by comparing it with several benchmark algorithms in many experiments, showing that MADRL-RA converges quickly and learns an effective resource allocation policy that achieves the minimum system overhead.
Key words: Multi-agent deep reinforcement learning; End-edge orchestrated; Industrial wireless networks; Delay; Energy consumption
With the rapid development of intelligent manufacturing, massive distributed machine-type devices (MTDs) interconnected by industrial wireless networks (IWNs) generate a vast amount of heterogeneous data, which are generally computation-intensive and delay-sensitive (Yao et al., 2019; Xu et al., 2021; Yu et al., 2021). For resource-constrained MTDs, it is a challenge to process such data locally. Thus, a centralized cloud computing paradigm was first proposed (Kumar et al., 2019). The cloud server acts as a centralized intelligent agent to uniformly schedule computation resources for MTDs. Correspondingly, MTDs offload data to the cloud server and download the processed data. However, owing to the nature of centralized services, the cloud server is typically deployed far from MTDs. Therefore, offloading data from MTDs to the cloud server incurs significant communication delay, which is intolerable for real-time applications such as robot control, augmented/virtual reality, and autonomous driving.
To address the above issues associated with cloud computing, the multi-access edge computing (MEC) paradigm was proposed (Porambage et al., 2018), which distributes computation resources to massive edge servers deployed in base stations, access points, and other edge infrastructures. However, highly concurrent offloading from massive MTDs may result in traffic congestion and edge server overload, adding extra delay and energy consumption. Meanwhile, continuously generating and offloading data throughout the manufacturing process poses a serious challenge for energy-constrained MTDs. Thus, to make full use of the computation resources of MTDs and edge servers while extending the energy lifetime of MTDs, a flexible resource allocation policy that integrates the resources of MTDs and edge servers is urgently required (Zhang YM et al., 2020; Ren et al., 2021). With end-edge orchestrated resource allocation, IWNs can support massive computation-intensive and delay-sensitive industrial applications simultaneously.
Generally, conventional model-driven algorithms require complete system information to establish an accurate system model and obtain an effective resource allocation policy. However, the dynamic and time-varying nature of end-edge orchestrated IWNs makes it difficult to collect the complete system information needed to establish such a model. In contrast, reinforcement learning (RL), as a self-learning artificial intelligence (AI) technology, uses agents interacting with IWNs to obtain local system information and approximate the system model (Shakarami et al., 2020; Wang et al., 2020). However, the local system information from distributed MTDs is voluminous and coupled, resulting in an explosive state space. To tackle the state space explosion, deep learning (DL) is used to express the relationships within local system information through interconnected weighted neurons. Thus, combining DL and RL, i.e., using deep reinforcement learning (DRL), to obtain the resource allocation policy not only exploits the self-learning advantage of RL but also uses DL to deal with state space explosion.
Previous works (Lin et al., 2019; Wei et al., 2019; Liu KH and Liao, 2020; Lu et al., 2020; Xiong et al., 2020; Chen et al., 2021; He et al., 2021; Liu XY et al., 2021) used DRL to optimize resource allocation in wireless networks mainly from the perspective of centralized intelligence. In other words, a single agent collects the global system information to approximate the system model and learn the resource allocation policy. However, in end-edge orchestrated IWNs, distributed MTDs are mobile, and the available resources of MTDs and edge servers are time-varying. In this case, it is difficult for a single agent to track the global system information. Besides, the delay and energy consumption incurred in collecting global system information may be intolerable for real-time applications. In contrast, distributed MTDs have potential advantages for achieving swarm intelligence. Every MTD acts as an independent and intelligence-endogenous (Zhang P et al., 2022) agent that can easily observe its local system information, and the cooperation among multiple MTDs (i.e., multiple agents) (Zhang KQ et al., 2021) can approximate the system model, achieve logical resource allocation, and adapt to the dynamic end-edge orchestrated IWNs.
Fully considering the advantages of multiple agents, we use multi-agent DRL (MADRL) to solve the resource allocation problem for end-edge orchestrated IWNs. In this work, we first establish the system model of end-edge orchestrated IWNs. Then, we formulate the minimum system overhead problem with joint optimization of delay and energy consumption, and apply the Markov decision process (MDP) to express it. Next, we propose an MADRL based resource allocation (MADRL-RA) algorithm, wherein the algorithm architecture, learning process, and computational complexity are presented in detail. Finally, in comparison with several benchmark algorithms, we verify the performance of MADRL-RA through many experiments.
The main contributions are summarized as follows:
1. Considering the diverse constraints on resource allocation, we apply an MDP to formulate the joint optimization of delay and energy consumption, and use MADRL to learn an effective resource allocation policy for minimizing the system overhead with respect to delay and energy consumption.
2. To ensure that the training data are independent and identically distributed while accelerating the learning process of MADRL-RA, we design a weighted experience replay to store and sample experiences categorically.
3. To balance the exploration and exploitation of knowledge about IWNs, we propose a step-by-step ε-greedy method to adjust the probabilities of exploration and exploitation dynamically.
For optimizing resource allocation in wireless networks, there are many non-AI algorithms (Guo et al., 2017; Tang and He, 2018; Zhang GL et al., 2018; Feng et al., 2019; Li et al., 2020) considering the joint optimization objectives of delay and energy consumption. However, these non-AI works require an accurate system model to be known, which is difficult to obtain for dynamic and time-varying end-edge orchestrated IWNs. Thus, we focus on DRL-based algorithms, including single- and multi-agent DRL algorithms.
For single-agent DRL algorithms, the base station usually acts as a single agent to optimize resource allocation. Previous works (Lin et al., 2019; Xiong et al., 2020) proposed improved value-based deep Q-network (DQN) algorithms to efficiently exploit the computation capacities of edge servers and significantly reduce the average delay and energy consumption. Similarly, Alfakih et al. (2020) proposed a value-based SARSA algorithm to deal with the issue of resource management and make an optimal offloading decision to minimize the system cost of energy consumption and computing delay. Several works (Lu et al., 2020; Chen et al., 2021; He et al., 2021) proposed improved deep deterministic policy gradient (DDPG) algorithms to optimize the service latency, energy consumption, and task success rate. In addition, several works (Wei et al., 2019; Liu KH and Liao, 2020) presented actor-critic-based algorithms to minimize the delay and energy consumption and optimize caching and channel access.
Mobile MTDs and time-varying resources result in the dynamic and time-varying nature of IWNs. It is difficult for single-agent DRL algorithms to cope with this nature and learn an optimal policy. Therefore, some multi-agent DRL algorithms (Foerster et al., 2016; Lowe et al., 2017; Rashid et al., 2018) have been proposed to solve the above problem, wherein distributed users act as agents. Through cooperation, distributed agents observe local system information to approximate the system model and learn the resource allocation policy. Several works have applied multi-agent DRL algorithms in industry. For example, Cao et al. (2020) used a multi-agent DDPG algorithm to design multi-channel access and offloading control in Industry 4.0, and Zhu et al. (2021) used the multi-agent DDPG algorithm to minimize the total task processing delay in a multi-vehicle environment. Chu et al. (2020) applied a multi-agent advantage actor-critic algorithm to adaptive traffic signal control in complex transportation networks.
As shown in Fig. 1, we consider an end-edge orchestrated IWN with N industrial base stations (IBSs) and M MTDs. All distributed IBSs constitute the set of IBSs N = {1, 2, ···, N}, and all distributed MTDs constitute the set of MTDs M = {1, 2, ···, M}. Each IBS is equipped with an edge server. MTDs are randomly distributed and generate heterogeneous data. Generally, these data differ in data size and required computation resource. We denote the data of the mth MTD as D_m = {d_m, c_m}, where d_m and c_m denote the data size and the required computation resource, respectively. Meanwhile, we denote the transmission power of the mth MTD as p_m, where p_m ∈ [0, P]. Specifically, 0 denotes zero transmission power, and P denotes the maximum transmission power.
Furthermore, the coverage radius of the nth IBS is denoted as μ_n. If the distance between the mth MTD and the nth IBS (denoted as l_nm) is less than μ_n, the mth MTD can communicate with the nth IBS. At any time-slot, each MTD can communicate with only one IBS. The communication rate between MTDs and IBSs varies depending on the bandwidth, transmission power, noise interference, and inter-cell interference. In detail, the communication rate between the mth MTD and the nth IBS can be expressed as
Fig. 1 Network model (IBS: industrial base station; MTD: machine-type device)
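The paper's rate expression is not preserved in this text; as a hedged illustration, a Shannon-capacity form consistent with the listed factors (bandwidth, transmission power, channel gain, noise, and inter-cell interference) can be sketched. All parameter names here are illustrative assumptions, not the paper's notation:

```python
import math

def comm_rate(bandwidth_hz, p_tx_w, channel_gain, noise_w, intercell_interf_w):
    """Sketch of the MTD-to-IBS uplink rate (bit/s), assuming the common
    B * log2(1 + SINR) form with inter-cell interference added to the
    noise in the SINR denominator."""
    sinr = (p_tx_w * channel_gain) / (noise_w + intercell_interf_w)
    return bandwidth_hz * math.log2(1.0 + sinr)

# Example: a 1 MHz channel with 0.1 W transmission power
rate = comm_rate(1e6, 0.1, 1e-6, 1e-9, 1e-8)
```

The rate grows with bandwidth and transmission power and shrinks with interference, matching the dependencies described in the text.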
3.2.2 Edge computing
If the mth MTD decides to offload data to the nth IBS, the data need to be transmitted via uplink networks. After processing the data in the nth IBS, the mth MTD downloads the processed data from the nth IBS via downlink networks. Because the size of the processed data is much smaller than d_m, the downlink delay and downlink energy consumption are very low. Hence, in edge computing, we ignore the downlink transmission process.
In edge computing, the delay includes the uplink transmission delay and the processing delay, expressed as
To make full use of the computation capacity while reducing the energy consumption, we optimize the weighted sum of delay and energy consumption, namely the system overhead. The joint optimization problem is formulated as follows:
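The formulation itself is not preserved in this text. A sketch of the weighted-sum objective, assuming T_m and E_m denote the delay and energy consumption of the mth MTD and ω the weight factor (these symbols are notational assumptions; the constraints C1-C5 are referenced later in the text):

```latex
\min_{\{a_m(t)\}} \; \sum_{m=1}^{M} \big[\, \omega\, T_m + (1-\omega)\, E_m \,\big],
\qquad \text{s.t. } \mathrm{C1}\text{--}\mathrm{C5},\quad \omega \in [0,1]
```

This form is consistent with the experiments, where the weight factor ω is varied (e.g., ω = 0.5) to trade delay against energy consumption.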
MTDs, as agents, interact with the end-edge orchestrated IWN and approximate the system model. During the interaction, MTDs change their states and obtain distinct rewards by performing different actions. By maximizing the long-term cumulative reward, we can solve the joint optimization problem (8). Next, we use an MDP to model the interaction process, which is described by the state, action, reward, and state transfer function.
4.2.1 State
Comprehensively, a_m,o(t) ∈ {0, 1, ···, N}, where a_m,o(t) = 0 indicates that the mth MTD processes data in end computing, and a_m,o(t) = n indicates that the mth MTD processes data in edge computing (i.e., it offloads data to the edge server at the nth IBS). Similarly, a_m,p(t) ∈ {0, 1, ···, P}, where a_m,p(t) = 0 indicates that the mth MTD processes data in end computing, and a_m,p(t) = p indicates that the mth MTD offloads data with transmission power p. Besides, if several MTDs simultaneously offload data to the same edge server, the computation capacity of the edge server is assigned equally to these MTDs (Dai et al., 2020). Therefore, a_m(t) is strictly constrained by C1-C5.
4.2.3 Reward
At time-slot t, each MTD obtains a reward r_m(t) by performing action a_m(t) at state s_m(t). r_m(t) comprises the delay reward r_m,d(t) and the energy consumption reward r_m,e(t), and it is expressed as
As shown in Eq. (12), when data D_m are processed in end computing, the reward r_m(t) is 0. On the contrary, when data D_m are processed in edge computing, the reward r_m(t) is a non-zero real value. If the delay in edge computing is lower than that in end computing, r_m,d(t) is a positive reward; otherwise, r_m,d(t) is a negative reward. Similarly, if the energy consumption in edge computing is lower than that in end computing, r_m,e(t) is a positive reward; otherwise, r_m,e(t) is a negative reward.
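The reward structure described above can be sketched in code. The combination of the delay and energy terms via the weight ω is an assumption for illustration; Eq. (12) itself is not reproduced here:

```python
def reward(end_delay, end_energy, edge_delay, edge_energy, offloaded, omega=0.5):
    """Sketch of r_m(t): 0 when data are processed locally (end computing);
    otherwise a delay reward plus an energy reward, each positive when edge
    computing beats end computing and negative otherwise."""
    if not offloaded:
        return 0.0
    r_delay = end_delay - edge_delay      # > 0 iff edge delay is lower
    r_energy = end_energy - edge_energy   # > 0 iff edge energy is lower
    return omega * r_delay + (1.0 - omega) * r_energy
```

For example, offloading that halves both delay and energy yields a positive reward, while offloading that doubles them yields a negative one, steering agents away from unprofitable offloading.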
Obviously, the joint optimization problem (8) is transformed into a reward with respect to delay and energy consumption. Next, by maximizing the long-term cumulative reward R_m(t), we can determine an effective resource allocation policy that minimizes the system overhead, i.e.,
where γ denotes a discount factor indicating the degree of influence of past rewards on the current reward, τ the past time-slots, and r(τ) the reward from s(τ) to s(τ+1).
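Based on the definitions above, the cumulative reward can be sketched as a discounted sum over past time-slots. This reconstruction is an assumption consistent with the descriptions of γ, τ, and r(τ), not the paper's exact equation:

```latex
R_m(t) = \sum_{\tau=0}^{t} \gamma^{\,t-\tau}\, r(\tau), \qquad 0 \le \gamma \le 1
```

With γ close to 0, only recent rewards matter; with γ close to 1, the full history contributes to R_m(t).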
4.2.4 State transfer function
With the formulated MDP, we further propose the MADRL-RA algorithm to address the joint optimization problem. MADRL-RA is an actor-critic framework based algorithm. Each MTD is an independent agent comprising an actor, a critic, and an experience memory. For an independent agent, the actor is used to generate an action, and the critic is used to guide the actor in generating a better action. The experience memory is used to store experiences for training the actors and critics. Next, we present the detailed MADRL-RA framework and the learning process (Fig. 2).
We use deep neural networks (DNNs) as the fundamental network framework for actors, comprising estimation actor networks and target actor networks. Similarly, critics use a DNN framework comprising estimation critic networks and target critic networks. Actor networks are policy-based DNNs, i.e., π(s|θ^π), and critic networks are value-based DNNs, i.e., Q(s, a|θ^Q). Target networks are used to generate the target values for training the policies of estimation networks. For any actor or critic, the network frameworks of the estimation network and the target network are the same, but their parameters are different. Specifically, for any actor or critic, the parameter of the target network copies the historical parameter of the estimation network, namely soft update, which can effectively avoid oscillation.
5.1.1 Actor network
Fig. 2 MADRL-RA framework
As shown in Fig. 3, for each MTD, the input of the mth estimation actor network is its current state s_m(t). Through three fully connected layers with the rectified linear unit (ReLU) activation function, the input is transformed into the current action a_m(t). Similarly, the input of the mth target actor network is its next state s_m(t+1), which is transformed into the next action a_m(t+1). In summary, each actor generates its own action a_m with policy π_m parameterized by θ^π_m, i.e., a_m = π_m(s_m|θ^π_m).
Fig. 3 Actor network
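The actor structure in Fig. 3 (three fully connected layers with ReLU, mapping a state to an action) can be sketched in plain NumPy. The hidden-layer widths and weight initialization are illustrative assumptions, not values from the paper:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

class ActorNetwork:
    """Minimal sketch of an actor: fully connected layers with ReLU
    activations, mapping a state s_m(t) to raw action scores a_m(t)."""

    def __init__(self, state_dim, action_dim, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        dims = [state_dim, hidden, hidden, action_dim]
        self.weights = [rng.normal(0, 0.1, (i, o)) for i, o in zip(dims, dims[1:])]
        self.biases = [np.zeros(o) for o in dims[1:]]

    def forward(self, state):
        x = state
        for w, b in zip(self.weights[:-1], self.biases[:-1]):
            x = relu(x @ w + b)
        # final layer: raw scores over actions (e.g., offloading targets
        # and power levels); no activation, as in a linear output layer
        return x @ self.weights[-1] + self.biases[-1]
```

The target actor network has the identical structure; only its parameters differ, being soft-updated copies of the estimation network's parameters.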
To learn more knowledge from the unknown IWN, MTDs need to balance exploitation and exploration. Exploitation means that MTDs use the learned knowledge by greedily taking the action with the maximum value, and exploration means that MTDs acquire unknown knowledge by taking actions randomly. During the learning process, we adopt the ε-greedy method to balance the taken action, i.e.,
Specifically, with a small value of ε, MTDs prefer exploitation. On the contrary, with a high value of ε, MTDs prefer exploration. However, with a high value of ε maintained for a long time, the learning process of MADRL-RA will oscillate, making it difficult to learn an effective resource allocation policy. To explore more unknown knowledge while avoiding oscillation, we design a step-by-step ε-greedy method. Initially, MTDs have little knowledge, and they perform more random actions with a high value of ε. As MTDs gradually acquire enough knowledge, the value of ε decreases, and MTDs prefer to exploit the learned knowledge and perform the action with the maximum value for learning an effective resource allocation policy. The variation of ε satisfies 0 < ε ≤ 1, where β denotes the decrease rate of exploration, ε_0 the initial exploration value, and U the number of training times.
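The step-by-step schedule can be sketched as follows. The paper's exact decay law is not reproduced in this text, so a linear decay floored at a minimum value is assumed here; `eps0`, `beta`, and `eps_min` are illustrative parameters:

```python
import random

def epsilon(u, eps0=1.0, beta=0.001, eps_min=0.01):
    """Step-by-step epsilon schedule (a sketch): epsilon starts at eps0
    and decreases with the number of training times u at rate beta."""
    return max(eps_min, eps0 - beta * u)

def select_action(q_values, u):
    """epsilon-greedy selection: explore a random action with probability
    epsilon(u); otherwise exploit the action with the maximum value."""
    if random.random() < epsilon(u):
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Early in training (small u) the agent mostly explores; late in training it almost always takes the maximum-value action, matching the behavior described above.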
5.1.3 Parameter update
With these actions and Q-values, the parameters of the actor networks and critic networks are updated by stochastic gradient descent. Specifically, the parameter of the mth estimation actor network (i.e., θ^π_m) is updated as
Fig. 4 Critic network
where λ ∈ [0, 1].
MADRL-RA requires sufficient training data to learn an effective resource allocation policy. Thus, we use the experience E, comprising the state, action, reward, and next state, as the training data for actors and critics. Each MTD owns an experience memory H storing the experiences. At time-slot t, the experience of the mth MTD is expressed as
To ensure the effective learning of MADRL-RA, the training data sampled from the experience memory must be independent and identically distributed. Therefore, we use experience replay to randomly sample experiences and break their time correlation. In the classical experience replay, the sampling probability of any experience is the same. However, we consider that different experiences contribute differently to the learning process and the convergence of MADRL-RA, and hence use the temporal difference error δ_h as the weight of the hth experience (Schaul et al., 2016), i.e.,
y_h = |δ_h| + ζ, (25)
where ζ denotes a positive real value approximately equal to 0, which enables experiences with δ_h = 0 to be sampled.
To simplify the proposed experience replay, we design a weighted experience replay with two sub-experience memories (called memory A and memory B), where memory A and memory B store high- and low-weight experiences, respectively. Specifically, we adopt the average weight (denoted as ȳ) to distinguish high- from low-weight experiences, expressed as
If y_h ≥ ȳ, the hth experience is stored in memory A; otherwise, it is stored in memory B. Specifically, at the beginning of training, memory A and memory B have the same sampling probability. To accelerate the learning process and the convergence of MADRL-RA, the sampling probability of memory A increases with the number of training times, while the sampling probability of memory B decreases. Thus, the sampling probability g_x is expressed as
where x ∈ {A, B}, 0 ≤ g_x ≤ 1, g_0 denotes the initial sampling probability of memory A or B, and g_dx denotes the sampling decay rate of memory A or B.
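The two-memory scheme can be sketched as follows. The capacity and the constants `g0` and `gd` are illustrative assumptions, as is the linear growth of memory A's sampling probability; only the weighting rule y_h = |δ_h| + ζ and the A/B split by average weight follow the text:

```python
import random
from collections import deque

class WeightedExperienceReplay:
    """Sketch of the weighted experience replay: experiences whose weight
    y_h = |delta_h| + zeta is at or above the running average go to
    memory A (high weight), the rest to memory B (low weight)."""

    def __init__(self, capacity=10000, zeta=1e-6, g0=0.5, gd=1e-4):
        self.mem_a = deque(maxlen=capacity)  # high-weight experiences
        self.mem_b = deque(maxlen=capacity)  # low-weight experiences
        self.zeta, self.g0, self.gd = zeta, g0, gd
        self.weight_sum, self.count = 0.0, 0

    def store(self, experience, td_error):
        y = abs(td_error) + self.zeta            # Eq. (25)
        self.weight_sum += y
        self.count += 1
        avg = self.weight_sum / self.count       # running average weight
        (self.mem_a if y >= avg else self.mem_b).append(experience)

    def sample(self, batch_size, u):
        # probability of drawing from memory A rises with training time u
        g_a = min(1.0, self.g0 + self.gd * u)
        batch = []
        for _ in range(batch_size):
            use_a = self.mem_a and (random.random() < g_a or not self.mem_b)
            source = self.mem_a if use_a else self.mem_b
            batch.append(random.choice(source))
        return batch
```

At u = 0 both memories are sampled with equal probability; as u grows, high-weight (high TD-error) experiences dominate the training batches, which is the intended acceleration effect.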
With the proposed weighted experience replay, experiences are sampled as training data. Among these training data, s_m(t) is fed into the mth estimation actor network and transformed into the current action a_m(t). Further, S and A are fed into the mth estimation critic network and transformed into the current Q-value Q^π_m(S, A|θ^Q_m). Similarly, s_m(t+1) is fed into the mth target actor network
Algorithm 1 Training process of MADRL-RA
1: for u = 0, 1, ..., U do
2:   Sample L experiences from memory A and memory B as training data
3:   for m = 0, 1, ..., M do
4:     Input s_m(t) to the mth estimation actor network, and obtain a_m(t) = π_m(s_m(t)|θ^π_m)
5:     Transfer from s_m(t) to s_m(t+1) by performing a_m(t), and obtain reward r_m(t)
6:     Store or update s_m(t), a_m(t), r_m(t), and s_m(t+1) in the experience memories
7:     Input S and A to the mth estimation critic network, and obtain Q^π_m(S, A|θ^Q_m)
8:     Input s_m(t+1) to the mth target actor network, and obtain a_m(t+1) = π'_m(s_m(t+1)|θ^π'_m)
9:     Input S' and A' to the mth target critic network, and obtain Q^π'_m(S', A'|θ'^Q_m)
10:    Update θ^π_m and θ^Q_m
11:    Set s_m(t) as s_m(t+1)
12:  end for
13:  Update θ'^π_m and θ'^Q_m
14: end for
Specifically, during the training of MADRL-RA, each actor needs its own state and the Q-value from the corresponding critic, while the critic needs the states and actions of all actors. After the training process is completed, the execution process needs only the actors, and each actor can make an effective action according to its own state.
To characterize the efficiency of MADRL-RA, we further evaluate the computational complexity. Actors and critics are all DNN-based, and the computational complexity (Naparstek and Cohen, 2019) is expressed as
where O(G) denotes the computational complexity of DNNs, F the number of layers, and d_f the number of neurons at the fth layer. Specifically, the computational complexity of the actor network is expressed as O_a(G), and that of the critic network as O_c(G).
In offline training, with U episodes, L experiences, and M agents, the computational complexities of actors and critics are O_a(GLUM) and O_c(GLUM), respectively. The high computation overhead of the training process is consumed offline, which does not interfere with the real-time execution in manufacturing processes. After the offline training, MADRL-RA can be implemented for online applications with only actors. During online execution, the computational complexity of MADRL-RA is O_a(G).
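The complexity bookkeeping above can be made concrete. The per-inference cost of a fully connected DNN is commonly taken as the sum of products of adjacent layer widths; this standard estimate is assumed here (the paper's exact expression is not reproduced):

```python
def dnn_complexity(layer_widths):
    """Multiply-accumulate count for one forward pass of a fully
    connected DNN with layer widths d_1..d_F: the sum of
    d_f * d_{f+1} over adjacent layers, proportional to O(G)."""
    return sum(i * o for i, o in zip(layer_widths, layer_widths[1:]))

def training_complexity(g, L, U, M):
    """Offline training cost scaling O(G*L*U*M): per-sample cost g,
    L experiences per batch, U episodes, M agents."""
    return g * L * U * M

# Online execution uses only the actor: cost is dnn_complexity(actor_widths)
# per decision, independent of L, U, and M.
```

This makes the offline/online asymmetry explicit: training scales with the product GLUM, while each online decision costs only one actor forward pass.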
In this section, many experiments are performed to verify the performance of MADRL-RA in diverse experimental scenarios, run with TensorFlow-GPU-1.14.0 and Python-3.7 on a desktop powered by an Intel Xeon W2245 CPU and an NVIDIA Titan RTX GPU.
We consider a scalable IWN with a mix of heterogeneous industrial data; the key parameters and their values (Wei et al., 2019) are summarized in Table 1. Similarly, the parameters and values of MADRL-RA (Cao et al., 2020) are summarized in Table 2.
To verify the effectiveness of MADRL-RA, we compare MADRL-RA with the following benchmark algorithms (Dai et al., 2020; Lu et al., 2020):
Table 1 Key parameters of IWN
Table 2 Key parameters of MADRL-RA
1. Full end computing (FEC): an algorithm in which all MTDs process data in end computing.
2. Nearest edge computing (NEC): an algorithm wherein all MTDs offload their data to the nearest IBSs.
3. DQN: a single-agent DRL algorithm wherein a central agent allocates resources to MTDs.
4. Greedy: an algorithm that exhaustively enumerates all possible resource allocation actions and selects the action with the minimum system overhead.
Reward is the metric used to measure the effectiveness of DRL-based algorithms. Fig. 5 shows the convergence trends of MADRL-RA and DQN. With the help of weighted experience replay, MADRL-RA converges faster and obtains a higher reward than DQN. Specifically, the convergence rate of MADRL-RA is 35% higher than that of DQN, and the reward of MADRL-RA is 20% higher than that of DQN. Because FEC, NEC, and Greedy are not DRL-based algorithms, no reward is shown for them in Fig. 5. Next, we adopt MADRL-RA and DQN as the representative algorithms in the DRL field to compare them with the classical FEC, NEC, and Greedy algorithms.
Fig. 6 describes the relationship between the system overhead and the computation capacity of edge servers, where M = 10 and ω = 0.5. Because end computing depends only on the local computation capacity, the system overhead of FEC is a fixed value, regardless of the computation capacity of edge servers. On the contrary, as the computation capacity of edge servers increases, the system overheads of NEC, Greedy, DQN, and MADRL-RA all gradually decrease, while the system overheads of Greedy and MADRL-RA are always the smallest. Specifically, when the computation capacity of edge servers is weak (e.g., smaller than 2 GHz), the transmission cost outweighs the offloading benefit, and MTDs prefer to perform end computing, resulting in the system overhead of FEC being smaller than that of NEC. In this case, MADRL-RA can learn an effective resource allocation policy, achieving the same minimum system overhead as Greedy. However, when the computation capacity of edge servers is strong (e.g., larger than 8 GHz), the offloading benefit outweighs the transmission cost, and the system overhead of NEC is smaller than that of FEC. In this case, offloading in NEC depends only on the distance between MTDs and IBSs, which triggers edge server overload and high system overhead. Fortunately, MADRL-RA can compromise between the distance and the computation capacity and learn an effective resource allocation policy, achieving a smaller system overhead than NEC.
Fig. 5 Normalized reward
Fig. 7 shows the relationship between the system overhead and the data size, where M = 10, ω = 0.5, and F_n = 10 GHz. Because end computing does not require transmitting data, changing the data size does not affect the system overhead of FEC. On the contrary, with the increase of data size, the system overheads of NEC, Greedy, DQN, and MADRL-RA all gradually increase, while the system overheads of MADRL-RA and Greedy are the smallest. When the data size changes from 500 to 3000 KB, the system overheads of MADRL-RA, Greedy, and DQN are much smaller than those of NEC and FEC. However, when the data size is larger than 70 MB, the offloading benefit does not compensate for the transmission cost, and the system overhead of MADRL-RA approaches that of FEC.
Fig. 6 System overhead vs. computation capacity of edge servers (M = 10, ω = 0.5)
Fig. 7 System overhead vs. data size (M = 10, ω = 0.5, F_n = 10 GHz)
Fig. 8 shows the relationship between the system overhead and the required computation resources, where M = 10, ω = 0.5, and F_n = 10 GHz. With the increase of the required computation resources, the system overheads of FEC, NEC, Greedy, DQN, and MADRL-RA gradually increase, and the system overhead of MADRL-RA is always the smallest, the same as that of Greedy. Specifically, when the required computation resource is smaller than 5000 MHz, the local computation capacity is sufficient for processing data in end computing. Therefore, the system overhead of FEC is smaller than that of NEC. On the contrary, when the required computation resource is larger than 5000 MHz, the strong computation capacity of edge servers provides a high offloading benefit, which makes the system overhead of NEC smaller than that of FEC.
Fig. 9 System overhead vs. number of MTDs (F_n = 10 GHz)
Fig. 8 System overhead vs. required computation resources (M = 10, ω = 0.5, F_n = 10 GHz)
Fig. 9 describes the relationship between the system overhead and the number of MTDs, where F_n = 10 GHz. With the increase in the number of MTDs, the system overheads of FEC, NEC, Greedy, DQN, and MADRL-RA gradually increase. Specifically, when the number of MTDs is around 10, the computation capacity that MTDs obtain from edge servers is roughly equal to their local computation capacity. Therefore, in this case, the system overheads of FEC, NEC, Greedy, DQN, and MADRL-RA are similar. However, when the number of MTDs is larger than 15, the high concurrent offloading of massive MTDs may result in edge server overload. In this case, the computation capacity that MTDs obtain from edge servers is smaller than their local computation capacity. Thus, the system overhead of NEC increases significantly. When the number of MTDs exceeds 30, the offloading benefit does not compensate for the transmission cost, and the system overheads of FEC, DQN, and MADRL-RA are the same; i.e., the effective resource allocation policy learned by MADRL-RA is FEC.
To verify the effect of ω on the computing decision and transmission power of MTDs with different data sizes and required computation resources, we consider four types of data (Table 3).
Table 3 Data types
Fig. 10 describes the relationships between the end computing ratio and ω and between the edge computing ratio and ω, where the solid and dashed lines represent the end computing ratio and the edge computing ratio, respectively. The end computing ratio is the ratio of the number of MTDs in end computing to the total number of MTDs, and the edge computing ratio is the ratio of the number of MTDs in edge computing to the total number of MTDs. As ω increases, delay becomes increasingly dominant in the system overhead. Type 1 and type 3 both require high computation resources, but they differ in data size. Specifically, with ω increasing, type 1 and type 3 prefer edge computing to obtain strong computation capacity and reduce the processing delay; i.e., the edge computing ratios of type 1 and type 3 gradually increase. Because the data size of type 3 is smaller than that of type 1, the edge computing ratio of type 3 is higher than that of type 1. On the contrary, type 2 and type 4 both require low computation resources and differ in data size. With ω increasing, because the data size of type 2 is larger than that of type 4, type 2 prefers end computing to reduce the transmission delay. Thus, the end computing ratio of type 2 gradually increases and is higher than that of type 4.
Correspondingly, Fig. 11 describes the relationship between the transmission power ratio and the value of ω, where the transmission power ratio represents the ratio of the transmission power used by MTDs to the maximum transmission power (i.e., p_m/P). We still consider the four types of data in Table 3. Type 1 and type 3 require high computation resources and prefer edge computing, resulting in high transmission power ratios. As ω increases, the transmission power ratios of type 1 and type 3 also increase. Specifically, because the data size of type 3 is smaller than that of type 1, the edge computing ratio of type 3 is higher, resulting in the higher transmission power ratio of type 3. Type 2 and type 4 require low computation resources and prefer end computing, resulting in smaller transmission power ratios compared with type 1 and type 3. Specifically, because the data size of type 2 is larger than that of type 4, the end computing ratio of type 2 is higher, resulting in the smaller transmission power ratio of type 2.
Fig. 10 End computing ratio and edge computing ratio vs. ω
Fig. 11 Transmission power ratio vs. ω
In this study, we regarded distributed MTDs as multiple self-learning agents to deal with dynamic and time-varying end-edge orchestrated IWNs, and proposed the MADRL-RA algorithm to learn an end-edge orchestrated resource allocation policy that minimizes the system overhead with respect to delay and energy consumption. Compared with FEC, NEC, and DQN, MADRL-RA can learn an effective resource allocation policy and adapt to changes in the computation capacity of edge servers, data size, required computation resource, and number of MTDs. Moreover, by setting different weight factors, we can optimize resource allocation to satisfy heterogeneous industrial applications with different delay or energy consumption requirements. Compared with model-driven algorithms, the offline training of MADRL-RA requires a certain computation overhead, but the online execution of MADRL-RA can achieve flexible and autonomous resource allocation with low computational complexity. In future work, we will apply MADRL-RA in practical IWNs for real end-edge orchestrated resource allocation.
Contributors
Xiaoyu LIU, Chi XU, and Haibin YU designed the research. Xiaoyu LIU processed the data and drafted the paper. Chi XU, Haibin YU, and Peng ZENG helped organize the paper. Xiaoyu LIU and Chi XU revised and finalized the paper.
Compliance with ethics guidelines
Xiaoyu LIU, Chi XU, Haibin YU, and Peng ZENG declare that they have no conflict of interest.
Frontiers of Information Technology & Electronic Engineering, 2022, Issue 1