Yu Zhao, Joohyun Lee*, Wei Chen
1 Department of Electrical and Electronic Engineering, Hanyang University, Ansan 15588, South Korea
2 Department of Electronic Engineering and Beijing National Research Center for Information Science and Technology, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing 100084, China
Abstract: This paper proposes a reinforcement learning (RL) algorithm to find an optimal scheduling policy that minimizes the delay under a given energy constraint in a communication system where the environment, such as the traffic arrival rate, is not known in advance and can change over time. For this purpose, the problem is formulated as an infinite-horizon constrained Markov decision process (CMDP). To handle the constrained optimization problem, we first adopt the Lagrangian relaxation technique. Then, we propose a variant of Q-learning, Q-greedyUCB, that combines the ε-greedy and Upper Confidence Bound (UCB) algorithms to solve this CMDP. We mathematically prove that the Q-greedyUCB algorithm converges to an optimal solution. Simulation results show that Q-greedyUCB finds an optimal scheduling strategy and is more efficient than Q-learning with ε-greedy, R-learning, and the Average-payoff RL (ARL) algorithm in terms of cumulative regret. We also show that our algorithm can learn and adapt to changes of the environment, so as to obtain an optimal scheduling strategy under a given power constraint for the new environment.
Keywords: reinforcement learning for average rewards; infinite-horizon Markov decision process; upper confidence bound; queue scheduling
With the emergence of the fifth generation (5G) of cellular networks, a large number of communication base stations and billions of communication devices are being connected to each other. For example, vehicular communication connects vehicles to vehicles and vehicles to infrastructure, as shown in Figure 1. It can provide diverse services such as entertainment, timely road information, and navigation over the Internet [1, 2]. Therefore, low latency and low energy consumption are essential requirements for a communication system [3, 4].
Figure 1. An application scenario.
Figure 2. System model.
For wireless communication systems, if the channel conditions do not change over time, the energy consumption is a convex and monotonically increasing function of the service rate [5–7]. This means that the per-bit energy consumption increases as the service rate increases or the latency decreases. Therefore, there exists a trade-off between latency and energy consumption. In this work, we aim to find an optimal scheduling policy that minimizes the latency for a given energy constraint, and we propose a reinforcement learning (RL)-based method to solve this problem.
Scheduling under delay or energy constraints has been studied in [5–13]. Among these studies, the Linear Programming (LP) method is adopted to obtain the optimal trade-off between delay and power consumption in [5, 6, 8–10]. The Lagrangian relaxation technique has also been applied in [5, 6, 9] to convert a constrained optimization problem into an unconstrained one. To minimize the average delay under an energy budget for communication networks, Chen et al. [5, 6] modeled this problem as a constrained Markov decision process (CMDP) and proposed an algorithm to efficiently obtain the optimal delay-power trade-off; the Dynamic Programming (DP) method is used there to compute the optimal scheduling policy. Wang et al. [8] applied the LP method to solve the optimization problem and used a probabilistic scheduling strategy to address the delay-power trade-off. Similarly, to reduce the latency under an energy consumption constraint in a communication system with Markov arrivals, Zhao et al. [9] proposed a threshold-based algorithm. In [10], the authors proposed a stochastic scheduling policy that minimizes the average latency under power constraints based on the channel and buffer states. In [11], an RL algorithm is presented to solve scheduling problems in an online manner with a faster learning rate. A potential application scenario for the above work is vehicular communication.
Existing studies on the delay-power optimization problem mostly rely on conventional methods without a learning algorithm. A challenge with these methods is that the algorithm needs to be re-executed whenever the environment changes over time. Moreover, the DP method requires perfect knowledge of the environment and stationarity, which is rarely available in practice. To this end, in this paper we propose a novel model-free RL algorithm that obtains an optimal scheduling policy under a given power constraint for time-varying environments, minimizing the average delay.
RL is widely used to solve optimization problems when the system model is unknown (i.e., when the state transition probability distribution is unknown). The Q-learning algorithm [14] is a widely used model-free RL algorithm. Several studies have proposed variants of Q-learning to improve its performance, such as speedy Q-learning [15], delayed Q-learning [16], and HAQL [17]. In particular, Jin et al. proposed Q-learning with the Upper Confidence Bound (UCB) exploration policy and proved that it obtains the optimal regret for a finite-horizon MDP with discounted rewards [18]. Dong et al. proposed Q-learning with UCB for an infinite-horizon MDP with discounted rewards and proved that it outperforms state-of-the-art RL algorithms [19]. However, most studies have focused on discounted rewards; that is, the goal of the agent is to maximize the cumulative discounted reward, which is limiting. As Mahadevan showed in [20], when coping with cyclical tasks, this criterion can make the agent select a second-best action if the short-run cumulative reward is more attractive than the long-run one.
An alternative is to maximize the average reward. Schwartz [21] proposed a novel algorithm, R-learning, that focuses on the average reward to compensate for the shortcomings mentioned above. The R-learning algorithm finds an optimal policy by estimating the value of the state-action pair R(s, a). The estimated average reward is updated only when a greedy action is performed. Singh et al. [22] noted that R-learning is inefficient because it does not update the estimated average reward in every time step. An enhanced version of R-learning, the Average-payoff RL (ARL) algorithm, is proposed in [22], in which the estimated average reward is updated at each time step. However, the authors did not compare the performance of ARL and R-learning through practical experiments, nor were their algorithms proved to be convergent.
In [23], Q-learning with ε-greedy exploration for the average-reward criterion is employed to obtain the optimal scheduling policy in a single-queue-single-server communication system. This paper presents a new RL algorithm, named Q-greedyUCB, for an infinite-horizon CMDP with the average-reward criterion, in order to obtain a higher reward during the learning period. We formulate the problem as a CMDP, and then the Lagrangian relaxation technique is used to convert the constrained optimization problem into an unconstrained one. We also mathematically prove the convergence of the proposed algorithm and compare its performance with Q-learning with ε-greedy, R-learning, and the ARL algorithm. Our proposed algorithm is shown via simulation to be more efficient than these existing algorithms. The main contributions of this paper are as follows:
• We propose a new RL algorithm, Q-greedyUCB, to obtain an optimal scheduling policy that minimizes the delay under a given power constraint in a single-queue single-server communication system when the environment is not known and changes over time.
• In Q-greedyUCB, we combine the ε-greedy and UCB algorithms to balance exploration and exploitation, which improves on traditional Q-learning that uses only the ε-greedy algorithm. It is mathematically proved that the proposed algorithm converges to an optimal scheduling policy.
• Through simulation, we show that Q-greedyUCB outperforms Q-learning, R-learning, and ARL [22] in terms of cumulative regret. The simulation results also show that when the environment changes over time, the algorithm can learn and adapt to the changes, so as to obtain an optimal scheduling strategy.
The remainder of the paper is organized as follows. We describe the system model in Section II, where the queue scheduling problem is formulated as an infinite-horizon MDP with constraints. In Section III, the RL methodology is presented in detail, including the Upper Confidence Bound, the Q-greedyUCB algorithm for the average reward, and how to apply the Lagrangian multiplier method to solve the constrained optimization problem. The simulation results are given in Section IV. Concluding remarks and discussions are provided in Section V.
In this paper, we consider a single-queue-single-server system over a block fading channel, as shown in Figure 2. Time is divided into discrete time slots, i.e., t ∈ {1, 2, ...}. We consider a general data arrival scenario where the data arrival rate can follow any distribution. Let τ[t] denote the number of data packets arriving at time slot t. The probability mass function of τ[t] is given by

Pr{τ[t] = m} = α_m,  m = 0, 1, ..., M,
where α_m ∈ [0, 1] and M refers to the maximum number of data packets arriving in each time slot. The average data arrival rate can be calculated as ᾱ = Σ_{m=0}^{M} m·α_m. We also define α = [α_1, α_2, ..., α_M].
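As an illustration of this arrival model, the following minimal Python sketch samples τ[t] from a categorical distribution and checks the empirical mean against ᾱ; the vector `alpha` (which here includes the probability of zero arrivals) and all helper names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative arrival distribution: alpha[m] = Pr{tau[t] = m}, m = 0, ..., M.
alpha = np.array([0.45, 0.05, 0.05, 0.35, 0.10])  # entries must sum to 1
M = len(alpha) - 1

# Average data arrival rate: sum of m * alpha[m].
avg_rate = float(np.sum(np.arange(M + 1) * alpha))

def sample_arrivals(rng: np.random.Generator) -> int:
    """Draw the number of packets arriving in one time slot."""
    return int(rng.choice(M + 1, p=alpha))

rng = np.random.default_rng(0)
samples = [sample_arrivals(rng) for _ in range(10_000)]
print(avg_rate, np.mean(samples))  # the empirical mean should be close to avg_rate
```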
Arriving data packets are first added to a finite buffer with a maximum size of B. The queue length at time slot t is denoted by q[t]. Define c[t] as the number of packets to be transmitted in time slot t. Due to the limitation of the transmitter, it can transmit up to C data packets per time slot, and thus c[t] ∈ {0, 1, ..., C}. Therefore, the queue length in the next time slot is given by

q[t+1] = q[t] − c[t] + τ[t].    (2)
To prevent buffer overflow or underflow, c[t] needs to satisfy 0 ≤ q[t] − c[t] ≤ B − M. Moreover, the maximum buffer size and the maximum number of arriving packets should satisfy B ≥ M. Therefore, for a given queue length q[t], c[t] should satisfy max(0, q[t] − B + M) ≤ c[t] ≤ min(q[t], C). From Eq. (2), we can see that the future queue length depends only on the present queue length and the service rate. In other words, this process satisfies the Markov property, and hence the system can be viewed as an MDP, which we will formally define in Section 3.1.
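A short sketch of the queue dynamics and the feasible service range above; the parameter values in the usage example (B = 6, M = 3, C = 3) are the ones discussed later in Section 3.1, and the function names are only illustrative.

```python
def feasible_actions(q: int, B: int, M: int, C: int) -> list:
    """Feasible service rates c[t] for queue length q:
    max(0, q - B + M) <= c[t] <= min(q, C)."""
    return list(range(max(0, q - B + M), min(q, C) + 1))

def next_queue(q: int, c: int, tau: int) -> int:
    """Queue evolution of Eq. (2): q[t+1] = q[t] - c[t] + tau[t]."""
    return q - c + tau

# Usage with the example of Section 3.1 (B = 6, M = 3, C = 3):
print(feasible_actions(1, 6, 3, 3))  # [0, 1]
print(feasible_actions(2, 6, 3, 3))  # [0, 1, 2]
```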
We assume the channel state information is known to the transmitter via a feedback channel [24]. We consider that the wireless channel experiences H-state block fading, where H is a positive integer. Let h[t] denote the channel state at time slot t, which takes values in the set of channel states {h_1, h_2, ..., h_H}. We consider that the channel state h[t] in each time slot is independent and identically distributed over time. Thus, the probability mass function of h[t] can be expressed as Pr{h[t] = h_ζ} = ω_ζ, where ω_ζ ∈ [0, 1] and ζ = 1, ..., H for the block fading channel. We also define ω = [ω_1, ω_2, ..., ω_H]. The Scheduler determines the number of data packets to transmit based on the current queue length q[t] and channel state h[t].
In many communication systems, the latency decreases as the instantaneous service rate increases, while the power consumption increases. Our goal is to achieve the minimum latency subject to the energy constraint. Therefore, we have the following CMDP problem:

P1:  min_π D^π,  subject to E^π ≤ E_th,    (3)
where π represents a scheduling strategy, which describes a mapping from a state to probabilities of choosing the available actions. D^π and E^π denote the average latency and energy consumption under strategy π, respectively. The average power constraint is denoted by E_th. We will show in Section III that the queue length corresponds to the latency and will be considered as part of the reward function.
In this section, we first formulate the decision-making problem as an infinite-horizon CMDP, and then we describe the Q-greedyUCB algorithm and analyze its convergence property. Finally, we introduce how to update the Lagrange multiplier for a given power constraint.
We build the RL framework on top of the system model presented in Section II, as shown in Figure 3, where the Scheduler is the agent, the queue length q[t] corresponds to the state s_t, and the number of packets to transmit in time slot t, c[t], corresponds to the action a_t.
Figure 3. The RL structure of a communication system.
Figure 4. The process of obtaining an optimal Lagrange multiplier and an optimal scheduling policy that satisfies E^{π*_λ} ≤ E_th.
In this paper, the scheduling problem is modeled as an infinite-horizon CMDP. Our CMDP model is defined by a 4-tuple (𝒮, 𝒜, P, R), where 𝒮 denotes the finite set of states (i.e., the state space), 𝒜 represents the finite set of actions (i.e., the action space), and P represents the transition probability matrix, whose entry P(s′|s, a) is the probability of moving from the current state s to the next state s′ under action a. The transition probabilities are not known since the environment information is not known in advance. R is the reward matrix, where R(s, a) denotes the immediate reward for the present state s under action a. The number of states and the number of actions are denoted by S and A, respectively. For example, suppose that we set B = 6, M = 3, and C = 3. If the current state is s = 1 (i.e., the queue length equals 1), then the action space is A(1) = {0, 1}. If the current state is s = 2 (i.e., the queue length equals 2), then the action space is A(2) = {0, 1, 2}.
Problem P1 in Eq. (3) is a constrained optimization problem. In order to obtain the optimal solution, we use the Lagrange multiplier method. By using the Lagrangian relaxation technique, Eq. (3) is transformed into the following unconstrained problem:

P2:  min_π  D^π + λ(E^π − E_th),    (4)
where λ ≥ 0 is the Lagrangian multiplier.
Since the term λE_th has no effect on Eq. (4) for a given E_th and a specific λ, it can be removed when we solve P2. Then, Eq. (4) becomes

min_π  D^π + λE^π.    (5)
Let d_t and e_t be the delay and the energy consumption in time slot t, respectively. Then, the average delay D^π and the average energy consumption E^π under policy π are given by

D^π = lim_{T→∞} (1/T) E_π[ Σ_{t=1}^{T} d_t ],   E^π = lim_{T→∞} (1/T) E_π[ Σ_{t=1}^{T} e_t ].    (6)
Then, we define the immediate reward R(s_t, a_t) as the negative weighted sum of the delay and the energy consumption in time slot t, i.e.,

R(s_t, a_t) = −(d_t + λ e_t).
We use R_π(s_t, a_t) when we indicate the reward under a specific policy π. Little's law states that the queue length is equal to the average delay times the data arrival rate [25]. Therefore, the delay can be expressed as d_t = q[t]/ᾱ (i.e., the queue length is proportional to the average delay). The power consumption under different channel conditions is calculated by a function of the service rate c[t] and the channel state h[t], which is a widely used power consumption model in wireless communication systems [9].
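To make the reward computation concrete, the sketch below evaluates R(s_t, a_t) = −(d_t + λe_t) with d_t = q[t]/ᾱ from Little's law; the power function `power(c, h)` is only a placeholder for the (convex, increasing) power model used in the paper, whose exact form is not reproduced here.

```python
def power(c: float, h: float) -> float:
    """Placeholder power model: convex and increasing in the service rate c,
    scaled by the channel state h (assumed form, for illustration only)."""
    return (2.0 ** c - 1.0) / h

def reward(q: int, c: int, h: float, avg_rate: float, lam: float) -> float:
    """Immediate reward R(s_t, a_t) = -(d_t + lambda * e_t),
    where d_t = q / avg_rate (Little's law) and e_t = power(c, h)."""
    delay = q / avg_rate
    energy = power(c, h)
    return -(delay + lam * energy)
```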
RL requires an exploration mechanism that feeds experience (samples) for diverse actions. The most common method is the ε-greedy policy, which chooses an action randomly with some probability ε and chooses the historically best action with probability 1 − ε [26]. However, due to the randomness of the ε-greedy policy, the second-best action and the worst action are chosen with the same probability. To avoid such inefficient exploration, the upper confidence bound (UCB) method [27] is used, which has been proven to be asymptotically optimal for maximizing the expected cumulative reward. UCB is a type of multi-armed bandit algorithm that assigns a confidence bonus to each action based on the action values observed so far. The higher the bonus assigned to an action, the lower the agent's confidence in its value estimate, and the more frequently the action will be selected in the future [27]. There are many variants of the UCB algorithm. Here, we use the most standard one:

a_t = arg max_{a ∈ A(s_t)} { μ(s_t, a) + σ√(ι(k)/k) },
where μ(s_t, a_t) denotes the average reward of action a_t at state s_t up to time slot t, and k indicates the number of times that (s_t, a_t) has been selected up to time t. We define the square-root term σ√(ι(k)/k) as the confidence bonus, which measures the uncertainty of the empirical mean of the current state-action pair (s_t, a_t) in time slot t. In other words, this term measures how much larger the true mean of this action can be than the empirical mean. We will also use b_k(s_t, a_t) to denote the confidence bonus term of the state-action pair (s_t, a_t). The parameter σ > 0 controls the degree of exploration. ι(k) = ln(SAkt/δ) is a log factor, where the value δ is approximately an upper bound on the probability that the confidence bound fails. If δ is chosen to be very small, the algorithm will exploit more (i.e., choose the best action in its history), and if δ is large, the algorithm will explore more frequently. Next, we formally describe the Q-greedyUCB algorithm.
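The confidence bonus b_k can be computed directly from the quantities defined above; a minimal sketch, where `n_states` and `n_actions` stand for S and A, and the default values of σ and δ follow the simulation settings in Section IV.

```python
import math

def confidence_bonus(k: int, t: int, n_states: int, n_actions: int,
                     sigma: float = 4.0, delta: float = 0.01) -> float:
    """b_k = sigma * sqrt(iota(k) / k), with iota(k) = ln(S * A * k * t / delta).
    Assumes k >= 1, i.e., the state-action pair has been visited at least once."""
    iota = math.log(n_states * n_actions * k * t / delta)
    return sigma * math.sqrt(iota / k)
```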
From problem P1 and Eq. (6), we can see that this is an infinite-horizon CMDP with an average-reward criterion. This implies that the algorithms proposed in [18, 19] (i.e., traditional Q-learning with the UCB algorithm) are not suitable for this problem, since both are based on the discounted-reward criterion. Therefore, we adopt the average-reward criterion in this paper. Based on the above considerations, we propose the Q-greedyUCB algorithm, shown in procedural form in Algorithm 1. We mix the ε-greedy policy with the UCB policy in lines 5-9. In other words, the agent randomly chooses an action with probability ε_t at time t and executes the UCB policy with probability 1 − ε_t. The parameter ε_t is usually set to decrease in t to diminish exploration (e.g., ε_t proportional to 1/t). The UCB policy in Q-learning can be viewed as a special case of Q-greedyUCB with ε_t = 0.
Q-greedyUCB uses the temporal-difference method to update the action-value function. The Q-greedyUCB update of the Q-function is given by

Q(s_t, a_t) ← (1 − γ_k) Q(s_t, a_t) + γ_k [ R(s_t, a_t) + b_k + max_{a′} Q(s_{t+1}, a′) − max_{v ∈ A(i)} Q(i, v) ],    (9)
where s_{t+1} is the new state after action a_t is executed in the present state s_t, and a′ is the action to be selected in s_{t+1}. Note that the confidence bonus term b_k is added to the update target compared with standard Q-learning. We remind the reader that k = N_t(s, a) represents the number of times that the state-action pair (s, a) has been experienced up to time slot t. γ_k ∈ [0, 1] represents the step size when the state-action pair (s_t, a_t) is visited for the k-th time; γ_k = 0 means that the agent learns nothing, while γ_k = 1 means that the agent considers only the most recent information and discards its previous estimate. In RL, the choice of the step size is crucial. In general, the step size is set to some constant value; however, with a constant step size it is difficult to obtain an optimal strategy even when the problem is stationary [26, 28]. Therefore, a feasible method is to let the step size decrease over time. In the standard Q-learning algorithm, the step size depends on the current time t (e.g., γ = 1/t), whereas for the Q-greedyUCB algorithm we need to account for how many times each state-action pair has been selected up to time t. Therefore, we set the step size as γ_k = φ/(k + θ) for some positive constants φ and θ. In Section 3.3, we will discuss the properties the step size must satisfy to guarantee convergence. The term max_{v∈A(i)} Q(i, v) denotes the estimate of the optimal average reward that is expected to converge in each iteration, where i represents the reference state. Since any state in the state space can be used as a reference state, we choose i = 0 as the reference state in this paper. In line 14 of Algorithm 1, Q̂(s_t, a_t) keeps the historical minimum value of the Q-function.
Our Q-greedyUCB algorithm decides the best action to take in a given state by estimating the value of the state-action pair Q(s, a) (i.e., the Q-value). Each Q-value is stored in a matrix of size S × A, which we call the Q-table. Before the algorithm starts learning, we need to set an arbitrary initial value for each Q(s, a); in this paper, we initialize all Q(s, a) = 0. At the beginning of each time slot, the agent chooses an action based on our proposed exploration policy, which corresponds to lines 5-9 of Algorithm 1. The agent executes the action, receives a reward, enters a new state, and then the Q-table is updated. After an adequate number of iterations, each Q-value no longer changes; in other words, all Q-values in the Q-table converge to their optimal values. Let π* denote an optimal scheduling policy. We can obtain an optimal strategy from the optimal Q-values as

π*(s) = arg max_{a ∈ A(s)} Q*(s, a).
Algorithm 1. Q-greedyUCB.
1: Parameters: δ, σ, ε_t.
2: Q(s, a) ← 0, Q̂(s, a) ← 0, N(s, a) ← 0, ∀s ∈ S, a ∈ A(s).
3: for t = 1, 2, ... do
4:   Generate a uniform random number r ∈ (0, 1).
5:   if r < ε_t then
6:     Choose an action a_t at random.
7:   else
8:     Choose an action a_t ← arg max_{a′} Q(s_t, a′).
9:   end if
10:  Observe the reward R(s_t, a_t) and move to state s_{t+1}.
11:  N_t(s_t, a_t) ← N_t(s_t, a_t) + 1.
12:  k ← N(s_t, a_t), ι(k) ← ln(SAkt/δ), b_k ← σ√(ι(k)/k).
13:  Update Q(s_t, a_t) according to Eq. (9).
14:  Q̂(s_t, a_t) ← min(Q̂(s_t, a_t), Q(s_t, a_t)).
15: end for
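Algorithm 1 translates into roughly the following Python sketch; the environment interface (`env.reset`, `env.step`), the ε-decay schedule, and the step-size constants `phi` and `theta` are assumptions made for illustration, and the feasible-action restriction of Section II is omitted for brevity.

```python
import math
import numpy as np

def q_greedy_ucb(env, n_states, n_actions, T,
                 sigma=4.0, delta=0.01, phi=1.0, theta=1.0,
                 ref_state=0, seed=0):
    """Sketch of Algorithm 1 (Q-greedyUCB) under the average-reward criterion."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))      # action-value estimates
    Q_hat = np.zeros((n_states, n_actions))  # historical minimum of Q (line 14)
    N = np.zeros((n_states, n_actions), dtype=int)

    s = env.reset()
    for t in range(1, T + 1):
        eps = 1.0 / t                        # assumed decaying exploration rate
        if rng.random() < eps:
            a = int(rng.integers(n_actions))     # random exploration
        else:
            a = int(np.argmax(Q[s]))             # greedy w.r.t. current Q-values

        r, s_next = env.step(a)              # observe reward and next state
        N[s, a] += 1
        k = N[s, a]
        iota = math.log(n_states * n_actions * k * t / delta)
        b_k = sigma * math.sqrt(iota / k)    # confidence bonus
        gamma_k = phi / (k + theta)          # step size gamma_k = phi / (k + theta)

        # Relative-value update of Eq. (9), with the reference state i = ref_state.
        target = r + b_k + np.max(Q[s_next]) - np.max(Q[ref_state])
        Q[s, a] = (1.0 - gamma_k) * Q[s, a] + gamma_k * target
        Q_hat[s, a] = min(Q_hat[s, a], Q[s, a])
        s = s_next

    policy = np.argmax(Q, axis=1)            # greedy policy extracted from Q
    return Q, Q_hat, policy
```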
For the UCB exploration policy, the performance of the Q-greedyUCB algorithm is measured by the cumulative regret, defined as the difference between the total expected reward obtained by an optimal policy π* over T rounds (or iterations) and the total expected reward obtained by the online algorithm over the same T iterations. Define R̄^π_T as the average expected reward of a policy π up to time T,

R̄^π_T = (1/T) Σ_{t=1}^{T} E[R(s_t, a_t)],    (11)

and we have Regret(T) = T (R* − R̄^π_T),
where R* is the optimal average reward per iteration obtained by an optimal policy π*. We will show in Section IV that the Q-greedyUCB algorithm obtains a smaller total regret than Q-learning with ε-greedy, R-learning, and the ARL algorithm.
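Given a trace of per-slot rewards and the optimal average reward R*, the cumulative regret can be tracked as below; `rewards` and `r_star` are assumed inputs (e.g., produced by the sketch after Algorithm 1 and by policy iteration, respectively).

```python
import numpy as np

def cumulative_regret(rewards, r_star):
    """Regret(T) = T * R* - sum_{t=1}^{T} R(s_t, a_t), evaluated for every T."""
    rewards = np.asarray(rewards, dtype=float)
    T = np.arange(1, len(rewards) + 1)
    return T * r_star - np.cumsum(rewards)
```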
Now, we discuss the convergence of the Q-greedyUCB algorithm. For the average-reward problem, the authors of [29] proposed a general framework for proving convergence based on the Ordinary Differential Equation (ODE) method, i.e., a convergence theorem for asynchronous algorithms. The theorem states that an asynchronous algorithm with the average-reward criterion converges to an optimal policy if the following two assumptions hold. We first introduce these two crucial assumptions from [29], and then prove that our proposed algorithm satisfies them.
Assumption 1. The step size γ_k satisfies the following:
1. γ_k is an ideal tapering step size, i.e.,
(a) Σ_k γ_k = ∞ and γ_k → 0 as k → ∞;
(b) γ_{k+1} ≤ γ_k for sufficiently large k;
(c) there exists g ∈ (0, 1) such that sup_k γ_{⌊gk⌋}/γ_k < ∞;
(d) letting Γ_k = Σ_{i=1}^{k} γ_i, for all g ∈ (0, 1), Γ_{⌊gk⌋}/Γ_k → 1 as k → ∞.
2. There exists Δ > 0 such that each state-action pair (s, a) is updated infinitely often with

lim inf_{t→∞} N_t(s, a)/t ≥ Δ,

where N_t(s, a) is the number of times that the state-action pair (s, a) has been selected prior to time t.
Assumption 2. A scalar real-valued function f has the following properties:
1. f is Lipschitz continuous, i.e., |f(x) − f(y)| ≤ L‖x − y‖ for some constant L > 0 and all x, y;
2. f(x + ru) = f(x) + r f(u), for all x and all scalars r;
3. f(u) < 0,
where u = (1, ..., 1) is a vector whose entries are all 1.
Based on the above assumptions, we have the following.
Theorem 1. If γ_k = φ/(k + θ) for some positive constants φ and θ, then for Algorithm 1, (1) Assumption 1 holds, (2) the sequence Q(s_t, a_t) is bounded, and (3) Q converges to Q*.
Proof. The proof is based on the convergence theorem for asynchronous algorithms proposed in [29]. We set the step size γ_k = φ/(k + θ) for some positive constants φ and θ, and first prove that this step size satisfies Assumption 1. Based on Assumption 1, we can use the analysis of Borkar [30] to establish the relationship between Eq. (9) and an ODE, and we then prove that this ODE satisfies Assumption 2.
In Eq. (5), λ can be considered a trade-off factor between latency and power consumption. Let λ* denote an optimal Lagrange multiplier that satisfies λ* = inf{λ ≥ 0 : E^{π*_λ} ≤ E_th}, where π*_λ denotes an optimal policy under a given λ and E^{π*_λ} is the average power consumption for policy π*_λ. Then, the average power consumption E^{π_λ}_T of a policy π_λ under a given λ up to time T can be expressed as E^{π_λ}_T = (1/T) Σ_{t=1}^{T} e_t.
Therefore, as discussed in [31], the Lagrange multiplier that satisfies problem P1 can be obtained by using the Robbins-Monro algorithm [32], which is a stochastic approximation algorithm. The Lagrange multiplier is then updated by

λ_{n+1} = max(0, λ_n + ρ (E^{π_{λ_n}}_T − E_th)),    (17)
where the step size can be set as ρ = 1/n, and n represents the iteration index.
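The outer loop of Figure 4 that searches for λ* can be sketched as follows; `run_and_measure_power` is a hypothetical helper standing for one inner run of Algorithm 1 that returns the measured average power, and the stopping threshold follows the criterion used in Section IV.

```python
def update_lambda(lam: float, avg_power: float, e_th: float, n: int) -> float:
    """Robbins-Monro update of Eq. (17) with step size rho = 1/n,
    projected onto lambda >= 0 (sketch)."""
    rho = 1.0 / n
    return max(0.0, lam + rho * (avg_power - e_th))

# Illustrative outer loop (run_and_measure_power is a hypothetical helper):
# lam, n = 1.0, 1
# while True:
#     avg_power = run_and_measure_power(lam)
#     new_lam = update_lambda(lam, avg_power, e_th=3.3, n=n)
#     if abs(new_lam - lam) <= 1e-4:
#         break
#     lam, n = new_lam, n + 1
```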
Based on the above analysis, we can obtain the optimal scheduling policy for a given power constraint E_th. Also, by changing the value of λ, we obtain different average power and average delay pairs. Therefore, Algorithm 1 and Eq. (17) can be used to obtain an optimal scheduling policy satisfying the power constraint. The overall process of obtaining an optimal Lagrange multiplier and an optimal scheduling policy that satisfies E^{π*_λ} ≤ E_th is shown in Figure 4.
Figure 5. Optimal scheduling policies of the ARL, Q-learning, R-learning, policy iteration, and Q-greedyUCB algorithms (σ = 4, δ = 0.01, and λ ≈ 1). In Figure 5a, α = [0.45, 0.05, 0.05, 0.35, 0.1]. In Figure 5b, α = [0.4, 0.05, 0.1, 0.3, 0.1, 0.05].
In this section, we present the simulation results. To evaluate the performance of the Q-greedyUCB algorithm, we implement a MATLAB simulator with different input parameters. There are two experiment cases: i) one that changes the maximum buffer size B, the maximum number of packets in each data arrival M, and the maximum number of transmitted packets C under a fixed arrival rate; ii) another that changes the arrival rate distribution α under fixed B, M, and C.
For the simulation scenario, we consider the cases C = 4 and C = 5. The case C = 4 means that we can adopt four optional modulations to transmit 1, 2, 3, or 4 packets in a time slot, respectively. It is impossible to obtain perfect channel state information at the transmitter in practical wireless communication systems; generally, the channel state information is quantized into discrete values [10]. Therefore, we set H = {0.5, 1, 1.5, 2} and ω = [0.35, 0.25, 0.28, 0.12]. To verify that the Q-greedyUCB algorithm can obtain an optimal scheduling policy, we run the policy iteration (PI) algorithm proposed in [5]. We also compare our algorithm with the standard Q-learning proposed in [28], R-learning proposed in [21], and the ARL in [22]. We set E_th = 3.3, and the power consumption is computed by the power function described in Section III. In our simulations, we set the stopping criterion as |λ_{n+1} − λ_n| ≤ 0.0001. We find that the optimal strategy generated by Q-greedyUCB is exactly the same as the optimal strategies generated by PI, ARL, R-learning, and Q-learning; an optimal scheduling policy is shown in Figure 5.
Figure 6. The average reward of Q-learning, R-learning, Q-greedyUCB, and ARL over time slots, and the optimal Average Reward (AR) from policy iteration (PI) (σ = 4, δ = 0.01, and λ ≈ 1. In Figure 6a, α = [0.45, 0.05, 0.05, 0.35, 0.1]. In Figure 6b, α = [0.4, 0.05, 0.1, 0.3, 0.1, 0.05]).
Based on the input parameters in Figure 5, the average reward curves during the learning period for the ARL, R-learning, Q-learning, and Q-greedyUCB algorithms, obtained by Eq. (11), are shown in Figure 6. The average reward eventually converges to -4.9 and -5.5 (i.e., the optimal average rewards) for all algorithms in the two simulated scenarios. From the trend of the curves in Figure 6, we can conclude that the average rewards of Q-greedyUCB, Q-learning, R-learning, and ARL all eventually meet the optimal average reward line over time. However, the simulation results demonstrate that the regret of Q-greedyUCB is smaller than that of Q-learning, R-learning, and ARL. At time step 1.0×10^6, the regret of Q-greedyUCB is about 4% of the regret of Q-learning; in other words, the performance of Q-greedyUCB is about 25 times better than that of Q-learning, R-learning, and ARL in terms of regret. Also, when the time step exceeds 10^6, the regret of the Q-greedyUCB algorithm hardly increases further, which means that the agent always chooses an optimal scheduling policy after sufficient iterations and does not make unnecessary explorations.
Figure 7. The average reward of Q-greedyUCB at different arrival rates (B = 10, M = 4, C = 5, σ = 4, δ = 0.01, and λ ≈ 1).
The average reward curves generated by Q-greedyUCB are shown in Figure 7 for α = [0.7, 0.1, 0.1, 0.05, 0.05], [0.25, 0.5, 0.1, 0.1, 0.05], [0.25, 0.25, 0.1, 0.25, 0.15], and [0.05, 0.3, 0.2, 0.1, 0.35] (the other parameters remain unchanged). The average data arrival rates corresponding to these distributions are 0.65, 1.2, 1.8, and 2.4, respectively. The results demonstrate that the average reward of Q-greedyUCB reaches the optimal reward under the different arrival distributions. Also, the average reward decreases as the average data arrival rate increases, since the system workload grows with ᾱ.
The Q-greedyUCB algorithm can also obtain an optimal scheduling policy when the environment changes dynamically over time, as shown in Figure 8. For Q-greedyUCB, ARL, R-learning, and Q-learning, we change the arrival distribution from α = [0.25, 0.25, 0.1, 0.25, 0.15] to α = [0.25, 0.5, 0.1, 0.1, 0.05] at time step 1.0×10^6. We can see that Q-greedyUCB outperforms ARL, R-learning, and traditional Q-learning in terms of the average reward after the traffic arrival rate has changed. Moreover, after α is changed, the Q-greedyUCB algorithm has a smaller regret than Q-learning, ARL, and R-learning.
Figure 8. The average reward curve in a dynamic environment, in which the traffic arrival rate is changed at time step 1.0×10^6 (B = 10, M = 4, C = 5, σ = 4, δ = 0.01, and λ ≈ 1).
In this paper, to find an optimal scheduling policy that minimizes the delay under a given power constraint in a communication system, we modeled a single-queue single-server communication system. To overcome the disadvantage of traditional RL algorithms, which incur high regret during the training process, we proposed a novel RL algorithm called Q-greedyUCB. In the proposed algorithm, we combine the ε-greedy and UCB algorithms as the exploration policy for more efficient learning. We first mathematically proved the convergence of the Q-greedyUCB algorithm. Then, to validate that the proposed algorithm obtains an optimal scheduling policy, we compared it via simulation with the PI algorithm proposed in [5], the Q-learning algorithm with ε-greedy, the R-learning algorithm in [21], and the ARL algorithm proposed in [22]. The simulation results also show that our method guarantees little performance loss during the learning process compared with Q-learning with ε-greedy, R-learning, and ARL.
Although we have improved Q-learning, some limitations remain. The ARL, R-learning, Q-learning, and Q-greedyUCB algorithms are tabular solution methods. The tabular method is suitable for sequential decision problems in which the state and action spaces are relatively small. If the dimensions of the state and action spaces become large, this method fails or takes a long time to converge due to the curse of dimensionality [26]. In future work, we will use function approximation in our reinforcement learning method to tackle this problem. Although we have shown that Q-greedyUCB eventually converges to an optimal policy, we have not mathematically shown that its convergence speed is faster than that of other methods. In future work, we will also analyze Q-greedyUCB in terms of the convergence rate. The impact of the step size γ_k on the convergence speed will also be studied. Furthermore, we will extend our system model and mathematical results to more generalized and realistic settings (e.g., multiple channel states, general traffic arrival models, etc.).
This work was supported by the research fund of Hanyang University (HY-2019-N) and by the National Key Research & Development Program under Grant 2018YFA0701601.