

        A Heterogeneous Information Fusion Deep Reinforcement Learning for Intelligent Frequency Selection of HF Communication

China Communications, 2018, Issue 9

Xin Liu, Yuhua Xu, Yunpeng Cheng*, Yangyang Li, Lei Zhao, Xiaobo Zhang

1 Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, Guilin 541007, China

        2 College of Communication Engineering, Army Engineering University of PLA, Nanjing 210007, China

Abstract: High-frequency (HF) communication is one of the essential communication methods for military and emergency applications. However, selecting a communication frequency channel is always difficult because of the crowded spectrum, the time-varying channels, and malicious intelligent jamming. The existing frequency hopping, automatic link establishment and some newer anti-jamming technologies cannot completely solve these problems. In this article, we adopt deep reinforcement learning to address this intractable challenge. First, the combination of the spectrum state and the channel gain state is defined as the composite environmental state, and the Markov property of the defined state is analyzed and proved. Then, considering that the spectrum state and the channel gain state are heterogeneous information, a new deep Q network (DQN) framework is designed, which contains multiple sub-networks to process the different kinds of information. Finally, to improve the learning speed and efficiency, the optimization targets of the corresponding sub-networks are carefully designed, and a heterogeneous information fusion deep reinforcement learning (HIF-DRL) algorithm is designed for frequency selection. Simulation results show that the proposed algorithm performs well in channel prediction, jamming avoidance and frequency channel selection.

Keywords: HF communication; anti-jamming; intelligent frequency selection; Markov decision process; deep reinforcement learning

        I. INTRODUCTION

High-frequency (HF) communication [1] plays an important role in military and emergency communications because of its advantages such as long-distance communication, strong survivability, high mobility and convenient deployment. However, the selection of the HF communication frequency faces great challenges for three reasons: 1) the time-varying characteristics of HF channels caused by the fluctuating altitude and density of the ionosphere; 2) severe interference arising from the contradiction between the large number of users and the limited frequency resources; 3) potential malicious and intelligent jamming devices in the field of military communications.

In view of the above challenges, frequency hopping (FH) [2-5] and automatic link establishment (ALE) [6-10] are the main methods adopted in current HF communication systems. FH technology resists channel fading and jamming by changing the communication frequency pseudo-randomly. To further enhance the adaptability of FH, adaptive frequency hopping (AFH) technologies [3, 4] and intelligent frequency hopping technologies [5] have attracted much attention in HF communication. However, these studies do not consider the existence of intelligent jamming.


Another widely applied and proven effective method is ALE, which automatically establishes communication on a channel selected from a set of channels. Up to now, ALE has evolved from the second generation (2G) ALE [6] and the third generation (3G) ALE [7, 8] to the fourth generation (4G) ALE [9, 10]. Especially in 4G ALE, cognitive technology has been widely adopted [10]. Regardless of the generation, the basic idea of ALE is to sound the channels first and then select an appropriate channel for communication. As channel sounding takes a long time, ALE may not work normally when the environment changes rapidly, e.g., when there is a tracking jamming device near the receiver.

To cope with rapid changes of the environment, a number of existing studies, e.g., opportunistic spectrum access [11, 12], dynamic spectrum access [13] and spatial-temporal opportunity detection [14], have considered the spectrum dynamics caused by primary users or other competing users. However, these dynamic characteristics are obviously different from those caused by malicious jammers. With the development of artificial intelligence, efficient online learning techniques such as reinforcement learning [15] are gradually being used in the field of anti-jamming communication. For example, Q-learning is often adopted to combat sweeping and tracking jamming patterns [16, 17]. However, Q-learning cannot tackle more complex jamming patterns, such as intelligent jamming, because of its weak ability to handle complex environmental states. Motivated by the deep reinforcement learning (DRL) technique for coping with complex environmental states in [18, 19], an anti-jamming deep reinforcement learning (ADRL) algorithm was proposed in [20], which innovatively uses the spectrum waterfall as the input of a deep neural network and achieves the optimal anti-jamming decision under various intelligent jamming patterns. Unfortunately, ADRL cannot be directly applied to HF communication environments either, as the time-varying characteristics of the channels are not considered.

In this article, we consider the frequency channel selection problem in the complex HF communication environment, in which the channel states are time-varying and the spectrum environment is full of mutual interference and malicious jamming. To solve this intractable challenge, first, the combination of the spectrum state and the channel gain state is defined as the composite environmental state, and a proof that the defined state is a Markov process is given. Then, considering that the spectrum state and the channel gain state are heterogeneous information, a new deep Q network (DQN) framework is designed, which contains multiple sub-networks to process inputs of different dimensions. Finally, to improve the learning speed and efficiency, the optimization targets of the corresponding sub-networks are carefully designed, and a heterogeneous information fusion deep reinforcement learning (HIF-DRL) algorithm is designed for frequency selection. Simulation results show that the proposed algorithm performs well in channel prediction and jamming avoidance. The main contributions are summarized as follows:

• A composite environmental state including the spectrum state and the channel gain state is defined, which is able to describe both the correlation of the HF channel gain state in the time and frequency dimensions and the spectrum dynamics due to jamming and the dynamic frequency decisions of other users. In addition, to show that the frequency selection in this article is a Markov decision process (MDP), a detailed proof that the composite state is a Markov process is given.

• A heterogeneous information fusion DQN (HIF-DQN) framework is designed to cope with heterogeneous information. As inputs of the neural network, the spectrum state and the channel gain state are two different types of information, hence two different processing structures are designed. A convolutional neural network (CNN) is used to process the spectrum information in order to extract the features of jamming and interference, and a fully connected neural network is used to process the channel gain state so as to predict the channel gain. A subsequent fusion network achieves the comprehensive judgment of the channels.

• A HIF-DRL algorithm for heterogeneous information fusion learning is proposed. By designing different reward values and optimization targets for each sub-network of the HIF-DQN, the direction of training becomes more targeted, and the speed and efficiency of learning are improved.

The rest of the article is organized as follows. In Section II, we present the transmission and jamming models and the Markov decision process of the frequency selection problem in HF communication. In Section III, we present the intelligent frequency selection scheme from three aspects: system framework, neural network structure and learning algorithm. In Section IV, simulation results are presented. Finally, we discuss and draw conclusions in Section V.

        II. SYSTEM MODEL

        2.1 Transmission and jamming models

        Fig. 1. Transmission and jamming models.

We consider the transmission of one user (a legitimate HF transmitter and its receiver) in the presence of M jammers and N interference users (other legitimate transmitters), as shown in figure 1. At time slot k, under the guidance of the receiver (the guidance information can be fed back through ACK or a reliable control link), the transmitter chooses a frequency channel denoted by $a_k \in \{1,2,\ldots,A\}$ to send signals with a given power p. Considering the time-varying characteristics of HF channels, we use $g_k(a_k)$ to represent the channel power gain from the transmitter to the receiver at channel $a_k$. Similarly, we use $g_k^{J,m}$ and $g_k^{I,n}$ to represent the gains from jammer m and interference user n to the receiver respectively, where $a_k^{J,m} \in \{1,2,\ldots,A\}$ and $a_k^{I,n} \in \{1,2,\ldots,A\}$ denote the intelligent or fixed actions of jammer m and interference user n. Letting $p_m^J$ denote the transmission power of jammer m, the sum of jamming power received by the user can be expressed as $J_k = \sum_{m=1}^{M} p_m^J g_k^{J,m}\, \delta(a_k = a_k^{J,m})$, where $\delta(x)$ is an indicator function that equals 1 if x is true, and 0 otherwise. Letting $p_n^I$ denote the transmission power of interference user n, the sum of interference power received by the user can be expressed as $I_k = \sum_{n=1}^{N} p_n^I g_k^{I,n}\, \delta(a_k = a_k^{I,n})$. Hence, the received SINR of the user can be expressed as

$$\beta_k = \frac{p\, g_k(a_k)}{\sigma(a_k) + J_k + I_k}, \tag{1}$$

where $\sigma(a_k)$ is the receiving noise power at channel $a_k$. Letting $\beta_{th}$ denote the required SINR threshold for successful transmission, the utility function of the user is defined as

$$u_k = \begin{cases} \mu, & \beta_k \ge \beta_{th},\\ -L_J, & \text{the user is jammed},\\ -L_I, & \text{the user is interfered},\end{cases} \tag{2}$$

where μ is the transmission rate of the user, and $L_J$ and $L_I$ represent the losses of being jammed and being interfered respectively. The impact on the user's own transmission is the same whether it is jammed or interfered, but the impact on other legitimate users is different, so $L_J$ and $L_I$ take different values in this article.
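As a concrete illustration, the following minimal Python sketch evaluates equations (1) and (2) for a single slot. The function name and the default threshold value are ours, and mapping a failed transmission that is neither jammed nor interfered to zero utility is an assumption, since equation (2) only distinguishes the three cases above.

```python
def slot_sinr_and_utility(a_k, p, g_ak, sigma_ak,
                          jam_acts, jam_pows, jam_gains,
                          int_acts, int_pows, int_gains,
                          beta_th=2.0, mu=1.0, L_J=0.5, L_I=1.0):
    # Sum of jamming/interference power landing on channel a_k; the
    # equality test plays the role of the indicator function delta(x).
    J = sum(pj * gj for aj, pj, gj in zip(jam_acts, jam_pows, jam_gains) if aj == a_k)
    I = sum(pi * gi for ai, pi, gi in zip(int_acts, int_pows, int_gains) if ai == a_k)
    beta = p * g_ak / (sigma_ak + J + I)          # equation (1)
    if beta >= beta_th:
        u = mu                                    # successful transmission
    elif J > 0:
        u = -L_J                                  # loss caused by jamming
    elif I > 0:
        u = -L_I                                  # loss caused by interference
    else:
        u = 0.0                                   # assumed: plain channel failure
    return beta, u
```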

        2.2 Markov decision process

The frequency selection problem in the HF transmission environment mainly faces two challenges, namely the unknown dynamics of the available spectrum and the unknown dynamics of the channel gain. These changes, however, often have some temporal or frequency correlation; otherwise we could not learn to predict the optimal channel for the next moment. For example, the change process of jamming actions, which is the main factor of spectrum dynamics, was first modeled as a Markov process in [16] and [17], and then reinforcement learning was adopted to learn the best anti-jamming actions. However, the optimal frequency channel selection problem in the HF band is much more complicated, as the channel state is not related only to the channel state at the previous time. Intelligent jammers or other intelligent legitimate transmitters may select their actions by observing a long history of channel states. Similar to [20], we extend the temporal dimension of the channel state to grasp the changing characteristics of the environment at the macro level. First, we define the instantaneous environmental state as $s_k = \{r_k, g_k\}$, where $r_k = \{r_{1,k}, r_{2,k}, \ldots, r_{A,k}\}$ denotes the receiving spectrum of the receiver (the spectrum resolution is the channel bandwidth) and $g_k = \{g_{1,k}, g_{2,k}, \ldots, g_{A,k}\}$ denotes the channel power gain from the transmitter to the receiver at each channel. According to the definitions in the transmission and jamming models, the spectrum value at channel l can be expressed as

$$r_{l,k} = p\, g_k(a_k)\, \delta(l = a_k) + \sum_{m=1}^{M} p_m^J g_k^{J,m}\, \delta(l = a_k^{J,m}) + \sum_{n=1}^{N} p_n^I g_k^{I,n}\, \delta(l = a_k^{I,n}) + \sigma(l). \tag{3}$$

Then, by combining the current and historical instantaneous environmental states, the time-expansion environmental state is defined as

$$S_k = \{R_k, G_k\},\quad R_k = \{r_k, r_{k-1}, \ldots, r_{k-K+1}\},\quad G_k = \{g_k, g_{k-1}, \ldots, g_{k-K+1}\},$$

where K denotes the number of historical states of backtracking, the matrix $R_k$ reflects the time-expansion spectrum state, and the matrix $G_k$ reflects the time-expansion channel gain state.
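A sliding-window buffer is enough to maintain $S_k$. The sketch below (names are ours) keeps the newest observation in the first row, so that the front row of $G_k$ is $g_k$, matching the Front(·) notation used in Section III.

```python
import numpy as np
from collections import deque

A, K = 20, 40   # number of channels and backtracking depth (Section IV)

r_hist = deque(maxlen=K)   # last K receiving-spectrum vectors r_k
g_hist = deque(maxlen=K)   # last K channel-gain vectors g_k

def push_observation(r_k, g_k):
    r_hist.appendleft(np.asarray(r_k))   # newest row first
    g_hist.appendleft(np.asarray(g_k))

def time_expanded_state():
    """Return S_k = {R_k, G_k}; each is a K x A matrix whose rows are the
    current and K-1 historical instantaneous states."""
    assert len(r_hist) == K, "need K observations before building S_k"
    return np.stack(r_hist), np.stack(g_hist)
```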

Lemma 1: If K is big enough, the stochastic process represented by the matrix $G_k$ is a Markov process.

Proof: Let $G_k, G_{k-1}, \ldots, G_{k-\infty}$ represent the historical values of the matrix G; then the transition probability from $G_k$ to $G_{k+1}$ can be expressed as

$$P(G_{k+1} \mid G_k, G_{k-1}, \ldots, G_{k-\infty}).$$

When the backtracking window of K slots is much longer than the coherence time of the channel, the correlation of the channel gains at two time points more than K slots apart tends to 0. As the channel gain is a complex Gaussian random variable (for which uncorrelatedness and independence are equivalent), $g_{k+1}$ is independent of $\{g_{k-K}, \ldots, g_{k-\infty}\}$ once $\{g_k, g_{k-1}, \ldots, g_{k-K+1}\}$ is determined. Thus the transition probability can be rewritten as

$$P(G_{k+1} \mid G_k, G_{k-1}, \ldots, G_{k-\infty}) = P(G_{k+1} \mid G_k),$$

so the stochastic process $G_k$ is a Markov process.

Lemma 2: If K is big enough, the stochastic process represented by the matrix $S_k$ is a Markov process.

Proof: Let $S_k, S_{k-1}, \ldots, S_{k-\infty}$ represent the historical sequence of the state S; then the transition probability from $S_k$ to $S_{k+1}$ can be expressed as

$$P(S_{k+1} \mid S_k, S_{k-1}, \ldots, S_{k-\infty}).$$

According to Lemma 1, once $G_k$ is determined, the transition probability to $G_{k+1}$ has no correlation with $G_{k-1}, \ldots, G_{k-\infty}$, and $r_{k+1}$ is calculated directly from $g_{k+1}$ and $a_k$; hence we have

$$P(S_{k+1} \mid S_k, S_{k-1}, \ldots, S_{k-\infty}) = P(S_{k+1} \mid R_k, R_{k-1}, \ldots, R_{k-\infty}, G_k).$$

According to equation (3), $r_{k+1}$ is also affected by the actions of the jammers and the interference users. Supposing that those opponents use only a limited historical period of states (shorter than K), $r_{k+1}$ is independent of $r_{k-K}, \ldots, r_{k-\infty}$ once $r_k, \ldots, r_{k-K+1}$ is determined. Therefore, the transition probability can be written as

$$P(S_{k+1} \mid S_k, S_{k-1}, \ldots, S_{k-\infty}) = P(S_{k+1} \mid R_k, G_k) = P(S_{k+1} \mid S_k).$$

Thus, the stochastic process represented by the matrix $S_k$ is a Markov process.

Based on the definition of the user utility and the environmental state, and given that the defined environmental state is a Markov process, the problem of intelligent frequency selection can be modeled as an MDP, where $S \in \{S_1, S_2, \ldots\}$ is the time-expansion environmental state, $a \in \{1,2,\ldots,A\}$ is the frequency action of the user, $P(S' \mid S, a)$ is the transition probability from the current state S to S′ when taking action a, and u is the immediate reward.

The aim of the agent is to interact with the environment by selecting actions in a way that maximizes the discounted future reward, defined as $U_k = \sum_{j=0}^{\infty} \gamma^{j} u_{k+j}$, where γ is the discount factor. The corresponding optimal action-value function is defined as

$$Q^*(S,a) = \max_{\pi} \mathbb{E}\left[U_k \mid S_k = S, a_k = a, \pi\right],$$

in which π is a policy mapping states to distributions over actions. The optimal action-value function obeys an important identity known as the Bellman equation, which is expressed as

$$Q^*(S,a) = \mathbb{E}_{S'}\left[u + \gamma \max_{a'} Q^*(S',a') \,\middle|\, S, a\right].$$

As in most reinforcement learning algorithms, the action-value function is iteratively updated based on the Bellman equation, and this kind of value iteration converges to the optimal action-value function, $Q_i \to Q^*$ as $i \to \infty$ [18].
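For intuition, the sketch below runs this value iteration on a small tabular MDP with known transition probabilities. In the actual problem P is unknown and the state space is high-dimensional, which is exactly why the DQN approximation of Section III is needed.

```python
import numpy as np

def q_value_iteration(P, u, gamma=0.9, tol=1e-8):
    """Tabular value iteration on the Bellman optimality equation.
    P[s, a, s2] is the transition probability, u[s, a] the expected reward.
    Returns Q_i, which converges to Q* as i -> infinity."""
    S, A = u.shape
    Q = np.zeros((S, A))
    while True:
        V = Q.max(axis=1)                                   # max_a' Q(S', a')
        Q_new = u + gamma * P.reshape(S * A, S).dot(V).reshape(S, A)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```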

        III. INTELLIGENT FREQUENCY SELECTION SCHEME

Similar to the anti-jamming communication scheme in [20], deep reinforcement learning is used to solve the frequency selection problem, and the system framework is shown in figure 2. First of all, the basic working process of the system needs to be briefly described. Assuming that the receiver and transmitter have agreed on the same channel action $a_k$ through ACK or the control link, the next step is to calculate the current reward $u_k$ and observe the spectrum $r_k$ and the channel gain state $g_k$. Then $R_k$ and $G_k$ are constructed by combining $r_k$ and $g_k$ with their respective historical information. Finally, a reinforcement learning process with the environment state and reward as inputs is implemented by the DQN to make the next frequency decision. In the above process, the spectrum analysis and reward calculation can be obtained through observation at the receiving end, but the channel gain estimation needs the cooperation of the transmitter. Therefore, at the transmitter side, in addition to sending the normal communication signal, it is necessary to transmit probe signals on other channels at the same time. In order to avoid interference to other users, a direct sequence spread spectrum (DSSS) waveform with low power is used for the probe signal. Moreover, the transmitter does not probe all channels, but randomly selects a subset of channels $b_k = \{b_{1,k}, b_{2,k}, \ldots, b_{PN-1,k}\}$ in the communication band, where $b_{q,k}$ represents the index of the q-th additional probe channel at slot k, and PN denotes the total number of probed channels.

In this way, at each moment some channels are not probed: in the graphical representation of $g_k$ in figure 2, a colored square indicates a probed channel, while a white one indicates an unprobed channel. For channels without probing, the channel gain estimation is interpolated automatically through neural network learning. In particular, no additional probe signal is needed when PN = 1, because the communication signal itself can be used as a probe signal.
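A possible implementation of the random probing step is sketched below (function names are ours); unprobed channels are marked NaN so that the CGPN loss in the next section can simply skip them.

```python
import numpy as np

def probe_channels(a_k, A=20, PN=5, rng=np.random):
    """Pick the PN-1 extra probe channels b_k; the communication channel
    a_k itself always acts as one probe."""
    others = [c for c in range(A) if c != a_k]
    return [a_k] + list(rng.choice(others, size=PN - 1, replace=False))

def observed_gains(true_g, probed):
    """Gain vector with NaN on unprobed channels; the CGPN fills these in
    implicitly, since its loss is computed on probed channels only."""
    g_obs = np.full_like(true_g, np.nan, dtype=float)
    g_obs[probed] = true_g[probed]
    return g_obs
```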

        Fig. 2. The system framework for intelligent frequency selection.

For the whole system framework introduced above, the DQN is undoubtedly the core module. However, because the input S contains two kinds of heterogeneous state information, namely the channel gain state G and the spectrum state R, directly adopting the traditional DQN structure is potentially problematic. Intuitively, these two kinds of raw input need to be processed differently, so that the key information for the decision can be extracted more specifically. For example, the spectrum state R mainly provides historical information about the jamming or interference on each channel, while the gain state G describes the changing rule of the transmission gain of each channel.

Based on the above considerations, as shown in figure 3, a heterogeneous information fusion DQN (HIF-DQN) is designed to solve the problem of optimal frequency channel selection in a time-varying environment with intelligent jamming, and the specific structural parameters of each layer are shown in table 1. Note that the entire network contains two kinds of layers: convolutional layers (Conv Layer) and fully connected layers (FC Layer). According to their specific functions, the HIF-DQN can be divided into three sub-networks: the anti-jamming network (AJN), the channel gain prediction network (CGPN) and the fusion network (FN). Correspondingly, after the network parameters are well trained, the workflow of the optimal frequency decision is divided into three steps:

• 1) The spectrum state $R_k$ in the environment state $S_k$ is input into the anti-jamming network, which estimates the jamming/interference evaluation vector $X_k = \{X(R_k,1), \ldots, X(R_k,A)\}$, where $X(R_k,a)$ is the action-value function of the anti-jamming sub-MDP, introduced in detail in the following paragraphs.

• 2) The channel gain state $G_k$ in the environment state $S_k$ is input into the channel gain prediction network, which predicts the channel gain estimation vector $Y_k = \{Y(G_k,1), \ldots, Y(G_k,A)\}$, where $Y(G_k,a)$ denotes the estimated channel gain of channel a when the historical channel gain matrix is $G_k$.

• 3) The fusion network calculates the final optimal decision $a_k$ based on $X_k$ and $Y_k$; a sketch of this three-sub-network structure follows the list.
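The following PyTorch sketch realizes the three sub-networks under stated assumptions: the layer counts and widths are illustrative stand-ins for the values in table 1 (not reproduced here), and concatenating $X_k$ and $Y_k$ before the fusion network is one plausible reading of figure 3.

```python
import torch
import torch.nn as nn

A, K = 20, 40  # channels, history depth

class HIFDQN(nn.Module):
    def __init__(self):
        super().__init__()
        # AJN: convolutions extract jamming/interference patterns from R_k.
        self.ajn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * K * A, 256), nn.ReLU(),
            nn.Linear(256, A))                  # X(R_k, a) for each channel a
        # CGPN: fully connected layers predict the next gain vector from G_k.
        self.cgpn = nn.Sequential(
            nn.Flatten(),
            nn.Linear(K * A, 256), nn.ReLU(),
            nn.Linear(256, A))                  # Y(G_k, a), estimate of g_{k+1}
        # FN: fuses the two intermediate vectors into final action values.
        self.fn = nn.Sequential(
            nn.Linear(2 * A, 128), nn.ReLU(),
            nn.Linear(128, A))                  # Q(S_k, a)

    def forward(self, R, G):
        X = self.ajn(R.unsqueeze(1))            # add the conv input channel
        Y = self.cgpn(G)
        Q = self.fn(torch.cat([X, Y], dim=-1))
        return X, Y, Q

# Decision step 3): a_k = argmax_a Q(S_k, a)
net = HIFDQN()
R, G = torch.randn(1, K, A), torch.randn(1, K, A)
X, Y, Q = net(R, G)
a_k = Q.argmax(dim=-1).item()
```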

For the updating process of the network parameters, the traditional DQN first constructs the target output from recorded replay experiences, and then updates the parameters in back-propagation (BP) mode. However, the HIF-DQN designed in this paper has many layers and a complex structure; directly applying the BP algorithm over the whole network would make the supervision signal weak and convergence difficult. In order to make the training direction of each sub-network clearer, a corresponding target value is constructed for each sub-network to realize rapid updating of the parameters of each layer. The specific target values of the sub-networks are designed as follows:

The anti-jamming network. In order to decompose the complex frequency channel selection problem, we first design a sub-MDP whose goal is not to select the channel with the highest throughput but to select the channel with the least jamming or interference. In this sub-MDP, R is the environment state, a is the frequency action, and the immediate reward $u^X$ reflects only whether the selected channel is jammed or interfered. According to the above description, the sub-MDP is exactly the same as the MDP in [20], except that the immediate reward is different. Therefore, similar to the method in [20], in order to enable the AJN to fit the action-value function X(R,a), the corresponding target value $\eta_i^X$ is designed as

$$\eta_i^X = u^X + \gamma \max_{a'} X(R', a'; \theta_{i-1}^X), \tag{11}$$

where $\theta_i^X$ denotes the network weights of the AJN at the i-th iteration.

        Fig. 3. The neural network structure of HIF-DQN.

The channel gain prediction network. The main task of the channel gain prediction network is to predict the channel gain of each channel at the next moment. Therefore, the channel gain value at the next time can be taken directly as the target value,

$$\eta_i^Y = \mathrm{Front}(G'), \tag{12}$$

where Front(G′) represents the first row vector of the matrix G′, e.g., $g_{k+1} = \mathrm{Front}(\{g_{k+1}, g_k, \ldots, g_{k-K}\}^T)$. In essence, the work of this sub-network is to predict the future channel gain $g_{k+1}$ based on the historical channel gain information $G_k$, which is not a difficult task for a neural network. The loss function of the CGPN is obtained as

$$L_i^Y(\theta_i^Y) = \mathbb{E}\left[\sum_{q} \left(\eta_i^Y(b_q) - Y(G, b_q; \theta_i^Y)\right)^2\right],$$

where $\theta_i^Y$ is the network weights of the CGPN at the i-th iteration, and $b_q$ is the index of a probe channel. However, in practical applications, the acquisition of G is not easy, as it is unrealistic to probe all channels during every slot. One feasible way is to utilize the channel gain of the current channel, or to probe several channels randomly.
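Under the NaN-masking convention from the probing sketch above, the CGPN loss restricted to probed channels might look as follows; handling a batch with one shared probe set is a simplifying assumption.

```python
import torch

def cgpn_loss(Y_pred, g_next, probed):
    """Squared error between the CGPN prediction Y(G, a; theta^Y) and the
    target eta^Y = Front(G') = g_{k+1}, over the probed channels b_q only."""
    mask = torch.zeros_like(Y_pred, dtype=torch.bool)
    mask[..., probed] = True
    return ((Y_pred - g_next)[mask] ** 2).mean()
```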

The fusion network. The fusion network is responsible for integrating the channel gain and jamming state evaluations produced by the first two networks, and ultimately determines the optimal frequency channel. Its immediate reward is therefore the utility function defined by equation (2). Assuming that $\theta_i^F$ is the network weights of the fusion network at the i-th iteration, the corresponding target value is designed as

$$\eta_i^F = u + \gamma \max_{a'} Q(S', a'; \theta_{i-1}^F). \tag{13}$$

The loss function of the FN is then obtained as $L_i^F(\theta_i^F) = \mathbb{E}\left[\left(\eta_i^F - Q(S, a; \theta_i^F)\right)^2\right]$. Although this is very similar to the traditional DQN, it is only used to update the weights of the fusion network, rather than the weights of the whole network.
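Equations (11) and (13) share the same temporal-difference form, so one helper can serve both the AJN and the FN. For brevity the sketch below evaluates the next-state values with the current network rather than a frozen copy of the previous-iteration weights $\theta_{i-1}$, which a faithful implementation would keep separate.

```python
import torch

def td_target(reward, next_values, gamma=0.9):
    # eta = u + gamma * max_a' V(next, a'); gradients blocked, standing in
    # for the previous-iteration weights theta_{i-1}.
    with torch.no_grad():
        return reward + gamma * next_values.max(dim=-1).values

def subnet_loss(values, actions, targets):
    # Squared TD error on the value of the action actually taken
    # (actions must be a LongTensor of channel indices).
    taken = values.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return ((targets - taken) ** 2).mean()
```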

Based on the above definitions of the target values, the update process of the sub-network weights is described as follows. First, the experience $e_k$ at each time slot k is stored in the data set $D_k = (e_0, e_1, \ldots, e_k)$. Unlike traditional DRL algorithms, some intermediate variables $(X_k, Y_k)$, the probe action $b_k$ and the rewards of the sub-networks are added to $e_k$. When the size of $D_k$ is big enough, the update process of the network weights is started. As usual, a mini-batch of records is randomly selected from $D_k$, and then the target values are constructed. The difference is that the heterogeneous information fusion learning algorithm needs to construct the target values separately for each sub-network based on equations (11), (12) and (13). Finally, the losses and their partial derivatives are calculated, and the weights of each sub-network are updated by a gradient algorithm. Summarizing the decision and updating processes, the heterogeneous information fusion deep reinforcement learning (HIF-DRL) algorithm for intelligent frequency selection is given in Algorithm 1.
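Putting the pieces together, one HIF-DRL update step might look like the sketch below, reusing the HIFDQN model and the loss helpers from the previous sketches. The exact tuple layout of $e_k$ is our assumption, since Algorithm 1 itself is not reproduced here.

```python
import random
from collections import deque

class ReplayBuffer:
    """D_k with extended experiences: besides (S, a, u, S'), each record
    keeps the intermediate probe action b_k and the sub-network reward u_X."""
    def __init__(self, capacity=10000):
        self.data = deque(maxlen=capacity)
    def push(self, e):            # e = (R, G, a, b, u, u_X, R2, G2, g_next)
        self.data.append(e)
    def sample(self, n):
        return random.sample(self.data, n)

def train_step(net, optimizer, R, G, a, b, u, u_X, R2, G2, g_next, gamma=0.9):
    X, Y, Q = net(R, G)           # sub-network outputs for the sampled states
    X2, _, Q2 = net(R2, G2)       # next-state values for the TD targets
    loss = (subnet_loss(X, a, td_target(u_X, X2, gamma))    # AJN, eq. (11)
            + cgpn_loss(Y, g_next, b)                       # CGPN, eq. (12)
            + subnet_loss(Q, a, td_target(u, Q2, gamma)))   # FN, eq. (13)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```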

        IV. NUMERICAL RESULTS AND DISCUSSIONS

        4.1 Simulation parameters

In the simulation setting, the user coexists with 2 jammers and 8 other interference users in an HF communication environment containing 20 channels of 3 kHz, that is, M = 2, N = 8, A = 20.

For convenience of simulation, all participants adopt the same slot structure, in which the slot time is set to 50 ms. All participants use fixed power to send signals; specifically, the power of the user is p = 20 dBm, the power of the interference users is $p^I$ = 20 dBm, the power of the jammers is $p^J$ = 30 dBm, and the noise power is 10 dBm on all channels. The channel model uses the Gaussian scattering function in [22], where the delay spread is set to 0.5 ms and the Doppler frequency spread is set to 1.0 Hz. The root mean square of all gain variables is set to 1.0. The user records the spectrum and gain information within a 2-second range as the basis for learning and decision, which means K = 40. The user utility parameters are set as follows: μ = 1.0, $L_I$ = 1.0, and $L_J$ = 0.5. To match these parameters, the specific parameters of the layers of the designed HIF-DQN are shown in table 1.
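For reproducibility, the simulation constants above can be collected in one place; every value in this dictionary is taken from the text, and the key names are ours.

```python
# All values below are taken directly from the simulation setting above.
SIM = dict(
    M=2, N=8, A=20,              # jammers, interference users, channels
    slot_ms=50,                  # slot length
    p_dBm=20, pI_dBm=20, pJ_dBm=30, noise_dBm=10,
    delay_spread_ms=0.5, doppler_spread_Hz=1.0, rms_gain=1.0,
    K=40,                        # 2 s of history at 50 ms per slot
    mu=1.0, L_I=1.0, L_J=0.5,
)
```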

The interference users adopt an adaptive frequency mode: only when the channel fails to satisfy the communication condition in two continuous slots do they switch to a new channel that meets the communication requirement. The intelligent decision-making processes of the interference users are not simulated here; instead, it is assumed that they can obtain the best decision directly. The jammers adopt an intelligent frequency sweeping method, which adjusts the direction of frequency scanning according to the spectrum occupancy on both sides of their current frequency channel, as sketched below.
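The paper gives no pseudo-code for the opponents, so the sketch below only encodes the two behaviours described above: the two-bad-slot switching rule of the interference users and the occupancy-driven sweep direction of the jammers.

```python
import numpy as np

def interferer_step(bad_slots, channel_ok, free_channels, rng=np.random):
    """Two consecutive unsatisfactory slots trigger a switch to any channel
    that meets the communication requirement (the 'best decision directly'
    assumption in the text)."""
    bad_slots = 0 if channel_ok else bad_slots + 1
    if bad_slots >= 2 and len(free_channels) > 0:
        return 0, int(rng.choice(free_channels))   # reset counter, new channel
    return bad_slots, None                         # stay on the current channel

def jammer_step(chan, occupancy, A=20):
    """Sweep towards the side of the current channel with the higher
    spectrum occupancy (occupancy is a length-A array)."""
    left, right = occupancy[:chan].sum(), occupancy[chan + 1:].sum()
    step = 1 if right >= left else -1
    return int(np.clip(chan + step, 0, A - 1))
```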

        4.2 Simulation results

In this section, the detailed numerical simulation results and analysis are given. All results are obtained by performing 500 independent trials and then averaging.

First, we study the channel prediction performance versus the probing number PN; the comparison results are shown in figure 4. It can be seen that the mean square error (MSE) of the predicted channel gain decreases as PN increases, which is in accordance with our expectations: with more probe information, we get better prediction results. In practice, however, increasing PN brings complexity and time cost. Fortunately, even with only one probed channel, which means that no additional probing channels are needed (the channel gain can be calculated directly from the currently received signal), the MSE still improves significantly with iteration. Also, the MSE performance with PN = 5 is close to that with PN = 20, which illustrates that the effect of probing all channels can almost be achieved by probing a small subset of channels. These results show that, although the parameters of the channel model are unknown, the CGPN can achieve fine channel prediction with incomplete historical information through continuous supervision and training.

        Table I. The specific structural parameters of layers of HIF-DQN.

Fig. 4. The channel prediction performance versus the probing number PN of channels.

Second, we investigate the throughput performance of the HIF-DRL with different settings of PN; the comparison results are shown in figure 5. The throughput performance improves with iteration for every setting of PN and converges after 2000 iterations. In the initial stage, because the environment is completely unknown, the performance of all settings is basically the same as that of a random decision algorithm. After convergence, however, the throughput increases with PN. When PN is set to 20, the performance after convergence reaches 0.9, which means that even under intelligent jamming and crowded spectrum occupation, the HIF-DRL algorithm can still avoid the jamming and interference with a large probability and select channels that meet the transmission requirement. Even without a special probing channel (PN = 1), the performance is also significantly improved (the successful transmission probability is about 65%), which again illustrates the excellent performance of the HIF-DRL.

        Fig. 5. The throughput performance of the HIF-DRL with different setting of PN.

        Fig. 6. The performance comparison between HIF-DRL and ADRL.

Finally, the performance comparison between HIF-DRL (PN = 20) and the ADRL in [20] is given in figure 6, where ADRL-1 represents the algorithm that utilizes the spectrum information R only, and ADRL-2 represents the algorithm that utilizes R and G simultaneously but regards them as the same type of input to the deep neural network. Comparing the performance after convergence, ADRL-1 performs worst because it does not consider the time-varying characteristics of the channel; the performance gap between ADRL-1 and HIF-DRL quantifies the impact of the time-varying channel states on throughput. Although ADRL-2 uses the same spectrum and channel gain information as HIF-DRL, its performance is inferior because the same network structure is used for the two kinds of information. Comparing the speed of convergence, ADRL-2 is significantly slower than ADRL-1 and HIF-DRL. The reason is that HIF-DRL designs different processing structures and optimization objectives for the two kinds of inputs, so the training goal is more prominent. For ADRL-1, the time-varying channel is treated as a random factor, and selecting the channel with lower jamming or interference for the next slot is a good decision; hence its convergence is fast thanks to the simple goal. In summary, all the simulation results show the strong learning ability of HIF-DRL for the frequency selection problem in the HF communication environment.

        V. CONCLUSION

In this article, we investigated the frequency channel selection problem in the complex HF communication environment, in which the channel states are time-varying and the spectrum environment is full of mutual interference and malicious jamming. To accurately grasp the characteristics of the complex environment, we defined a composite state containing the spectrum state and the channel gain state. Considering that the spectrum state and the channel gain state are heterogeneous information, we designed the HIF-DQN scheme and proposed the HIF-DRL algorithm for processing and learning heterogeneous information. Simulation results in various settings were presented to validate the proposed frequency channel selection approach. Future work on designing multi-user HIF-DRL algorithms is ongoing.

        ACKNOWLEDGEMENT

This research work was supported by the Guangxi Key Laboratory Fund of Embedded Technology and Intelligent System under Grant No. 2018B-1, the Natural Science Foundation for Distinguished Young Scholars of Jiangsu Province under Grant No. BK20160034, the National Natural Science Foundation of China under Grants No. 61771488, No. 61671473 and No. 61631020, and in part by the Open Research Foundation of the Science and Technology on Communication Networks Laboratory.
