Xiangpeng Li | Hong Yu | Yongfang Xie | Jie Li
1 Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China
2 School of Information Science and Engineering, Central South University, Changsha, China
3 School of Metallurgy and Environment, Central South University, Changsha, China
Abstract: Data (features, characteristics or variables) are commonly collected at different sampling frequencies in fields such as economics and industry. Existing methods usually either ignore the differences between sampling frequencies or pay little attention to the inherent temporal characteristics of mixed frequency data. The authors propose an innovative dual attention-based neural network for mixed frequency data (MID-DualAtt) that utilizes the inherent temporal characteristics and selects the input characteristics reasonably without losing information. To the authors' knowledge, this is the first study to use the attention mechanism to process mixed frequency data. The MID-DualAtt model uses the frequency alignment method to transform the high-frequency variables into observation vectors at low frequency, and an attention mechanism selects the input characteristics that are most critical for the current prediction index. The temporal characteristics are explored by an encoder-decoder with attention based on long short-term memory (LSTM) networks. The proposed MID-DualAtt has been tested in a practical application, and the experimental results show that it has better prediction ability than the compared models.
Time series processing is a complex prediction problem. The main objective of time series prediction is to predict specific targets by using as much of the information in the historical data as possible. This work is crucial in finance and industry. The classical model commonly used for time series prediction is the autoregressive (AR) model [1], which predicts the value of a variable in the current period from the values of the same variable in previous periods. An improvement on the AR model is the moving average (MA) model, which focuses on the accumulation of the error terms in the AR model and can effectively eliminate random fluctuations in the prediction. The combination of AR and MA is the autoregressive integrated moving average (ARIMA) model [2]; it does not directly consider changes in other related random variables. To solve the problem caused by the assumption of constant variance of time series variables in traditional econometrics, Bollerslev proposed the generalised autoregressive conditional heteroskedasticity (GARCH) model [3], which was developed from the autoregressive conditional heteroskedasticity model. In reality, time series data contain many variables with different sampling frequencies, as in economic data: quarterly gross domestic product (GDP) data, monthly consumer price index data, daily income data, etc. At present, most methods for time series processing consider only a single frequency, so the diverse information and volatility in the high-frequency data are ignored, which causes information loss. To model mixed frequency data directly and make full use of this information, Ghysels et al. proposed the mixed frequency data sampling (MIDAS) model [4] on the basis of the distributed lag model. MIDAS can exploit the rich information in high-frequency data and avoid information loss to a certain extent. So far, it has shown good experimental results in macroeconomic prediction [5], volatility prediction [6], consumption prediction [7], price prediction [8], etc.
In fact, most of the relations between variables in a system are nonlinear, especially in time series data [9]. Artificial neural networks (ANNs) are data-driven, self-learning, self-adaptive and robust to interference, which makes them well suited to exploring complex nonlinear patterns. Lapedes proposed using an ANN [10] to explore nonlinear patterns in time series data, after which Gonzalez Miranda and Burgess proposed using a multi-layer perceptron [11] to predict the volatility of the IBEX 35 stock index. Oliveira presented a methodology for time series prediction models [12] in which five machine learning models, including a neural network, were compared experimentally; the conclusion was that the neural network captures nonlinear features well and performs outstandingly in time series prediction. Traditional ANNs of this kind only explore nonlinear patterns without considering temporal features. Recurrent neural networks (RNNs) are specifically designed to explore temporal features and have been widely applied in time series modelling and analysis. The common RNN suffers from vanishing and exploding gradients; fortunately, the long short-term memory (LSTM) network addresses this problem and has been more widely used. Chen et al. used the LSTM network [13] to predict the return rate of China's market, confirming that it performs better than the feed-forward neural network model in the prediction of sequence data. Fischer and Krauss used the LSTM network to predict the S&P 500 index [14] and found that the LSTM network is better than memory-free classification methods such as random forest, deep neural network and logistic regression.
Combining traditional econometric methods with deep learning methods has been found to achieve good results in time series data processing, and integrated models also outperform single models. Bildirici and Ersin proposed combining ANN with the autoregressive conditional heteroskedasticity (ARCH) model and its variants GARCH (Generalised), EGARCH (Exponential), TGARCH (Threshold), PGARCH (Power) and APGARCH (Asymmetric Power) [15]. Kim and Won proposed combining the LSTM network with multiple autoregressive conditional heteroskedasticity models [16] and predicted stock price volatility on the KOSPI 200 stock index; these models can also be extended to other fields. On the one hand, econometric methods are used to extract the variables in economics; on the other hand, the neural network is used to explore the nonlinear patterns of time series data. However, these models can only process data of equal frequency, whereas time series data are often sampled at mixed frequencies. Xu et al. proposed Midas-ANN [17], a hybrid model that can directly process mixed frequency data without any pre-processing; this was also the first time that a neural network was applied directly to mixed frequency data, and experiments confirmed the validity of the model. In practice, mixed frequency data are basically time series, such as financial data. Variables in these data are highly correlated in time, so temporal features should be taken into account when processing mixed frequency data. However, most recent methods consider only the nonlinear patterns of variables in mixed frequency data; they neither consider the temporal features nor select the input features. This is because, when multiple input features are available, the network cannot explicitly select the relevant input features to make predictions.
To address these issues, an innovative neural network called the MID-DualAtt model is proposed to perform time series prediction for mixed frequency data. In this model, the frequency alignment of MIDAS is used to transform the high-frequency variables into low-frequency observation vectors that retain most of the information in the high-frequency data, because frequency alignment does not simply sum or interpolate the mixed frequency data. The neural network adopts an encoder-decoder framework with a two-stage attention mechanism: the first stage extracts the relevant input features at each time step according to the previous encoder hidden state, while the second stage selects the relevant encoder hidden states across all time steps. In this way, MID-DualAtt can adaptively select the most relevant input features and appropriately capture the long-term temporal dependencies of the mixed frequency data. The practical application shows that the model has high utility value and broad application prospects. Combining deep learning with the traditional econometric model provides a new approach to processing mixed frequency data. In particular, the main contributions of the present study are as follows.
• First, an innovative attention-based neural network model called MID-DualAtt is proposed. High-frequency variables are converted into low-frequency vectors through frequency alignment, an efficient conversion that makes the most of the information in the mixed frequency data; the neural network is then used to make predictions.
• Second, to improve prediction accuracy, the encoder-decoder framework with a two-stage attention mechanism is adopted as the neural network, which not only selects the most relevant input features but also makes use of the temporal features in the mixed frequency data. The complete model is implemented in TensorFlow and can be trained using standard back propagation.
• Third, China's inflation index is predicted using the proposed model. To justify the effectiveness of MID-DualAtt, it is compared with other approaches on the same dataset; the experiments show that the proposed model has better prediction accuracy.
The core idea of the MIDAS model is to assume that the information content of the high-frequency observations within the sample interval follows a specific distribution form, and to reduce the number of parameters to be estimated while retaining the information content of the high-frequency data. The model is based on regression with time series data of different frequencies, which can effectively extract the information in high-frequency data and use it to predict low-frequency time series. The idea of MIDAS comes from the distributed lag model. The regression equation of the model estimates the value of the dependent variable from the current and previous values of the explanatory variables, and uses 'aggregation parameterization' so that the high-frequency data can serve as explanatory variables of the low-frequency data without integration.
Suppose $q_t$ is the time index of the low-frequency data, $t_m$ is the time index of the high-frequency data, and $m$ is the frequency ratio between the high-frequency and low-frequency data. For example, the ratio between quarter and month is $m = 3$. $w$ is the time (in units of the high-frequency time index) by which the high-frequency data can be advanced relative to the low-frequency data, which means that $m - w$ steps can be advanced for prediction. For the basic MIDAS model, there are prediction models with $h_q$ steps ahead, which can be divided into single-variable and multivariable cases. For a single variable,
A parsimonious parameterization of $c$ is the key to the MIDAS model, which commonly uses distributed lag polynomials such as the exponential Almon lag polynomial (5) and the Beta polynomial (6) [18, 19].
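The two lag polynomials named above can be illustrated with a short sketch. It assumes the commonly used two-parameter forms: exponential Almon weights proportional to exp(θ1·k + θ2·k²), and Beta weights proportional to x^(a−1)(1−x)^(b−1) on an evenly spaced grid inside (0, 1). The function names, the grid and the `eps` clipping are illustrative assumptions, not the paper's implementation.

```python
import math


def exp_almon_weights(theta1, theta2, n_lags):
    """Normalized exponential Almon lag weights for lags k = 1..n_lags."""
    raw = [math.exp(theta1 * k + theta2 * k * k) for k in range(1, n_lags + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def beta_weights(a, b, n_lags, eps=1e-6):
    """Normalized Beta lag weights on an evenly spaced grid inside (0, 1).

    The grid and the eps clipping are one common variant (an assumption here).
    """
    if n_lags == 1:
        return [1.0]
    raw = []
    for k in range(n_lags):
        x = eps + (1.0 - 2.0 * eps) * k / (n_lags - 1)
        raw.append(x ** (a - 1.0) * (1.0 - x) ** (b - 1.0))
    total = sum(raw)
    return [r / total for r in raw]


def midas_aggregate(high_freq_lags, weights):
    """Weighted sum of high-frequency lags: one low-frequency regressor."""
    return sum(w * x for w, x in zip(weights, high_freq_lags))


w = exp_almon_weights(0.1, -0.05, 6)
```

Whatever the number of lags, only two parameters per high-frequency variable are estimated, which is the point of the parameterization.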
The RNN is a kind of network with a memory function, which is especially suitable for time-related problems. In the RNN, neurons can receive information not only from other neurons but also from themselves, which forms a network structure with loops. However, it suffers from the vanishing and exploding gradient problems. To solve these problems, the RNN has been improved, and the most effective way is to add a gate mechanism. There are two main types of RNNs based on the gate mechanism: the LSTM network and the gated recurrent unit (GRU) network, which have overcome these limitations and achieved great success in various applications [20].
In a two-layer feed-forward neural network, connections exist between adjacent layers, while there are no connections between nodes within the hidden layer. In the simple RNN, feedback connections from the hidden layer to itself are added. Suppose that at time $t$ the input to the network is $x_t$; the hidden layer state $h_t$ is related not only to the input $x_t$ at the current moment but also to the hidden layer state $h_{t-1}$ at the previous moment. Then we have $z_t = U h_{t-1} + W x_t + b$ and $h_t = f(z_t)$, where $z_t$ is the net input of the hidden layer, $f(\cdot)$ is the nonlinear activation function, $U$ is the state-to-state weight matrix, $W$ is the input-to-state weight matrix, and $b$ is the bias term. These two formulas are usually written together as $h_t = f(U h_{t-1} + W x_t + b)$.
If the state at each moment is regarded as a layer of a feed-forward neural network, the RNN can be regarded as a neural network that shares weights along the time dimension. Figure 1 shows the RNN unrolled in time.
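The recurrence $h_t = f(U h_{t-1} + W x_t + b)$ and its unrolling in time can be sketched in a few lines. This is a minimal illustration that assumes $f = \tanh$ and uses plain Python lists; the helper names `mat_vec`, `rnn_step` and `rnn_unroll` are hypothetical.

```python
import math


def mat_vec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def rnn_step(h_prev, x_t, U, W, b):
    """One step h_t = tanh(U h_{t-1} + W x_t + b)."""
    z = [u + w + bias
         for u, w, bias in zip(mat_vec(U, h_prev), mat_vec(W, x_t), b)]
    return [math.tanh(z_i) for z_i in z]


def rnn_unroll(xs, h0, U, W, b):
    """Apply the same weights at every time step (weight sharing in time)."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(h, x_t, U, W, b)
        states.append(h)
    return states


U = [[0.5, 0.0], [0.0, 0.5]]   # state-to-state weights
W = [[1.0], [-1.0]]            # input-to-state weights
b = [0.0, 0.0]
states = rnn_unroll([[1.0], [0.0]], [0.0, 0.0], U, W, b)
```

Each loop iteration corresponds to one layer of the unrolled network, and all layers reuse the same `U`, `W` and `b`.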
LSTM is a variant of the RNN with the following improvements. The LSTM network introduces a new internal state $c_t$ specifically for linear recurrent information transmission: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, $h_t = o_t \odot \tanh(c_t)$, $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$, where $f_t$, $i_t$, $o_t$ are three gates that control the path of information transmission, $\odot$ is the element-wise product of vectors, $c_{t-1}$ is the cell state at the previous time step, and $\tilde{c}_t$ is the candidate state obtained through a nonlinear function.
FIGURE 1 Recurrent neural network based on time expansion
At each time $t$, the internal state $c_t$ of the LSTM network records the historical information up to the current time. The three gates in the above formulas are the input gate $i_t$, the forget gate $f_t$ and the output gate $o_t$. A gate in the usual sense is a binary variable taking values in {0, 1}, where 0 represents the closed state, in which no information is allowed to pass through, and 1 represents the open state, which allows all information to pass through. In the LSTM network, the gate is a soft gate whose value lies in (0, 1), indicating that information passes through in a certain proportion. Figure 2 shows the structure of the LSTM network. The functions of the three gates are as follows: the forget gate $f_t$ controls how much information of the internal state $c_{t-1}$ at the previous moment needs to be forgotten; the input gate $i_t$ controls how much information of the candidate state $\tilde{c}_t$ at the current time needs to be saved; and the output gate $o_t$ controls how much information of the internal state $c_t$ at the current time needs to be output to the external state $h_t$.
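The interplay of the three gates can be made concrete with a single LSTM step. The text gives the updates for $c_t$, $h_t$ and $\tilde{c}_t$; the gate formulas below (a sigmoid of an affine map of $x_t$ and $h_{t-1}$) are the standard form and are assumed here, as are the names `affine`, `lstm_step` and `params`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-based matrices and vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update; params maps 'f', 'i', 'o', 'c' to (W, U, b) triples."""
    def pre(name):
        W, U, b = params[name]
        return affine(W, x_t, U, h_prev, b)

    f = [sigmoid(z) for z in pre("f")]          # forget gate (assumed form)
    i = [sigmoid(z) for z in pre("i")]          # input gate (assumed form)
    o = [sigmoid(z) for z in pre("o")]          # output gate (assumed form)
    c_tilde = [math.tanh(z) for z in pre("c")]  # candidate state
    # c_t = f_t * c_{t-1} + i_t * c~_t ; h_t = o_t * tanh(c_t)
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# With all-zero weights every gate equals 0.5, so half of c_{t-1} is kept.
zero = ([[0.0]], [[0.0]], [0.0])
params = {name: zero for name in "fioc"}
h_t, c_t = lstm_step([1.0], [0.0], [2.0], params)
```

The zero-weight example shows the soft-gate behaviour directly: a gate value of 0.5 lets exactly half of the previous cell state through.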
It has been found that the biological neural network in the brain has limited capacity. People receive all kinds of information from the outside world at every moment, such as vision, hearing and touch, and the human brain, with limited resources, cannot process this overload of information all at once. Therefore, there are two important mechanisms in the brain's nervous system, namely attention and memory. The attention mechanism filters out a large amount of useless information through information selection so as to retain valuable information.
The attention mechanism in deep learning is inspired by how the human brain works. This mechanism was first proposed in the field of computer vision, and its main idea is to assign more attention to the target area of focus. The attention mechanism has also been applied in natural language processing [21], for example in machine translation. It has become a research focus in deep learning and has generated a large body of work, such as sequence-to-sequence models that transform an input sequence into a new output sequence [22]. In essence, the attention mechanism is a process of weight allocation. Specifically, the calculation of attention in a neural network is divided into two steps: calculating the attention distribution over all input information, and then calculating the weighted average of the input information according to that distribution.
First, given a task-related query vector $q$, the attention variable $z$ is used to represent the index position of the selected information: $z = i$ means that the $i$th input information is selected. A soft information selection mechanism is adopted, so given $q$ and $X$, the probability of choosing the $i$th input information is $\alpha_i$,
\[ \alpha_i = p(z = i \mid X, q) = \mathrm{softmax}\big(s(x_i, q)\big) = \frac{\exp\big(s(x_i, q)\big)}{\sum_{j} \exp\big(s(x_j, q)\big)}, \]
where $\alpha_i$ is called the attention distribution and $s(x_i, q)$ is the attention scoring function. It can be calculated in several ways, for example with the additive model $s(x_i, q) = v^{T} \tanh(W x_i + U q)$, where $W$, $U$ and $v$ are learnable parameters.
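The two-step calculation described above (attention distribution, then weighted average) can be sketched as follows. The sketch assumes the standard additive scoring function $v^{T}\tanh(W x_i + U q)$; the function names and the toy inputs are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_score(x_i, q, W, U, v):
    """Additive scoring: v^T tanh(W x_i + U q) (assumed standard form)."""
    hidden = [
        math.tanh(sum(w * a for w, a in zip(W_row, x_i))
                  + sum(u * c for u, c in zip(U_row, q)))
        for W_row, U_row in zip(W, U)
    ]
    return sum(v_k * h_k for v_k, h_k in zip(v, hidden))


def attend(X, q, W, U, v):
    """Step 1: attention distribution; step 2: weighted average of inputs."""
    alpha = softmax([additive_score(x_i, q, W, U, v) for x_i in X])
    dim = len(X[0])
    context = [sum(a * x_i[d] for a, x_i in zip(alpha, X)) for d in range(dim)]
    return alpha, context


# With zero weights all scores are equal, so attention is uniform.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, context = attend(X, [1.0], [[0.0, 0.0]], [[0.0]], [1.0])
```

In the uniform case the context vector reduces to the plain mean of the inputs; trained weights would shift the distribution toward the inputs most relevant to the query.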
FIGURE 2 Long short-term memory network architecture
This section proposes a hybrid MIDAS dual attention (MID-DualAtt) model for processing mixed frequency data; the structure of the model is shown in Figure 3. In this model, the mixed frequency data $x$ are taken directly as the input without any pre-processing. Through the frequency alignment part, the high-frequency variables are converted into observation vectors at low frequency and then sent to the neural network part. The neural network adopts an encoder-decoder framework with a two-stage attention mechanism, which is used to select input features and extract temporal features. In particular, the input attention mechanism selects appropriate features from the input $x$ to obtain a new time series, which then serves as the input of the encoder-decoder network for prediction. We use a standard gradient algorithm to update the weights and biases of the model iteratively, and training stops after the specified number of iterations or when the error no longer changes.
In this part, the frequency alignment in the MIDAS model is used to transform the high‐frequency variables into observation vectors at low frequency instead of simply adding or
In the following, an example is given to illustrate frequency alignment. Suppose $y_t$ is a low-frequency variable sampled quarterly, and we want to use the high-frequency variable $x_t$ sampled monthly to forecast it. In this example, there are three months in each quarter, so the frequency ratio is $m_i = 3$. Assuming that the data of the current quarter and the previous quarter are explanatory, we can model the low-frequency variable $y_t$ for each quarter $t$: $y_t$ is a linear combination of the high-frequency variables $x_{3t}$, $x_{3t-1}$, $x_{3t-2}$ observed in quarter $t$, and the high-frequency variables $x_{3(t-1)}$, $x_{3(t-1)-1}$, $x_{3(t-1)-2}$ and the low-frequency variable $y_{t-1}$ observed in quarter $t-1$, which is expressed in matrix form as shown here:
Through this operation, we convert the variable sampled at high frequency into the observation vector $(x_{3t}, \ldots, x_{3t-5})^{T}$ at low frequency. The high-frequency variable $x$ and the output $y_t$ now have the same frequency. On this basis, we assume that $z_t$ is a high-frequency variable sampled weekly, and this variable is also added to the above model. Generally speaking, there are 12 weeks in a quarter, so $m_i = 12$. Similarly, we use the data of the current quarter and the previous quarter as explanatory variables and again model $y_t$ for each quarter $t$. This means that, for quarter $t$, we model $y_t$ as a linear combination of the variables $x_{3t}$, $x_{3t-1}$, $x_{3t-2}$ and $z_{12t}$, …, $z_{12(t-1)-11}$ observed in quarter $t$ and quarter $t-1$. The model in matrix form is shown as follows:
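The alignment step in this example can be sketched in code: each low-frequency period $t$ receives the vector of the most recent high-frequency observations, newest first. The function name `align_frequency` and the toy monthly series are illustrative assumptions; the paper itself performs this step with a MATLAB package.

```python
def align_frequency(x_high, m, n_lags):
    """Map a high-frequency series to low-frequency observation vectors.

    x_high : list of high-frequency observations, oldest first
    m      : number of high-frequency observations per low-frequency period
    n_lags : number of high-frequency lags kept per period

    Returns {t: [x_{m t}, x_{m t - 1}, ..., x_{m t - n_lags + 1}]} using
    1-based period indices t; periods without enough history are skipped.
    """
    n_low = len(x_high) // m
    rows = {}
    for t in range(1, n_low + 1):
        end = m * t              # 1-based index of the last observation in t
        start = end - n_lags
        if start < 0:
            continue             # not enough lags available yet
        rows[t] = x_high[start:end][::-1]
    return rows


# Twelve months labelled 1..12, i.e. four quarters with m = 3.
monthly = list(range(1, 13))
rows = align_frequency(monthly, m=3, n_lags=6)
```

For quarter 2 the row is the six months 6, 5, 4, 3, 2, 1, which matches the vector $(x_{3t}, \ldots, x_{3t-5})^{T}$ with $t = 2$. A weekly variable would be aligned the same way with `m=12`.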
FIGURE 3 Schematic diagram of the MID-DualAtt model
The encoder-decoder is a common framework in deep learning, widely used in machine translation, speech recognition and other natural language processing tasks. The traditional encoder-decoder framework loses information when the input sequence is too long. A study found that human attention in feature extraction is divided into two stages [23]: the first stage selects the elementary stimulus features, and the second stage uses categorical information to decode the stimulus. Inspired by this, we adopt a two-stage attention mechanism to reduce the information loss and select the relevant input features.
An encoder with an input attention mechanism is introduced, which is mainly used to extract the relevant input features. The input sequence is the obtained mixed frequency data $X = (x_1, x_2, \ldots, x_T)$ with $x_t \in \mathbb{R}^n$, where $n$ is the number of input features. The encoder maps $x_t$ to $h_t$ at time $t$ via $h_t = f_1(h_{t-1}, x_t)$, where $h_{t-1}$ is the hidden state of the encoder at time $t-1$ and $f_1$ is a nonlinear activation function, for which an LSTM unit or a GRU unit can be used. In the proposed method, we use the LSTM unit. The LSTM unit has a cell state $c_t$ at time $t$, which is controlled by three gates with sigmoid functions: the forget gate $f_t$, the input gate $i_t$ and the output gate $o_t$. The LSTM unit updates each state according to the following formulas:
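In the standard LSTM form, written with the concatenation $[h_{t-1}; x_t]$ and the weights $W_f, W_i, W_o, W_c$ and biases $b_f, b_i, b_o, b_c$, the updates read as follows. This is a reconstruction under the assumption that the standard form is intended, not a verbatim copy of the original equations.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}; x_t] + b_f\right), \\
i_t &= \sigma\left(W_i [h_{t-1}; x_t] + b_i\right), \\
o_t &= \sigma\left(W_o [h_{t-1}; x_t] + b_o\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c [h_{t-1}; x_t] + b_c\right), \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
```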
where $[h_{t-1}; x_t]$ is the concatenation of the previous hidden state $h_{t-1}$ and the current input $x_t$; $W_f$, $W_i$, $W_o$, $W_c$ and $b_f$, $b_i$, $b_o$, $b_c$ are parameters to learn; and $\sigma$ and $\odot$ are the logistic sigmoid function and element-wise multiplication, respectively. We use the LSTM network here because its cell structure can avoid the vanishing and exploding gradient problems of the RNN and can better capture long-term dependencies.
where $[d_T; k_T]$ is the concatenation of the decoder hidden state and the context vector. The linear function with weights $v_y$ and bias $b_v$ produces the final prediction result.
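The decoder-side computation can be sketched as follows: temporal attention forms the context vector $k_T$ as a weighted sum of encoder hidden states, and a linear read-out maps $[d_T; k_T]$ to the prediction. The intermediate projection `W_y`, `b_w` and the way the attention scores are obtained are assumptions borrowed from common dual-stage attention designs; only $v_y$ and $b_v$ are named in the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_context(encoder_states, scores):
    """Context vector: attention-weighted sum of encoder hidden states."""
    beta = softmax(scores)
    dim = len(encoder_states[0])
    return [sum(b * h[j] for b, h in zip(beta, encoder_states))
            for j in range(dim)]


def readout(d_T, k_T, W_y, b_w, v_y, b_v):
    """Prediction y = v_y^T (W_y [d_T; k_T] + b_w) + b_v.

    W_y and b_w are hypothetical names for an assumed projection layer.
    """
    concat = d_T + k_T  # list concatenation plays the role of [d_T; k_T]
    hidden = [sum(w * c for w, c in zip(row, concat)) + b
              for row, b in zip(W_y, b_w)]
    return sum(v * h for v, h in zip(v_y, hidden)) + b_v


encoder_states = [[1.0, 0.0], [0.0, 1.0]]
k_T = temporal_context(encoder_states, scores=[0.0, 0.0])
y_hat = readout([2.0], k_T, W_y=[[1.0, 1.0, 1.0]], b_w=[0.0],
                v_y=[2.0], b_v=1.0)
```

With equal scores the context is the mean of the encoder states; in a trained model the scores would depend on the decoder state so that the most relevant time steps dominate.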
Here, we use a practical application to test the proposed MID-DualAtt model. First, the experimental data are introduced: the monthly inflation index is predicted using monthly macroeconomic variables and daily financial market variables. We then compare MID-DualAtt with other competitive models and find that it performs better in terms of predictive accuracy.
FIGURE 4 The time series plot of inflation
Time series forecasting is a very popular research topic in finance, including stock price index forecasting [24], stock price index volatility forecasting [25], output growth forecasting [26], GDP index forecasting and so on. In this case, we forecast China's monthly inflation index, which plays an important role in guiding government personnel in formulating economic policies and in the country's macroeconomic control. In reality, many economic variables affect the inflation index, such as the per capita consumption level, stock price index, housing price index and foreign exchange tax rate. Thus, we use such economic variables to predict the inflation index, specifically daily economic variables such as Oil, ER, Stock and SHIBOR, as well as monthly observed indicators such as M2, IP and House. These data are obtained from the Juling financial database, the FRED database and the Shanghai Interbank Offered Rate. To keep the data consistent in time and space, we use the data from October 2006 to April 2018, which comprise 139 months of monthly data and 3058 days of daily data. Figure 4 plots the inflation series from 2006 to 2018. We use the following split: the first 98 months, from October 2006 to December 2014, form the training (in-sample) set, and the data from January 2015 to April 2018 form the test (out-of-sample) set. We use multi-step-ahead prediction to calculate the root-mean-square error (RMSE) and set the high-frequency prediction horizon $h_m = \{m/m, 2m/m, 3m/m\}$, so that the corresponding low-frequency prediction horizon is $h = \{1, 2, 3\}$.
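The evaluation protocol described above (a chronological in-sample/out-of-sample split, scored by RMSE) can be sketched as follows. The helper names and the toy numbers are illustrative and do not correspond to the paper's data.

```python
import math


def chronological_split(series, n_train):
    """Split a time-ordered series without shuffling."""
    return series[:n_train], series[n_train:]


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("sequences must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy series: the first 7 points are in-sample, the last 3 out-of-sample.
series = [float(v) for v in range(10)]
train, test = chronological_split(series, n_train=7)
```

Keeping the split chronological matters for time series: shuffling would let the model see observations that lie in the future of the points it is evaluated on.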
The proposed method is ultimately a prediction model that is smooth and differentiable, so the mean squared error can be used as the loss function,
\[ \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2, \]
FIGURE 5 RMSE changes with time steps
where $N$ represents the number of training samples. To speed up the iterations of the model, the mini-batch stochastic gradient descent algorithm with the Adam optimizer [27] is used for training. In this experiment, we set the mini-batch size to 10. Setting the learning rate to 0.001 and reducing it by 10% every 1000 training iterations gives the model a better iterative effect. The model is set to train for a total of 10 times, or the early stopping technique is used to stop training when the loss value no longer changes. We used RMSE as the objective function, which is a commonly used evaluation measure for prediction models [28].
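The learning-rate schedule and the stopping rule described above can be sketched without any deep learning framework. The step-wise decay follows the stated numbers (0.001, reduced by 10% every 1000 iterations); the `patience` and `min_delta` parameters of the early-stopping check are hypothetical, since the text only says that training stops when the loss no longer changes.

```python
def learning_rate(step, base_lr=0.001, decay=0.9, decay_every=1000):
    """Step-wise decay: multiply the rate by 0.9 after every 1000 steps."""
    return base_lr * decay ** (step // decay_every)


def should_stop(loss_history, patience=5, min_delta=1e-6):
    """Early stopping: no improvement larger than min_delta in the last
    `patience` recorded losses (both parameters are assumptions)."""
    if len(loss_history) <= patience:
        return False
    best_before = min(loss_history[:-patience])
    recent_best = min(loss_history[-patience:])
    return best_before - recent_best < min_delta


flat = [1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
falling = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
```

Here `should_stop(flat)` is true because the loss has plateaued, while `should_stop(falling)` is false because the loss is still decreasing.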
Model setting mainly concerns the selection of the network structure parameters: one is the time window size $T$, and the other is the number of hidden neurons $m/p$ of the encoder and decoder. We use the test set in the experiment for prediction and plot the RMSE against the values of $T$ and $m/p$. We used grid search to select the time window size $T$, testing the candidate set {4, 7, 11, 16, 23}. The RMSE changes with the time window size as shown in Figure 5, from which it can be seen that the minimum RMSE is obtained when $T = 11$. Similarly, grid search was used to select the number of encoder and decoder hidden neurons, testing the candidate set {2/3, 2/5, 2/7, 5/5, 7/5}. The results in Figure 6 show that the minimum RMSE was obtained when the encoder had 2 hidden neurons and the decoder had 5.
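The selection procedure amounts to evaluating each candidate and keeping the one with the lowest RMSE. The sketch below shows only the search loop; `toy_rmse` is a placeholder scoring function that does not reproduce the reported RMSE values, and in practice it would be replaced by training and evaluating the model for the given setting.

```python
def grid_search(candidates, evaluate):
    """Return (best_candidate, best_score) minimizing evaluate(candidate)."""
    best, best_score = None, float("inf")
    for cand in candidates:
        score = evaluate(cand)
        if score < best_score:
            best, best_score = cand, score
    return best, best_score


def toy_rmse(window):
    """Placeholder score (an assumption), standing in for a real
    train-and-evaluate run with time window size `window`."""
    return abs(window - 10) / 10.0


best_T, best_score = grid_search([4, 7, 11, 16, 23], toy_rmse)
```

The same loop applies to the encoder/decoder sizes by passing pairs such as `(2, 5)` as candidates and an `evaluate` function that accepts a pair.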
FIGURE 6 RMSE changes with number of hidden units
To predict the monthly inflation index, our output variable, we use the daily variables (oil price, Sino-US exchange rate, Shanghai Composite Index and Shanghai interbank offered rate) and the monthly variables (M2 money stock, industrial production index and housing price) as input variables; Table 1 displays summary statistics for all data. In terms of model setting, we set the number of hidden neurons in the first layer to 2 and in the second layer to 5, stop training after a maximum of 108 iterations or when the loss no longer decreases, and set the learning rate to $\eta = 0.001$. Using time series cross-validation (TSCV), we select the lag order $l_i$. Eight high-frequency lags and two low-frequency lags are selected through experiments, as shown in Table 2.
To demonstrate the effectiveness of our model, we adopt several competitive models for comparison, including the traditional econometric mixed data sampling (MIDAS) model, the unrestricted MIDAS (U-MIDAS) model and the classical ANN. Since the ANN model requires variables with the same observed frequency, we apply simple averaging to the daily predictors to generate the corresponding monthly data. The competitive methods also include the ANN-U-MIDAS model, published in Expert Systems with Applications in 2019, which combines the traditional artificial neural network with mixed data sampling [17]. Comparing the five methods, we found that the proposed model is able to explore the temporal features of mixed frequency data and, in particular, to select the most relevant input features, which improves prediction accuracy. The in-sample RMSE of multi-step-ahead prediction decreased by 0.063, 0.206 and 0.021, respectively, while the out-of-sample error of multi-step-ahead prediction decreased by 0.014, 0.04 and 0.021, respectively, as shown in Table 3; we observe that MID-DualAtt generally fits the ground truth much better than the other models. This suggests that adaptively extracting driving series can provide more reliable input features for accurate prediction.
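The simple averaging used to feed the same-frequency ANN baseline can be sketched as follows. The `(month_key, value)` input format and the function name are illustrative assumptions.

```python
def monthly_average(daily):
    """Average daily observations within each month.

    daily: iterable of (month_key, value) pairs in chronological order,
           for example ("2006-10", 61.2).
    Returns a list of (month_key, mean_value) in order of first appearance.
    """
    sums = {}
    for month, value in daily:
        total, count = sums.get(month, (0.0, 0))
        sums[month] = (total + value, count + 1)
    return [(month, total / count) for month, (total, count) in sums.items()]


daily = [("2006-10", 1.0), ("2006-10", 3.0), ("2006-11", 5.0)]
monthly = monthly_average(daily)
```

Averaging collapses all within-month variation into one number, which is exactly the information that frequency alignment is meant to preserve.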
We use mixed MATLAB and Python programming, and use a MATLAB package to perform frequency alignment of the mixed frequency data. Specifically, the variables that need frequency alignment are those observed daily: oil price, foreign exchange rate, Shanghai Composite Index and Shanghai interbank offered rate. Frequency alignment converts each high-frequency variable into an observation vector at low frequency; this depends only on the lag order $L_i$ and not on the frequency itself. The encoder-decoder framework can not only adaptively select the most relevant input features but also appropriately capture the long-term temporal dependencies of a time series. The Python version is 3.6, the IDE is PyCharm, and the neural network model is coded with the TensorFlow deep learning framework. The computer has an Intel i5-8500 CPU running at 3.0 GHz and 16.0 GB of memory (RAM).
TABLE 1 Description and summary of the dataset
TABLE 2 The optimal values of parameters selected by TSCV
TABLE 3 Results of RMSE both in-sample and out-of-sample
The authors addressed the prediction problem for mixed frequency data. Traditional neural network methods for mixed frequency data consider only the nonlinear characteristics of the variables; they consider neither the temporal characteristics of mixed frequency data nor the selection of input features. Thus, we adopted the attention mechanism to process mixed frequency data in this work. First, we use the MIDAS model to achieve frequency alignment and then feed the data into the neural network. Next, we use an encoder-decoder network based on the two-stage attention mechanism, which includes an encoder with an input attention mechanism and a decoder with a temporal attention mechanism. The attention mechanism can adaptively select the most relevant input features and capture long-term temporal dependence. For practical purposes, the determinants of inflation forecasts are investigated using both monthly macroeconomic variables and daily financial market variables. The experimental results show that the proposed model outperforms other methods in prediction accuracy on mixed frequency data. Despite several advantages of the MID-DualAtt model, there are some limitations that should be addressed in the future. First, like other traditional neural network models, MID-DualAtt is affected by overfitting. Second, without real-time external information, such as expert experience, news reports and financial policies, the model cannot make use of information that may affect prediction accuracy.
In the future, this study can be extended in two directions. First, we can add a weight decay penalty or regularisation term to the objective to avoid overfitting. Second, we can transform the extra information into a guide vector and combine it with the model to aid prediction, so that the model can adjust its predictions in time according to changes in external information. We suggest combining the traditional econometric model with deep learning, as it is an efficient tool for handling mixed frequency data and has broad application prospects.
ACKNOWLEDGMENTS
This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61876027, 61533020 and 61751312, and the key T&A program of Chongqing under Grant No. cstc2019jscx-mbdxX0048.
CAAI Transactions on Intelligence Technology, 2021, Issue 3