Chuwei Liu, Jianping Huang, Fei Ji, Li Zhang, Xiaoyue Liu, Yun Wei, Xinbo Lian
Collaborative Innovation Center for Western Ecological Safety, Lanzhou University, Lanzhou, China
Keywords: COVID-19; Prediction; Hybrid EEMD-ARMA method; Historical data
Abstract: In 2020, the COVID-19 pandemic spread rapidly around the world. To accurately predict the number of daily new cases in each country, Lanzhou University established the Global Prediction System of the COVID-19 Pandemic (GPCP). In this article, the authors use the ensemble empirical mode decomposition (EEMD) model and the autoregressive moving average (ARMA) model to improve the prediction results of GPCP. In addition, the authors make direct predictions for countries that have a small number of confirmed cases or are in the early stage of the disease, where the development of the pandemic does not fully follow the usual laws of infectious disease spread and cannot be predicted by the GPCP model. Judging from the results, when this method is used to revise the GPCP predictions, the absolute values of the relative errors for countries such as Cuba are reduced significantly and the predicted trends are closer to the real situation. For countries with a small number of cases, such as El Salvador, the absolute values of the relative errors of prediction become smaller. Therefore, this article concludes that the method is effective both for improving prediction results and for direct prediction.
Coronavirus disease 2019 (COVID-19) is a novel infectious disease caused by a virus closely related to the SARS (severe acute respiratory syndrome) virus. COVID-19 has caused hundreds of thousands of deaths worldwide and was declared a global pandemic by the World Health Organization (WHO) on 11 March 2020 (WHO, 2020a). The pandemic has far-reaching consequences beyond the spread of the disease itself and the quarantine measures taken against it, including political, cultural, and social implications. The WHO World Health Assembly made a global commitment to unite the world to fight COVID-19 (WHO, 2020b). The potential effects of COVID-19 have prompted extensive research into the characteristics of the virus. Because the virus is new, it is challenging to predict when the disease will disappear. However, it has been found that about 60% of confirmed global COVID-19 cases have occurred in places with temperatures of 5 °C to 15 °C. The pandemic spread to high latitudes in spring and summer, and countries located in mid-latitudes face the risk of a second wave of COVID-19 this autumn (Huang et al., 2020a). Therefore, short-term prediction is critical to better manage the societal, economic, cultural, and public health consequences of the pandemic (Petropoulos and Makridakis, 2020), especially in high-risk countries.
Researchers worldwide have been predicting the development of the outbreak by using existing mathematical and statistical methods, including stochastic simulations, lognormal distribution (Linton et al., 2020), machine learning, and artificial intelligence (Tuli et al., 2020). The SEIR (susceptible, exposed, infectious, and removed) and SIR (susceptible, infectious, and removed) infectious disease models are the most widely used (Wu et al., 2020; Wang et al., 2020; Yang et al., 2020). A global prediction system (Global Prediction System of the COVID-19 Pandemic; GPCP) based on the SIR model was recently developed (Huang et al., 2020b). The system determines its parameters by fitting historical data, which allows it to make targeted predictions for various countries and obtain better prediction results. However, the development of the epidemic is complicated, and the predictions of the GPCP model differ from the real data; thus, the results need further revision.
Various methods have been used to revise prediction results. For example, the analogue-dynamical approach is used to revise weather forecast models (Zheng et al., 2013; Yu et al., 2014). To modify the model results, it is necessary to analyze and predict the difference between the predicted results and the true values (forecast residuals). The residuals between the model fit and the historical data are nonstationary and nonlinear. In this study, we used the ensemble empirical mode decomposition (EEMD) method, which is an adaptive and temporally local data analysis method (Wu et al., 2007; Wu and Huang, 2009). EEMD is a time series analysis method based on the empirical mode decomposition (EMD) method (Huang et al., 1998; Huang and Wu, 2008), which decomposes complicated data series into a finite number of quasi-periodic components at different frequencies and is suitable for adaptive analysis of nonlinear and nonstationary time series. The EMD/EEMD method has been used to analyze nonlinear and nonstationary data in climatic and oceanic analyses (Wu et al., 2011; Ji et al., 2014; Chen et al., 2017) and for biomedical signal processing (Colominas et al., 2014).
Methods to predict time series include support vector machines (Wang et al., 2010), artificial neural networks (Jiang et al., 2003), and genetic programming (Koutroumanidis et al., 2009). Box and Jenkins introduced a time series analysis approach called the autoregressive moving average (ARMA) method (Box and Jenkins, 1976), which combines the advantages of the autoregressive (AR) model and the moving average (MA) model. The AR model quantifies the relationship between current data and previous data, and the MA model handles the stochastically changing terms. The ARMA model has been applied to forecasting meteorological elements (Torres et al., 2005) and macroeconomic evolution (Anghelache et al., 2016). Because the model needs only time series data, it is well suited to predicting the residuals of the infectious disease model.
In this paper, we report upon work to improve the GPCP by applying the ARMA and EEMD methods to the results of the SIR model for the number of new cases in each country. We then use the method to predict the number of new cases in countries with fewer cases.
We used the cumulative number of cases from the COVID-19 Data Repository published by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19). The number of new cases on a given day is the difference between the cumulative number of cases on that day and on the previous day. The fitting and prediction data are from the GPCP system, and temperature in the model was ignored.
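As a minimal illustration of this differencing step, the sketch below derives daily new cases from a cumulative series. The function name `daily_new_cases` and the sample numbers are hypothetical and are not taken from the CSSE repository.

```python
def daily_new_cases(cumulative):
    """Return day-over-day differences of a cumulative case series."""
    return [today - yesterday
            for yesterday, today in zip(cumulative, cumulative[1:])]


# Made-up cumulative counts, for illustration only.
cumulative_cases = [0, 2, 5, 5, 9, 14]
print(daily_new_cases(cumulative_cases))  # [2, 3, 0, 4, 5]
```

The result is one element shorter than the input, because the first day has no previous value to subtract.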
The EEMD method builds on the EMD method and improves it in several respects (Wu and Huang, 2009). White noise is added to the original sequence, and the sequence is decomposed into a set of amplitude-frequency-modulated oscillatory components (intrinsic mode functions). These steps are repeated using a different white noise sequence each time, and the ensemble mean of the corresponding intrinsic mode functions is taken as the final decomposition result. The detailed procedures can be found in previous studies (Huang et al., 1998; Huang and Wu, 2008; Wu and Huang, 2009).
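The noise-and-average step described above can be sketched as follows. This is a simplified illustration, not the implementation used in this study. The sifting routine that produces the intrinsic mode functions is left as a caller-supplied function `decompose` (hypothetical), assumed to return the same number of components on every call. `split_fast_slow` is only a crude stand-in that makes the example runnable; it is not EMD.

```python
import random
import statistics


def ensemble_decompose(series, decompose, noise_ratio, trials, seed=0):
    """Average the components of `decompose` over noisy copies of `series`.

    `decompose` is a caller-supplied function (hypothetical here) mapping a
    sequence to a fixed number of equal-length components.
    """
    rng = random.Random(seed)
    # Noise amplitude is set relative to the spread of the original sequence.
    noise_std = noise_ratio * statistics.pstdev(series)
    totals = None
    for _ in range(trials):
        noisy = [x + rng.gauss(0.0, noise_std) for x in series]
        components = decompose(noisy)
        if totals is None:
            totals = [[0.0] * len(series) for _ in components]
        for total, component in zip(totals, components):
            for i, value in enumerate(component):
                total[i] += value
    # Ensemble mean of each component across all trials.
    return [[value / trials for value in total] for total in totals]


def split_fast_slow(series):
    """Stand-in for EMD: a 3-point running mean (slow) and the remainder (fast)."""
    slow = []
    for i in range(len(series)):
        window = series[max(0, i - 1):i + 2]
        slow.append(sum(window) / len(window))
    fast = [x - s for x, s in zip(series, slow)]
    return [fast, slow]


signal = [10, 12, 9, 15, 11, 18, 14, 20, 16, 22]  # made-up values
fast, slow = ensemble_decompose(signal, split_fast_slow,
                                noise_ratio=0.1, trials=100)
```

Because each trial uses a different noise realization, the added noise largely cancels in the ensemble mean, so the averaged components sum to a close approximation of the original sequence.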
We first performed seven-point smoothing on the original residual sequence and then applied EEMD to the smoothed sequence. The ratio of the standard deviation of the added noise to that of the original sequence was set to 0.1, and the decomposition was repeated 100 times.
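A minimal sketch of seven-point smoothing is given below, assuming a centered moving average. The treatment of the first and last three points (repeating the edge values) is an assumption made for illustration, and the function name `seven_point_smooth` is hypothetical.

```python
def seven_point_smooth(series, window=7):
    """Centered moving average over `window` points.

    The ends are padded by repeating the first and last values, which is an
    assumed edge treatment, so the output has the same length as the input.
    """
    half = window // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(series))]


# A single spike of height 7 is spread evenly over the seven-point window.
print(seven_point_smooth([0, 0, 0, 7, 0, 0, 0]))
# [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

Smoothing of this kind suppresses day-to-day reporting noise before decomposition, at the cost of blurring sharp changes over about a week.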
In the $p$-order AR($p$) model, the current value of the time series is expressed as follows (Box and Jenkins, 1976; Wang et al., 2015):

$$y_t = \varphi_1 y_{t-1} + \varphi_2 y_{t-2} + \cdots + \varphi_p y_{t-p} + \varepsilon_t$$

In the $q$-order MA($q$) model, the $q$ previous values are expressed as random errors, and the current value of the time series is expressed as follows:

$$y_t = \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \cdots - \theta_q \varepsilon_{t-q}$$

From the above, the ARMA model is expressed as follows:

$$y_t = \varphi_1 y_{t-1} + \varphi_2 y_{t-2} + \cdots + \varphi_p y_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \cdots - \theta_q \varepsilon_{t-q}$$

where $y_t$ is the predictive value, $\varphi_i$ is the correlation coefficient with each previous value, $\theta_j$ is the correlation coefficient with the previous white noise, $\varepsilon_t$ is the white noise process with zero mean and variance $\sigma^2$, and $\varepsilon_{t-j}$ is the previous noise term.

From the perspective of fitting the real data, the residuals between the model fit and the real data are nonstationary and nonlinear time series. The EEMD method can extract signals from such sequences and decompose them into different oscillatory components. The GPCP system parameters are obtained by fitting real data, so the initial model prediction mainly represents the trend, which depends on human factors such as government policies. Therefore, the residuals basically reflect the oscillation of the infectious disease and are suitable for the EEMD method. The prediction of the ARMA method depends only on the time series, and no other information is needed. In addition, the number of new cases in countries with a small number of cases tends to show growing oscillations, decaying oscillations, or stable oscillations within a certain range, without an obvious peak, so the SIR model is not applicable. Owing to the advantages of the EEMD and ARMA methods, we propose a hybrid EEMD-ARMA method, which is motivated by the idea of “decomposition and ensemble” (Yu et al., 2008; Guo et al., 2012). The seven-point smoothed original residual sequence is decomposed into several subsequences by the EEMD method. Each subsequence is then predicted by the ARMA method, and the final predicted value is obtained by summing the predicted values of the subsequences. The hybrid EEMD-ARMA method works better on high-frequency oscillations. Therefore, in our work, we used the hybrid EEMD-ARMA method to process sequences containing high-frequency oscillations. To improve the quality of the prediction, the residual sequence is taken to start a few days before the number of new cases reaches its peak.
For the series of new cases that is predicted directly, the days around which the number of cases starts to fluctuate are selected as the starting time of the series. The procedures for the hybrid EEMD-ARMA prediction method are shown in Fig. 1.
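To make the ARMA prediction step concrete, the sketch below forecasts an ARMA(p, q) series when the coefficients are already known. It is an illustration under stated assumptions, not the fitting procedure used in this study. The coefficients `phi` and `theta` are supplied by the caller rather than estimated, and the moving-average terms enter with a minus sign (the Box and Jenkins convention). Pre-sample values and shocks are treated as zero, and the function name `arma_forecast` is hypothetical.

```python
def arma_forecast(history, phi, theta, steps):
    """Forecast `steps` values of an ARMA(p, q) series with known coefficients.

    Assumed convention:
        y_t = sum_i phi[i] * y_{t-1-i} + eps_t - sum_j theta[j] * eps_{t-1-j}
    """
    p, q = len(phi), len(theta)
    values = list(history)
    shocks = []  # one-step errors eps_t; pre-sample terms are taken as zero

    def ar_minus_ma(t):
        ar = sum(phi[i] * values[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * shocks[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        return ar - ma

    # Recover the in-sample shocks from the observed history.
    for t in range(len(values)):
        shocks.append(values[t] - ar_minus_ma(t))

    # Forecast recursively; future shocks are replaced by their mean, zero.
    for _ in range(steps):
        values.append(ar_minus_ma(len(values)))
        shocks.append(0.0)
    return values[len(history):]


# Pure AR(1) with coefficient 0.5: each forecast halves the previous value.
print(arma_forecast([8.0], phi=[0.5], theta=[], steps=3))  # [4.0, 2.0, 1.0]
```

In practice the orders and coefficients would be estimated from each subsequence, for example by maximum likelihood in a statistics package; that step is not shown here.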
Fig. 1. Flowchart of the prediction of the residuals (the step indicated by the dotted line is added when directly predicting the number of new cases and removed when predicting the residuals).
Seven-point smoothing is performed on the residual sequence of the fitting result. The smoothed sequence is decomposed by the EEMD method, and the residual component of the decomposition is removed. The first-order difference of each of the other components is calculated, and the ARMA model is then used to predict each component. The prediction results of the appropriate components are summed to give the final residual prediction.
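The difference, predict, and sum structure of this procedure can be sketched as follows. The sketch reads the differencing step as applying to each retained component, which is an interpretation. The predictor is passed in as `predict_differences` (hypothetical), and `mean_difference` is only a trivial stand-in for a fitted ARMA model. The component values are made up for illustration.

```python
def forecast_component(component, steps, predict_differences):
    """Difference a component, forecast the differences, and integrate back."""
    diffs = [b - a for a, b in zip(component, component[1:])]
    predicted_diffs = predict_differences(diffs, steps)
    forecast, level = [], component[-1]
    for d in predicted_diffs:
        level += d  # undo the first-order differencing
        forecast.append(level)
    return forecast


def hybrid_forecast(components, steps, predict_differences):
    """Sum the per-component forecasts into a single forecast."""
    per_component = [forecast_component(c, steps, predict_differences)
                     for c in components]
    return [sum(values) for values in zip(*per_component)]


def mean_difference(diffs, steps):
    """Stand-in predictor: repeat the mean past difference (not an ARMA fit)."""
    return [sum(diffs) / len(diffs)] * steps


# Two made-up components: a steady ramp and a constant.
components = [[1, 2, 3, 4], [10, 10, 10, 10]]
print(hybrid_forecast(components, steps=2,
                      predict_differences=mean_difference))  # [15.0, 16.0]
```

Keeping the predictor as an argument makes it straightforward to swap in an ARMA model per component while leaving the summation step unchanged.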
For countries with a small number of cases, after the original sequence of newly added cases is decomposed by EEMD, the components are not removed and the ARMA predictions are made for each component directly, yielding the prediction results.
The relative error is calculated as
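A minimal sketch of such a calculation follows. It assumes the conventional signed definition (predicted minus reported, divided by reported, in percent) and assumes that the error is taken over the totals of the prediction window. Both choices are assumptions made for illustration, and the function name `relative_error_percent` is hypothetical.

```python
def relative_error_percent(predicted, reported):
    """Signed relative error (%) of the total predicted cases over a window.

    Positive values mean over-prediction; negative values mean
    under-prediction (assumed sign convention).
    """
    total_predicted = sum(predicted)
    total_reported = sum(reported)
    return 100.0 * (total_predicted - total_reported) / total_reported


# Made-up 10-day series: predicting 110 cases per day against 100 reported.
print(relative_error_percent([110] * 10, [100] * 10))  # 10.0
```

With a signed definition of this kind, the comparison between countries uses the absolute value of the error, while the sign shows whether the prediction is too high or too low.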
Fig. 2 shows how correcting the residuals with the hybrid EEMD-ARMA method improves the predictions for 6 to 15 May 2020 in some countries. Judging from the relative error of the 10-day prediction before and after the revision, Italy shows the most obvious improvement: its relative error changes from 83.23% to -10.22%. The Netherlands has the best prediction after correction; its relative error is 35.65% before correction and -0.07% after correction. Using the GPCP system for direct prediction, only 15 of the 34 countries listed in Fig. 2 have a relative error with an absolute value of less than 40%. After correction, this number increases to 24. The correction therefore improves the predictions substantially and has the potential to be effective for future predictions.
Fig. 3 compares the predicted numbers of new cases before and after correction for the 10-day period (6-15 May) in six countries (Cuba, Romania, Italy, Spain, the Netherlands, and Sweden), and gives the respective relative errors. The peak number of new cases has already appeared in these countries; some are in a steady state (Spain and Sweden), and some are in a stage of decline in the number of new cases (Cuba, Romania, Italy, and the Netherlands). The predictions for all of these countries improve after correction. The absolute values of the relative errors for the six countries decrease by 32.44, 3.46, 73.01, 23.55, 35.58, and 21.78 percentage points, respectively. Judging from the revised results, the relative errors of the six countries are reduced, and the predicted trends are more in line with the real situation.
Fig. 2. The relative errors of the hindcast results before and after correction from 6 to 15 May 2020 in some countries.
Fig. 3. Projections and relative errors before and after correction in six countries ((a) Cuba, (b) Romania, (c) Italy, (d) Spain, (e) Netherlands, and (f) Sweden) from the date of the emergence of confirmed cases to 15 May 2020. The reported confirmed cases (the date of the emergence of confirmed cases to 15 May) are shown as blue lines, while the historical simulated cases (the date of the emergence of confirmed cases to 5 May) are shown as red lines. The hindcast cases before revising (6-15 May) and after revising (6-15 May) are shown as orange and green lines, respectively. The two black dotted lines represent the time when the original residual sequence was used for correction (Tp) and the time when the prediction starts (Predict). The bar graphs show the relative errors of the results of projections before (orange bars) and after (green bars) correction.
Fig. 4 shows the prediction results and relative errors of the new cases based on the EEMD-ARMA method for 10 days (6-15 May) in six countries. Some of these countries have a relatively small number of new cases (Sri Lanka), and the accuracy of the GPCP system prediction is low. Some countries are still in the stage of rapid increase in the number of new cases and have not yet reached a peak (El Salvador, Kuwait, South Africa, Sri Lanka, and Bolivia). When the EEMD-ARMA method is used directly for prediction, the absolute value of the relative error of the 10-day prediction in these six countries is less than 40%. The method performs well in El Salvador (-9.84%), Kuwait (-6.93%), South Africa (-0.09%), Sri Lanka (0.97%), and Bolivia (3.57%), where the fluctuation amplitude is small; the prediction for Sudan (-38.21%), which has a sudden increase in amplitude, is slightly worse. Overall, the method predicts the range of variation and the trend of new cases in these countries well.
Fig. 4. Projections and relative errors in six countries ((a) El Salvador, (b) Kuwait, (c) South Africa, (d) Sri Lanka, (e) Sudan, and (f) Bolivia) from the date of the emergence of confirmed cases to 15 May 2020. The reported confirmed cases (the date of the emergence of confirmed cases to 15 May) are shown as blue lines, and the hindcast cases (6-15 May) are shown as green lines. The two black dotted lines represent the time when the original residual sequence was used for correction (Tp) and the time when the prediction starts (Predict). The bar graphs show the relative errors of the results of projections.
COVID-19 has spread rapidly and has severely affected human health and economic development worldwide. Therefore, it is paramount to accurately predict the development of the epidemic in various countries to provide data for relevant organizations. Overall, the SIR model provides a good prediction, but it has some limitations. For example, there are errors in predictions for countries that enter a decline in new cases after the peak, and the predictions for countries that have not yet reached the peak are less accurate during the increase in cases. To improve our understanding of the global impact of COVID-19 and to better predict the number of COVID-19 cases in different countries, we developed a hybrid EEMD-ARMA method to correct the results of GPCP and make direct predictions for countries with small numbers of daily new cases. Our method provides more accurate and reliable predictions of the spread of COVID-19, and we hope the method will eventually inform strategic government responses.
Based on our results, for the cases in which the hybrid EEMD-ARMA method was used to make corrections and predictions, the changes and trends in the number of new cases over the 10-day hindcast were closer to the actual situation. The relative errors were lower, and the prediction was better. Fighting the epidemic requires a concerted international effort, which we believe will eventually control the disease.
Declaration of Competing Interest
No potential conflict of interest was reported by the authors.
Funding
This work was jointly supported by the National Natural Science Foundation of China [grant numbers 41521004 and 41875083] and the Gansu Provincial Special Fund Project for Guiding Scientific and Technological Innovation and Development [grant number 2019ZX-06].
Acknowledgments
The authors acknowledge the CSSE at Johns Hopkins University for providing the COVID-19 data.
Atmospheric and Oceanic Science Letters, 2021, Issue 4