INTRODUCTION
Hand, foot and mouth disease (HFMD) is a cluster of manifestations with fever and exanthema of the hands, feet and mouth caused by human enteroviruses. It commonly occurs in children aged <5 years throughout the world [Reference Seong1–Reference Ji3]. In China, HFMD had been a nationally notifiable disease since 2008, and cases must be reported to the National Disease Surveillance Reporting and Management System within 24 h. Since 2010, HFMD has become the infectious disease with the largest number of cases in Sichuan province, China. More than 40 000 cases were reported annually with an incidence rate of over 50/100 000 population, causing a serious social and economic burden.
Epidemic forecasting models were regarded as important tools to predict the occurrence of infectious diseases and formulate reasonable short-term or long-term precautions. Certain factors such as speed of pathogen variation, accumulation of susceptible hosts and environmental change allow infectious diseases to be modelled [Reference Guan, Huang and Zhou4]. More statistical methods have been used for incidence prediction of infectious diseases, including linear regression, correlation coefficient analysis, grey swing models and back propagation artificial neural network models [Reference Wang5–Reference Clement8]. In addition to these, autoregressive integrated moving average (ARIMA) models, which consider changing trends, periodic changes and random disturbances, have been widely used in modelling the temporal dependence structure of time series.
Our study developed a stochastic ARIMA model and then forecast the HFMD incidence in Sichuan province. To the best of our knowledge, this study is the first to apply an ARIMA model to fit and predict HFMD incidence in this area, whereas only Huang et al. used ARIMA to predict Chinese HFMD incidence prior to our research [Reference Huang9].
MATERIALS AND METHODS
Data collection
The observed monthly cases of HFMD were extracted from the Chinese National Disease Surveillance Reporting and Management System from January 2010 to June 2014.
Method
The ARIMA model was designed to take advantage of the associations in the sequentially lagged relationships that exist in periodically collected data [Reference Akhtar and Rozi10]. Autoregressive (AR) specifies that the output variable depends linearly on its previous values, while moving average (MA) is the linear regression of the current value of the series against current and previous terms. If the raw data showed evidence of non-stationarity, a differencing step (the integrated part of the model) should be applied to remove it.
There are three parameters in the ARIMA model: p, d and q, which refer to the order of the autoregressive, integrated and moving average parts of the model, respectively. As the data showed evidence of seasonal tendency, the general multiplicative seasonal model [ARIMA (p,d,q) × (P,D,Q) s )] was used. The parameter P indicates the order of seasonal autoregression; D, the degree of seasonal difference; and Q, the order of seasonal moving average [Reference Wentong11].
The process of predicting incidence by the ARIMA model consists of four steps. First, the original time series should be transformed by square root or logarithmic algorithm, difference or seasonal difference to induce stationarity. Second, the model identification used autocorrelation function (ACF) and partial autocorrelation function (PACF) analyses to analyse the random, stationary and seasonal effects on the time-series data. Third, the most appropriate model would be selected using parameter estimation and model testing. Diagnostic checking parameters, which include the coefficient of determination (R 2), normalized Bayesian Information Criterion (BIC) and mean absolute percentage of error (MAPE) were used to compare the goodness-of-fit of the models and determine statistical significance (P < 0·05). The optimum model should have the highest R 2 and the lowest MAPE and BIC. The Ljung–Box test was used to diagnose whether the residual error sequence was a white-noise sequence. Finally, predictive analysis could be conducted with the optimum parameter combination [Reference Chatfield12–Reference Bowerman and O'Connell14].
The data were analysed using the appropriate module in SPSS version 19.0 (SPSS Inc., USA) and R version 3.1.2 (R Foundation, Austria).
RESULTS
Spatio-temporal pattern
The monthly HFMD incidence from January 2010 to June 2014 in Sichuan province was between 280 and 10 929, with an incidence rate between 0·3482 and 13·3525/100 000 population.
In terms of the temporal distribution of the HFMD cases, Figure 1 shows that the monthly incidence exhibited obvious seasonal fluctuation, with peaks between May and July.
Figure 2 shows the spatial distribution of the annual incidence rate of HFMD cases at the municipal level. High-prevalence areas included the provincial capital Chengdu and its surrounding cities, as well as the northeastern and southwestern areas of Sichuan province.
Model identification
Time-series data from January 2010 to June 2014 was used as the training set. Given that the raw data had a non-stationary variance, it was converted into a square root to reduce the variance.
Figure 3 shows that the ACF and PACF of the original data had a non-stationary variance, and the effects of linear and seasonal trends were eliminated by taking the difference and seasonal difference. Figures 4 and 5 show that the ACF and PACF of the new data tended to be stationary after using the first-order seasonal difference, which determined that d and D in the ARIMA (p,d,q) × (P,D,Q)12 have values of 0 and 1, respectively.
Model diagnosis
Based on the distribution characteristics, several model algorithms with the parameters p, q, P and Q with non-negative integer values from 0 to 3 were used. Table 1 shows the parameter estimation for plausible ARIMA models. Seven models were found to have statistically significant parameters.
AR, Autoregressive; MA, moving average.
Based on the results of the goodness-of-fit test statistics, we confirmed the optimal ARIMA (1,0,1) × (0,1,0)12 model, which had the highest R 2 (0·692), lowest BIC (15·982) and relatively low MAPE (5·265) among the seven plausible models (Table 2). The Ljung–Box test also showed that its residual was white noise with Q = 9·456 (P = 0·893), indicating that the fitted data series was stationary, random and zero-related. Figure 6 shows that the ACF and PACF of the residual sequence fell within the random confidence interval.
MAPE, Mean absolute percentage of error; BIC, Bayesian Information Criterion.
Furthermore, with the existence of seasonal trends, we conducted a sine function to fit the same series and obtained two equations:
From the results of the goodness-of-fit, the BIC values of the sine function were significantly larger than those of the ARIMA model.
Forecast and analysis
The ARIMA (1,0,1) × (0,1,0)12 model was applied to forecast HFMD incidence, and its predicted data and 95% confidence limits from July to December 2014 are shown in Table 3 and Figure 7. The predicted data matched the actual data well except for August, but the actual incidence in August still fell within the predicted 95% confidence interval. The average relative error was 15·31, which was also less than that of the other six models (15·52–28·52).
LCL, Lower confidence limit; UCL, upper confidence limit.
DISCUSSION
Since 2008, the incidence of HFMD has continued to rise in China, causing widespread social concern and fear [Reference Chang15]. Hence, exploring effective prediction methods of HFMD has great practical significance. Time-series prediction is very helpful in developing hypotheses to explain and anticipate the dynamics of the observed and subsequent disease status because it is based on the changes in historical datasets over time and produces mathematical models that can be extrapolated [Reference Kuhn, Davidson and Durkin16, Reference Liang17]. The ARIMA model has been used widely in medical research as it provides a comprehensive model in time-series analysis [Reference Silawan18–Reference Mumbare20]. Through the ARIMA procedure, we can directly foresee the future incidence trend and develop niche targeting preventive and control measures. In addition, the 95% confidence interval range of the predictive value can also be used for early warning of infectious diseases. An infection during a specific time period that is over the upper limit of the confidence interval may be a warning that a certain cluster or outbreak may occur.
Several studies have used ARIMA models to fit and predict the changing trends of infectious diseases and achieved good results [Reference Earnest21, Reference Li22]. In our study, we obtained a multiplicative ARIMA model to eliminate the influence of seasonal tendency. A seasonal module ensured that the ARIMA model provided the best forecast possible. Based on the testing results, the conducted model (1,0,1) × (0,1,0)12 was reliable with high validity and can be used to forecast the expected numbers of cases. The forecast results also matched the actual data well.
However, few predictions would occasionally have a significant difference from the actual value even by the best-fit model, such as the forecast in August 2014, which was obviously higher than the actual level in our study. Many natural, social and environment factors could affect the incidence of HMFD, including the pathogen, environment and host factor [Reference Hii, Rockl and Ng23–Reference Zou25]. More than 30 different enteroviruses can cause HFMD; the most common pathogens being EV71 and CoxA16. The capacity of each pathogen to influence disease transmission varies. In Sichuan province, major pathogens presented alternative states in different periods, causing incidence fluctuations. The behaviour of susceptible populations could exacerbate transmission risk, e.g. not washing the hands often or contact with patients and their items. Many studies mentioned that temperature, humidity and other weather conditions could also influence the spread of HFMD. High temperature and relative humidity are generally considered to increase the incidence [Reference Onozuka and Hashizume26–Reference Feng, Yu and Hu28]. In addition, Chinese schools have two vacations annually: the summer vacation from January to February and the winter vacation from July to September. During these times, children usually stay at home, with few opportunities for contact with one another; thus, the incidence of HFMD may fall.
Data quality is also significant. First, in order to conduct an effective ARIMA model, the time sequence must be sufficiently long, and observations should be added continually to the sequence over time. Consequently, new parameter combinations of the ARIMA model might be refitted to the new series. Second, missing reports of the Chinese National Disease Surveillance Reporting and Management System also affected the results of the forecast. According to our survey of infectious disease reporting quality in Sichuan province, the missing rates of HFMD were between 1·7% and 8·5% from 2010 to 2014. When evaluating the disease trend, missing reports should also be taken into account.
DECLARATION OF INTEREST
None.