BACKGROUND
Hand, foot, and mouth disease (HFMD) is an infectious disease caused by an enterovirus. It mainly affects children under five years old. HFMD is prevalent mainly in East and Southeast Asia, where it infects many people each year in places such as China, Malaysia, Japan, and Taiwan and is a serious public health problem [Reference Liu1]. According to a report from the Department of Disease Control and Prevention of the Chinese National Health and Family Planning Commission (CNHFPC), from January 2011 to December 2015, there were 10 527 500 HFMD infections in China, and 1967 deaths. HFMD has caused great suffering to afflicted children and their families. If rapid and low-cost disease surveillance and prediction were possible, public health departments could institute early prevention measures in places where children congregate (e.g. kindergartens) [Reference Wei2].
Prediction of HFMD epidemics in China has aroused wide interest, and various predictive models have been constructed. These include a dynamic model [Reference Li, Zhang and Zhang3], an autologistic regression model [Reference Bo4], a gray system GM (1,1) model [Reference Wang5, Reference Pan6], a neural network model [Reference Zhang7] and an auto-regressive integrated moving average (ARIMA) model [Reference Liu1, Reference Pan6, Reference Li, Li and Gu8–Reference Huang11]. Pan et al. found the ARIMA model preferable to the GM (1,1) gray system model in predicting HFMD by comparing their outputs [Reference Pan6]. More research on predicting China's HFMD epidemic has been based on the ARIMA model than on any other model. However, internet data have rarely been used for surveillance and prediction of HFMD outbreaks in China. Tracking and predicting infectious diseases using query data from internet searches has the advantages over other methods of high speed and low cost [Reference Ginsberg12, Reference Yuan13].
Previous research has shown that search engine query data can predict such epidemics as seasonal flu [Reference Ginsberg12, Reference Hulth, Rydevik and Linde14], human immunodeficiency virus (HIV) [Reference Jena15], rotavirus vaccination (RV) [Reference Desai16], West Nile virus (WNV) [Reference Desai16], respiratory syncytial virus (RSV) [Reference Carneiro and Mylonakis17] and methicillin-resistant Staphylococcus aureus (MRSA) [Reference Dukic, David and Lauderdale18]. However, research in Australia carried out by Page et al. found that internet queries did not help to predict suicide [Reference Page, Chang and Gunnell19]. Thus not all diseases can be monitored or tracked using internet search query data. Furthermore, the studies mentioned above are based on query data from Google or Yahoo, which are relevant mainly in English-language settings. It remains to be tested whether similar conclusions apply to other cultures, contexts, or search engines.
Baidu is the largest Chinese search engine and commands a marked lead among search engines with a 55% market share in China [20]. It should be noted that the epidemic outbreaks of influenza and erythromelalgia (EM) in China were successfully predicted by Baidu query data [Reference Yuan13, Reference Gu21]. This suggests that internet query data can be used to predict some infectious diseases in the Chinese context. However, in previous studies some predictive models took Baidu queries as the only independent variable [Reference Gu21], while the others added historical disease cases as another independent variable [Reference Yuan13, Reference Huang22]. Whether it is necessary to add historical disease cases needs to be tested.
For prediction of infectious disease epidemics in a given geographic region using internet query data, the search engine that owns the largest market share in the area should be chosen, so as to guarantee the representativeness of the data. Baidu is the most popular search engine in China and is preferred by 86·7% of internet users [Reference Gu21]. As estimated by Tech in Asia, compared with 3·3 billion daily searches of Google, Baidu's daily search volume reaches around 5 billion [23].
A recent study by Huang et al. [Reference Huang22] has estimated HFMD prevalence in Guangdong province, China, with Baidu search data. Taking into consideration spatial inconsistency between disease cases and internet queries in different areas, which might create a bias for epidemic prediction using search engine data, Huang et al. evaluated Baidu search data with the biased sentinel hospital-based area disease estimation (B-SHADE) model [Reference Wang24] by sampling those subspaces with a high correlation coefficient between HFMD cases and Baidu queries [Reference Huang22]. However, whether a revision of the Baidu index is necessary for HFMD prediction with big internet datasets is still not clear. In addition, no study has yet focused on the HFMD epidemic at the national level in China. Here we use Baidu queries to analyze whether online behaviors can predict HFMD outbreaks in China.
DATA AND METHODS
Query keywords
The query data chosen to monitor infectious diseases depend on the key words used to filter search records. A typical method directly generates key word combinations from the symptoms of diseases. Zeng and Wagner propose that patients' psychological status can be divided into four stages: the perception of symptoms, the explanation of symptoms, the expression of perception, and the search for solutions [Reference Zeng and Wagner25]. Patients (or their family members) at the second or third phase who seek online medical aid usually set key words through the description of symptoms [Reference Zeng and Wagner25], so that epidemics can be predicted by referring to the frequency of key words relevant to symptoms [Reference Ginsberg12, Reference Hulth, Rydevik and Linde14, Reference Carneiro and Mylonakis17, Reference Gu21]. For instance, Ginsberg et al. selected 45 key words related to symptoms of influenza-like illness (ILI), which they used to search Google records and detect the epidemic in the USA [Reference Ginsberg12].
Disease symptoms are not the only choice for key words to filter search queries in epidemic detection. Yuan et al. [Reference Yuan13] successfully predicted flu epidemics by Baidu queries with eight key words: ‘prevent influenza’, ‘influenza symptoms’, ‘type A influenza vaccine’, ‘flu symptom’, ‘flu epidemic’, ‘influenza virus’, ‘influenza pandemic’, and ‘type A influenza’ (in Chinese). Interestingly, these words use generic nouns but not ILI symptoms. Huang et al. [Reference Huang22] used the same method to choose key words for the prediction of HFMD epidemics from Baidu queries, but extended the number of key words to 11. Hulth et al. [Reference Hulth, Rydevik and Linde14] detected an influenza outbreak in Sweden by counting 20 types of web queries that contained ‘influenza’ or symptoms of ILI (in Swedish). Here not only ILI symptoms but also the name of the infectious disease were used as key words to filter search queries.
Prediction of epidemics can be achieved in various ways. A number of researchers choose the names of diseases as key words. For instance, Polgreen et al. [Reference Polgreen26] used the key words ‘flu’ and ‘influenza’ to predict flu epidemics in the USA; Dukic et al. [Reference Dukic, David and Lauderdale18] predicted MRSA hospitalization rates in the USA from quarterly variation in Google searches for ‘MRSA’ and ‘staph’; Jena et al. also reported that in the USA the annual incidence of HIV diagnosis is highly correlated with the frequency of Google searches for ‘HIV’ [Reference Jena15].
In practice, the psychological stage that a patient is in is not as clear-cut as Zeng and Wagner [Reference Zeng and Wagner25] claimed. Seeking online medical aid might occur at any time during the course of an infectious disease. When afflicted by a common illness, people often search for relevant information with the name of the disease as the key word. Moreover, in real life, many patients (or their family members) who already know about a disease still seek help through the internet after a doctor's diagnosis. Thus using the name of the infectious disease as a key word might be a simple but effective way to filter patients' (or their family members') search queries. As HFMD is a common infectious disease, the present study uses ‘手足口病’ (HFMD in Chinese) as the only key word to detect and collect relevant Baidu queries in China.
Data
Two types of data are needed: one is China's HFMD case numbers, and the other is Baidu queries for the entire country. We collected the two kinds of data for the 60-month period from January 2011 to December 2015. The count of HFMD cases is publicly available through ‘the report on nationally notifiable infectious diseases’ delivered monthly by the CNHFPC, whose website is http://www.moh.gov.cn/jkj/index.shtml. These data are monthly aggregated cases. The query data concerning HFMD were obtained through the ‘Baidu index’, a sharing platform of big data, whose website is http://index.baidu.com. The name of the disease was the only key word we used to analyze the queries in the Baidu search. As Baidu query data are available on a daily basis, the average value over a given month was treated as the monthly count for that month.
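The monthly aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `monthly_mean`, the input shape (a mapping from date to index value), and the sample values are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def monthly_mean(daily):
    """Average daily index values within each calendar month.

    `daily` maps a datetime.date to that day's index value
    (a hypothetical input shape chosen for this sketch).
    Returns a dict keyed by (year, month).
    """
    buckets = defaultdict(list)
    for day, value in daily.items():
        buckets[(day.year, day.month)].append(value)
    # Mean of the daily values is used as the monthly figure
    return {ym: sum(vals) / len(vals) for ym, vals in sorted(buckets.items())}


# Made-up values for illustration only (not real Baidu index data)
daily_index = {
    date(2011, 1, 1): 100.0,
    date(2011, 1, 2): 200.0,
    date(2011, 2, 1): 300.0,
}
monthly = monthly_mean(daily_index)
```

Averaging rather than summing keeps months of different lengths on the same scale, which matches the description of using the average daily value as the monthly count.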
Methods
Since there may be spatial heterogeneity in Baidu searching and HFMD cases due to China's vast diversity, we used Moran's I [Reference Moran27], Gi [Reference Getis and Ord28] and Q-statistic [Reference Wang, Zhang and Fu29] to test for spatial autocorrelation, spatial local heterogeneity, and spatial stratified heterogeneity. Results indicate that both Baidu queries and HFMD cases are spatially heterogeneous (data not presented here but available on request). However, the spatial distributions of HFMD cases and Baidu queries are highly correlated and exhibit similar trends over years, which implies that the potential impact of spatial heterogeneity on the relationship between Baidu searching behaviors and HFMD cases is small.
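For readers unfamiliar with the first of these statistics, a minimal sketch of global Moran's I is given here. It assumes a simple binary adjacency weight matrix; the weights actually used in the analysis are not stated, and the function name `morans_i` and the four-region example are hypothetical. Gi and the Q-statistic are not shown.

```python
def morans_i(values, weights):
    """Global Moran's I for regional values and a spatial weight matrix.

    values  : list of n observations (e.g. cases per region)
    weights : n x n nested list; weights[i][j] > 0 when regions i and j
              are treated as neighbours (assumed binary adjacency here)
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_w = sum(sum(row) for row in weights)
    # Weighted sum of cross-products of deviations between neighbours
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    var = sum(d * d for d in dev)
    return (n / total_w) * (cross / var)


# Four hypothetical regions in a row, each adjacent only to the next one
w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
i_stat = morans_i([1.0, 2.0, 3.0, 4.0], w)
```

Values near zero suggest little spatial autocorrelation, while positive values (as in this smoothly increasing toy example) indicate that neighbouring regions tend to have similar values.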
To further test the potential impact of spatial inconsistency between HFMD cases and Baidu queries in subareas, two kinds of models were constructed: one deals with potential bias between HFMD cases and Baidu queries by re-evaluating the Baidu index with the B-SHADE model, as was done by Huang et al. [Reference Huang22]; the other ignores such differences. We carried out the test under two conditions. First, HFMD epidemics in China in 2014 and 2015 (dependent variable) were predicted using a single independent variable, namely either the Baidu index or the revised Baidu index. Second, we added historical HFMD cases to the predictive models as a second independent variable. Mean absolute percentage error (MAPE) was used to evaluate the predictive accuracy of the different models.
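As a brief illustration of the accuracy measure, MAPE can be computed as below. This is a sketch under the assumption that MAPE is reported as a fraction rather than a percentage (so 0.28 means 28%); the function name `mape` and the sample numbers are made up for the example.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction.

    Each error is scaled by the corresponding actual value, so actual
    values must be non-zero.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Two hypothetical months: each prediction is off by 10% of the actual count
example = mape([100.0, 200.0], [110.0, 180.0])
```

Because each error is divided by the actual value, MAPE is scale-free, which is what allows models predicting different quantities to be compared.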
As Table 1 shows, under the first condition the model that did not re-evaluate the Baidu index with the B-SHADE model had better predictive accuracy. Under the second condition the predictive accuracy was very high. It seems that handling spatial inconsistency between HFMD cases and Baidu queries in sub-areas did not significantly improve the predictive accuracy. Re-evaluating the Baidu index is not helpful for HFMD prediction at national level in China. On the other hand, after adding historical HFMD cases to the predictive model, MAPE in 2014 was significantly reduced, but MAPE in 2015 increased a little. Thus adding historical disease cases might not generally increase accuracy of prediction, but the variation in MAPE is decreased, which could produce more stable predictions.
Note: Added, adding historical HFMD cases as an independent variable in the predictive model; Not added, remains unchanged; Handled, spatial inconsistency between HFMD cases and Baidu queries in subareas are handled and re-evaluated by the B-SHADE model; Not handled, remains unchanged.
We used log-linear regression to construct a predictive model of HFMD epidemics, with the number of HFMD cases as the dependent variable, and the number of Baidu queries and the number of historical HFMD cases as the independent variables. The model we use is presented below:
where $y_t$ represents the number of cases at time $t$, $y_{t-1}$ the number of cases at time $t-1$, and $\chi_t$ the number of queries at time $t$; $\alpha$, $\beta_1$, and $\beta_2$ are coefficients to be estimated, and $\varepsilon$ is the residual error.
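As a sketch of what a log-linear specification with these variables might look like, assuming natural logarithms are applied to both the case counts and the query counts (the exact transformation is an assumption made here for illustration):

```latex
\ln y_t = \alpha + \beta_1 \ln y_{t-1} + \beta_2 \ln \chi_t + \varepsilon
```

Under this form, $\beta_2$ would be read as an elasticity: the approximate percentage change in cases associated with a 1% change in query volume, holding the previous month's cases fixed.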
Since epidemics are dynamic, whether a prediction at a given time is correct or not may depend on the time. As time goes by, if the predictive curves at different stages fit the actual case number, the conclusion can be more persuasive than fit only at a single predictive stage. The predictive model of big data proposed by Ginsberg et al. adopts the strategy of analyzing and predicting in stages [Reference Ginsberg12]. Following that model, we used our model in equation (1) at three periods to predict the outbreak of China's HFMD epidemics in stages.
During a period of 60 months, the queries of Baidu at different stages were used to track and predict HFMD epidemics in China. In the first phase, data for the first 24 months (from January 2011 to December 2012) were used to build a model to predict the HFMD epidemic from the 25th to the 36th month. In the second phase, data for the first 36 months (from January 2011 to December 2013) were used to predict the HFMD epidemic from the 37th to the 48th month. For the third phase, data for the first 48 months (from January 2011 to December 2014) were used to predict the HFMD epidemic from the 49th to the 60th month.
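The three-phase scheme above is an expanding-window design, which can be sketched as follows. This is not the authors' SPSS workflow: it assumes a model of the form ln y_t = a + b1 ln y_{t-1} + b2 ln x_t, assumes one-step-ahead prediction using the observed count of the previous month, and uses a synthetic noiseless series. All function and variable names are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit(cases, queries):
    """Ordinary least squares for ln y_t = a + b1 ln y_{t-1} + b2 ln x_t (assumed form)."""
    rows = [[1.0, math.log(cases[t - 1]), math.log(queries[t])]
            for t in range(1, len(cases))]
    target = [math.log(cases[t]) for t in range(1, len(cases))]
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(3)]
    return solve(xtx, xty)


def staged_forecasts(cases, queries, train_sizes=(24, 36, 48), horizon=12):
    """Fit on the first n months, then predict the next `horizon` months one step ahead."""
    out = {}
    for n in train_sizes:
        a, b1, b2 = fit(cases[:n], queries[:n])
        out[n] = [
            math.exp(a + b1 * math.log(cases[t - 1]) + b2 * math.log(queries[t]))
            for t in range(n, min(n + horizon, len(cases)))
        ]
    return out


# Synthetic, noiseless 60-month series for illustration only (not surveillance data)
queries = [1000.0 + 500.0 * math.sin(2 * math.pi * t / 12) for t in range(60)]
cases = [50000.0]
for t in range(1, 60):
    cases.append(math.exp(0.5 + 0.6 * math.log(cases[-1]) + 0.4 * math.log(queries[t])))

forecasts = staged_forecasts(cases, queries)
```

Each entry of `forecasts` holds the 12 predictions for one phase, keyed by the number of training months. Refitting on a growing window means each phase is evaluated only on months that were not used to estimate its coefficients.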
A comparative method was used to analyze prediction of the HFMD epidemic based on the Baidu queries. First, we compared published predictive models that use search engine query data to predict various epidemics. The fits of the different models were assessed, principally using the correlation coefficient R. The greater the value of R, the better the prediction; an F-test was used to test for significance. Second, we compared the different methods used to predict HFMD epidemics in China. Directly comparing the mean relative errors of different models between the predicted value and the actual case number (or incidence of the disease) is not meaningful. Therefore, the predictive-effect indicators of the different models were transformed into MAPE, and these values were compared. The smaller the MAPE, the better the prediction. All statistical analyses were carried out using SPSS 19.
PREDICTIVE MODELS AND CONCLUSIONS
Our predictive model of China's HFMD epidemic applies equation (1) in three stages.
Stage 1
By fitting HFMD cases to Baidu query data from the first 24 months, we obtain the predictive model shown in equation (2):
R = 0·954, adjusted R² = 0·901, F = 105·460 (P < 0·001).
Stage 2
By fitting HFMD cases to Baidu query data from the first 36 months, we obtain the predictive model shown in equation (3):
R = 0·950, adjusted R² = 0·896, F = 152·107 (P < 0·001).
Stage 3
By fitting HFMD cases to Baidu query data from the first 48 months, we obtain the predictive model shown in equation (4):
R = 0·956, adjusted R² = 0·910, F = 238·056 (P < 0·001).
Analysis of the model in three stages shows that there is a strong correlation between the number of Baidu queries and the number of China's HFMD cases. R, the correlation with the predictive model, varies between 0·950 and 0·956; adjusted R² is between 0·896 and 0·910; the F-test is highly significant (P < 0·001). Thus our models fit very well. These three models were applied to predict HFMD epidemics in the 12 months subsequent to the dates of the data used to estimate the parameters.
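The three reported statistics are linked by standard identities for a regression with $k$ predictors fitted to $n$ observations. As a rough consistency check for Stage 1, assuming $n = 24$ monthly observations and $k = 2$ predictors (both values are assumptions; the sample size used in each fit is not stated):

```latex
\begin{aligned}
R^2 &= 0.954^2 \approx 0.910 \\
\bar{R}^2 &= 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
          \approx 1 - 0.090 \times \frac{23}{21} \approx 0.901 \\
F &= \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}
   \approx \frac{0.910 / 2}{0.090 / 21} \approx 106
\end{aligned}
```

Under these assumptions the adjusted $R^2$ agrees with the reported 0·901, and $F$ is close to the reported 105·460; a gap of this size is what one would expect from $R$ being rounded to three decimal places.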
Comparisons between predicted and actual numbers of HFMD cases are shown in Figures 1b, 1d, and 1f. The Pearson correlation coefficient between the predicted values and actual cases in the three stages (R = 0·950; P < 0·001) indicates that the predictive ability of our models is very strong. Figure 1 shows that the prevalence curves of Baidu queries and HFMD cases move in the same direction, and every 12 months (each year) there is a clear peak that marks that year's high point for HFMD. Thus, months such as the 6th (June 2011), the 17th (May 2012), the 30th (June 2013), the 41st (May 2014), and the 54th (June 2015) show that HFMD has a periodic peak, typical of seasonal infectious disease. This finding is consistent with the epidemiological features of HFMD described by Xing et al. [Reference Xing30] and Hu et al. [Reference Hu31]. Clearly, Baidu search query data can be used to track and predict the epidemic of HFMD quite well in China. These findings can be used to provide warnings before outbreaks and thus prevent or reduce the risks of infection.
Table 2 compares different models for epidemic prediction by internet queries. In our study, the average correlation coefficient R in the three models is 0·93, which is significant at the 0·1% level. This is better than those in models using Google, Yahoo, or Baidu queries to predict flu [Reference Ginsberg12, Reference Zeng and Wagner25], HIV [Reference Jena15], RV [Reference Desai16], and HFMD [Reference Huang22], and is similar to those using Baidu or Google queries to predict EM [Reference Gu21] and MRSA [Reference Dukic, David and Lauderdale18]. Thus, our method of predicting HFMD from Baidu search queries has high accuracy. Additionally, both common and rare infectious diseases could be successfully predicted, which suggests that internet queries are broadly applicable for tracking and predicting epidemics. The comparison in Table 2 also suggests that the predictive effect of Baidu queries in Chinese is better than that of Google and Yahoo queries in English. Unlike Chinese, English is rich in compound words and easy to misspell. In addition, Baidu has a higher share of the Chinese search engine market than Google and Yahoo have in the USA.
Note: R is the correlation coefficient between predicted values of the model and disease case counts; P value, significance level. If R (or R²) of predictive models is obtained more than once, we only give the average value.
A comparison of MAPE between the number of HFMD cases and the predictive value of different methods was made to analyze the prediction of HFMD epidemics using internet queries in China. Previous predictive studies, which adopted methods such as dynamic modeling [Reference Li, Zhang and Zhang3], autologistic regression [Reference Bo4], gray system GM (1,1) [Reference Wang5, Reference Pan6], and neural networks [Reference Zhang7], did not report errors in the predicted value; therefore the predictive value of these different models cannot be compared using MAPE. Huang et al. [Reference Huang22] used MAPE to describe the difference between fitted results and the real HFMD data, whereas what we care about is the divergence between the predicted value and the actual case number. Fortunately, most predictive studies that use the ARIMA model provide the MAPE of HFMD epidemic prediction. As Table 3 shows, the error in the predicted value varies between 0·15 and 1·3, and it is higher than 0·3 in most studies [Reference Liu1, Reference Li, Li and Gu8, Reference Hu10, Reference Huang11]. In the current study, which predicts the outbreaks of HFMD epidemics within 60 months, the absolute mean relative error is 0·28, indicating that our method has good predictive accuracy over a longer tracking and predictive period.
DISCUSSION
Infectious diseases, whose pathogens can spread in a variety of ways, may reach high prevalence over a vast region. Such diseases need to be monitored and predicted without delay. Internet search engines provide disease information or medical consultation online, and caregivers of patients may search for relevant topics when disease occurs. By analyzing IP or hardware addresses, query behaviors can be counted and located geographically. Therefore, the prevalence status of epidemics can be detected by analysis of online behavior data [Reference Ginsberg12]. This is very important in China, which is densely populated and has many seasonal and regional infectious diseases.
The current study, using Baidu queries in the Chinese context, manages to track and predict outbreaks of HFMD epidemics. Although there is spatially stratified heterogeneity in Baidu searching behavior and HFMD cases due to regional differences across China, our study shows that the spatially stratified heterogeneity has little negative effect on the relationship between HFMD cases and Baidu queries. Also, spatial inconsistency between HFMD cases and Baidu queries at the provincial level has little effect on prediction of HFMD epidemics with internet searching data at the national level. When detecting the outbreak of an epidemic with internet engine data, historical disease cases should be embedded into the predictive model, which might make predictive outputs more stable.
How to choose the key words to filter search records is a central problem in predicting epidemics with internet queries. Even for the same infectious disease (e.g. flu), different researchers have adopted different key words [Reference Ginsberg12–Reference Hulth, Rydevik and Linde14, Reference Zeng and Wagner25]. The unfamiliarity of the public with a disease might pose a challenge to using internet search queries [Reference Dukic, David and Lauderdale18]. For example, in the study on RV prediction, Desai et al. [Reference Desai16] used not only the correct term ‘rotavirus’ to filter internet queries, but also common misspellings such as ‘rota virus,’ ‘rotovirus,’ ‘roto virus,’ ‘rodo virus,’ and ‘rhoda virus’ [Reference Desai16]. Users may enter different terms for the same disease, depending on their level of education and their cultural and language backgrounds [Reference Carneiro and Mylonakis17]. Thus too many key words may not be a good choice for epidemic prediction by internet queries. In our study, the name of the infectious disease was used as the only key word, and this method appears to work simply but effectively for HFMD prediction.
From 2002 to 2003, there was an outbreak of severe acute respiratory syndrome (SARS) in China, which lasted for 8 months (from 16 November 2002 to 14 July 2003). The disease caused 7435 infections and 646 deaths, covering 32 regions (provinces, municipalities, autonomous regions, and special administrative zones) in China [Reference Fan and Ying32]. Important reasons for the panicked response of the Chinese public health department to that public health emergency were inaccurate estimation at the early stages and delayed reporting of the epidemic situation. Therefore, in 2004 the government of China adopted a policy of reporting, notifying, and announcing infectious disease epidemics. However, there is still much room for improvement of the system. On the one hand, reporting and aggregation of disease cases takes a lot of time, although the CNHFPC notifies the epidemic status of nationally notifiable infectious diseases every month. Taking the year 2015 as an example, notification came on average 11·5 days after the last day of the month in which the diseases occurred. On the other hand, information about infectious diseases in smaller regions cannot be obtained, since the notification data concerning infectious diseases refer to the whole nation.
Internet queries are a precious resource that can be used to detect epidemics and should receive the attention of public health departments [Reference Ginsberg12]. The official collection and announcement of epidemics have certain organizational procedures that result in low disclosure efficiency. In the USA, the prediction of seasonal flu by internet search is 1–2 weeks faster than the official announcement [Reference Ginsberg12]; in China, prediction of HFMD epidemics by Baidu queries is about 10 days faster than official disclosure, and use of the former could mitigate the low efficiency in the bureaucratic hierarchy, as well as track and predict more flexibly in space and time at lower surveillance cost. It is suggested that the Chinese government should develop a supplementary system for HFMD prediction using Baidu queries, to augment the traditional surveillance system.
ACKNOWLEDGEMENTS
This work was supported by the National Natural Science Foundation of China (Grant No. 71573202) and the Natural Science Foundation of Shaanxi Province (Grant No. 2015JM7365).
DECLARATION OF INTEREST
None.