Hepatitis E is a new zoonotic disease caused by the hepatitis E virus (HEV). The clinical manifestations of hepatitis E are similar to those of hepatitis A, such as fatigue, anorexia, and jaundice, but the severity of symptoms and mortality of hepatitis E are higher than those of hepatitis A [Reference Goel and Aggarwal1]. Humans are generally susceptible to HEV, but the virus mainly infects people aged 15–60 years. The mortality rate of the general population is from 1% to 3%, and the mortality rate of pregnant patients is from 5% to 25%. It can also cause neonatal hepatitis E or even death by vertical transmission [Reference Jin2].
Hepatitis E infection has a global distribution, but mainly in India, China, Pakistan, Mexico, and some other countries in Asia and Africa [Reference Kamar3]. In the past decade, the occasional outbreaks of hepatitis E are on the rise in some high-income countries. A growing number of local sporadic cases of hepatitis E that the route of infection cannot be determined are threatening human health [Reference Capai, Charrel and Falchi4]. According to estimates from the study on the global burden of hepatitis E, there are approximately 20 million hepatitis E infections each year, resulting in more than 3 million symptomatic hepatitis E cases, and 55 000 hepatitis E-related deaths, which makes it an important public health concern [Reference Blum5].
At present, there are mainly four genotypes of HEV. Genotypes 1 and 2 cause interpersonal outbreaks or epidemics, whereas genotypes 3 and 4 mainly infect several species of mammals (e.g. pigs, sheep, etc.), but at the same time, they also infect humans under certain conditions, causing sporadic hepatitis E [Reference Dalton and Izopet6, Reference Nelson, Labrique and Kmush7]. The hepatitis E virus is mainly transmitted by contaminated water and food through the fecal-oral route and it is verified that meteorological factors are related to the incidence of hepatitis E, as climate change will influence the environment, which may affect the quality of water and food [Reference Chen8–Reference Lake10].
In China, the areas with the highest incidence of hepatitis E are mainly in the northwest and the east, and Jiangsu Province is one of the areas [Reference Sun11]. Therefore, predicting the incidence of hepatitis E is quite indispensable. However, the existing disease surveillance information management system lacks an effective prediction and early warning mechanism. The establishment of a scientific, dependable, robust mathematical model can effectively solve this problem.
Currently, most researchers utilised the autoregressive integrated moving average (ARIMA) model to predict the incidence of hepatitis E [Reference Liu12, Reference Hu, Zu and Peng13]. However, the result might be unsatisfactory due to data linearity requirements. To tackle this problem, non-linear machine learning models, including support vector machine (SVM) [Reference Guo14], gradient boosting decision tree (GBDT) [Reference Peng15], back-propagation neural networks (BPNN) [Reference Ren16] and long short-term memory (LSTM) [Reference Guo14], are adopted to the prediction and early warning of hepatitis E. At present, the state-of-the-art model is LSTM used by Guo et al. [Reference Guo14]. This model can not only accurately capture the features of sequential data, but also effectively avoid the problems of vanishing gradients and exploding gradients on traditional recurrent neural networks. However, Guo et al. only use the past monthly incidence of hepatitis E to predict the incidence for the next month, and it cannot correct the current prediction with the input of the next time point. Besides, meteorological factors are not considered in their model [Reference Peng15]. These cause that the proposed LSTM model has much room for improvement.
Therefore, taking the above-mentioned problems, we propose a new Bi-LSTM model with various meteorological factors in this paper to predict the incidence of hepatitis E. Meanwhile, we compare our proposed model with existing models using ARIMA, SVM, LSTM and GBDT, aiming to provide the scientific basis for more effective hepatitis E incidence prediction, which also facilitates the development of the early warning system and the prevention strategies for hepatitis E in Jiangsu Province.
Method
Data source
The hepatitis E data used in this paper are collected from January 2005 to March 2017 by Jiangsu Provincial Center for Disease Control and Prevention. This dataset records the monthly incidence of hepatitis E in Jiangsu Province (P. R. China). Annual data of the demographic are obtained from Jiangsu Statistical Yearbook. The meteorological dataset is abstracted from the Jiangsu Meteorological Service Center, which contains the statistical data of 24 meteorological stations. We take the average value of the meteorological data observed by each station as the predicting value of monthly meteorological data.
The Bi-LSTM model
Model structure
As shown in Figure 1, four layers are constructed in the Bi-LSTM model, which are the Input Layer, Bi-LSTM Layer, Fully Connected Layer and Output Layer, respectively. The input of the model includes monthly feature vectors in the past, each of which is composed of monthly incidence of hepatitis E, monthly average temperature, monthly average water vapour pressure, etc. Such sequential vectors are entered into the Bi-LSTM Layer, which can optimise the prediction from both the previous input and the following data. This characteristic makes our model more robust and transferable. After the vector output by Bi-LSTM passes through the fully connected layer, the output result of the hepatitis E incidence rate of the current month can be obtained.
Model prediction
After preprocessing, we take the monthly incidence of hepatitis E and meteorological factors as monthly feature vectors. For each vector xt, we use a Min-Max-Scaler to normalise all dimensions. After setting the timestep T, we use previous T months' feature vectors to predict the incidence of hepatitis E for the current month. The Bi-LSTM Layer includes two steps. The input sequence is entered into the LSTM cells in the forward step, and after that, the reverse form of the input sequence is fed to other LSTM cells, which is called the backward step. $\vec {\boldsymbol h}_t$ and ${\vskip -7pt {{ \leftarrow}}}\hskip -5pt {\boldsymbol h}_t$ are used to represent the output in each step. The output of the Bi-LSTM layer is denoted as ht.
During the forward step, the input sequence is fed to the LSTM cells, each of which consists of three gates, as shown in Figure 2. The input gate generates a value it between 0 and 1 to determine how much new information needs to be retained. The forget gate generates a value ft between 0 and 1 to decide how much information should be neglected from the previous memory. With current input xt ∈ ℝN×1 and previous state $\vec{{\boldsymbol h}}_{t-1}$, we get the candidate for new information $\tilde{{\boldsymbol C}}_t$ and the new state Ct, where N is the size of features. The output gate generates a value ot between 0 and 1 to determine how much information in the cell state will make sense, and finally gets the output information $\vec{{\boldsymbol h}}_t$ of the cell. The inherent logic of a LSTM cell is described by the following six equations.
Wi, Wf, WC, Wo ∈ ℝu×2N represent the weight matrices, u is the hidden size of the LSTM layer, bi, bf, bC, bo ∈ ℝu represent the bias vectors, ${\parallel}$ means vector concatenation, σ is the sigmoid function, and tanh is the hyperbolic tangent function. Note that the backward equation can be derived similarly by replacing $\vec{{\boldsymbol h}}$ with ${\vskip -7pt {{ \leftarrow}}}\hskip -5pt {\boldsymbol h}$.
After these two steps, the result of the Bi-LSTM layer is calculated by the following equation:
ht represents the output of the Bi-LSTM layer, c 1 and c 2 are weights of two steps respectively.
For the Fully Connected Layer, we have:
where a is the output vector, ${\vector W}$ is the weight matrix between the Bi-LSTM layer and the MLP Layer, H is the concatenation of h1, …, hT, b is the bias vector, and σ represents the activation function, which the sigmoid function is used in this layer. All of the neurons are fed into the Output Layer and this Layer sums all of the information by this equation:
where $\hat{Y}$ is the final result, a is the output of the Fully Connected Layer and w is the weight vector between the Fully Connected Layer and the Output Layer, b is the bias value, and σ represents the activation function which is also the sigmoid function.
Model evaluation
The ARIMA, GBDT, SVM, LSTM, and the Bi-LSTM models are adopted in this study. The data from January 2009 to September 2014 are used as the training set to fit models, and data from October 2014 to March 2017 are used as the testing set to evaluate the prediction accuracy of different models. We use three standards, mean absolute per cent error (MAPE), root mean square error (RMSE) and mean absolute error (MAE), to estimate the results, compare the performance of these three models, and evaluate the influence of each meteorological factor. RMSE represents the sample standard deviation of the difference between the predicted value and the observed value. When the predicted value is completely consistent with the true value, i.e., RMSE is equal to 0, it is a perfect model. MAPE and MAE are also used to evaluate the model, and the less the value, the more accurate the model. The formula of each value is shown below.
where n represents the number of months, y i and $\hat{y}_i$ are the true incidence and the observed incidence of the i-th month, respectively.
Statistical software
All statistical analyses are performed using Python software version 3.5.0. The ARIMA model is built using the pmdarima library, and the LSTM and the BiLSTM model are built using the tensorflow library.
Results
General description
A total of 44 923 cases of hepatitis E are detected in Jiangsu Province, from January 2005 to March 2017. The average monthly incidence rate is 0.35 per 100 000 persons in Jiangsu Province, as shown in Figure 3. The monthly incidence rate of hepatitis E varied seasonally, peaking in January through March.
Model fitting
Bi-LSTM model
For the hyperparameter setting, the timestep is set to 2. The test scale is set to 30 months. The hidden neuron is set to 6. The epoch is set to 128. The batch size is set to 32. The optimizer is set to Adam, and the loss function is set to CrossEntropy. These are the optimal Hyper-parameters when using the monthly incidence of hepatitis E from January 2005 to September 2014 as the training set in the Bi-LSTM model. The models ARIMA, GBDT, SVM, LSTM and Bi-LSTM are employed to predict the monthly incidence of hepatitis E from October 2014 to March 2017. The comparison of three metrics of the models is shown in Table 1.
Bi-LSTM model + meteorological factors
Meteorological factors of temperature, atmosphere, water vapour pressure, rainfall, wind speed and humidity are included in the Bi-LSTM Model. Meteorological factors of temperature + water vapour Pressure + rainfall as a combination in the Bi-LSTM Model is the optimal among the 63 combinations. Table 2 shows the top 15 combinations. The comparison of three metrics of the models is shown in Table 3 and Figure 4, demonstrating the observed incidence curve and predicting curves of the models.
Besides, we also compare the predictive intervals of all the models mentioned in Figure 4. For the neural network models, predictive intervals are also possible by adding dropout layers. We show the 95% CI of the ARIMA model and the result of the neural network models with different dropout layers in Figure 5.
Discussion
Accurately understanding the epidemic trend in advance is essential to the prevention and control of infectious diseases. Hepatitis E is considered an infectious disease mainly confined to areas with poor sanitation and contaminated drinking water supplies. However, as it is also a zoonotic disease and some transmission modes are unknown, more cases have occurred in non-endemic areas including Jiangsu Province, China. Research on its epidemic pattern has drawn extensive attention in recent years, and some researchers have proposed different prediction methods for hepatitis E. For instance, Wang et al. [Reference Wang17] use the ARIMA model, Ren et al. [Reference Ren16] explore a mixture model using the ARIMA and the back-propagation artificial neural network, and Guo et al. [Reference Guo14] adopt the SVM and the LSTM, and Peng et al. [Reference Peng15] develop the machine ensemble learning methods, including GBDT and random forest.
This paper attempts to establish prediction models of different types and different complexity by using the monthly incidence rate of hepatitis E from January 2005 to March 2017, including ARIMA, SVM, LSTM, GBDT and Bi-LSTM models (the original Bi-LSTM model and the Bi-LSTM model with meteorological factors). Experimental results show that our Bi-LSTM model with meteorological factors is significantly superior to other models in predicting the monthly incidence of hepatitis E in Jiangsu province. In the prospective prediction stage, its RMSE is less than 0.05, MAPE is less than 20%, and MAE is less than 0.04. The seasonal fluctuation of hepatitis E in the next 30 months is accurately estimated. In this study, when we add the number of layers in the FC layer, the effect does not improve, thus we only used one layer in FC. The decision of batch size depends on the device we use, and with the increase of batch size, the data that the device can compute each time also increases. The number of iterations also depends on the dataset. We find that the model converges at around 220 iterations, so we set iteration to 220. The results also illustrate that when the hyperparameter: timestep is set to 2, the model has the best accuracy. It is consistent with the average incubation period of hepatitis E being near one month [Reference Shrestha18].
A large number of studies have shown that infectious diseases are sensitive to climate [Reference Xiang19–Reference Semenza and Menne21]. The climate factors may affect the survival and transmission of infectious disease pathogens in the environment, the host susceptibility and exposure opportunities. In recent years, the influence of meteorological factors such as humidity, temperature and rainfall on the epidemic of hepatitis E has attracted extensive attention [Reference Guo14, Reference Ren16, Reference Johne22]. However, the model does not show satisfactory performance. In this study, for the establishment of the model, we introduce Bi-LSTM, a model which can capture useful features from both sides. When predicting the incidence rate of a certain month, such a situation is likely to occur that the number of patients in the previous month is small, but the climate conditions of the current month are suitable for the spread of the virus, thus the number of patients increases this month. Based on the characteristics of hepatitis E, we estimate that the number of patients next month will also increase to a certain extent. At this time, we can use the input at the next moment to correct the current prediction. For these common meteorological factors, we compare the influence of their combination and find that the most influential group is using temperature, water vapour pressure and rainfall. As a result, our model has indeed achieved the state-of-the-art performance in predicting the monthly incidence of hepatitis E.
Conclusion
In this paper, we propose a new Bi-LSTM model with various meteorological factors to predict the monthly incidence of hepatitis E in Jiangsu Province, China, and compared it with existing models using ARIMA, SVM, LSTM and GBDT. The Bi-LSTM model with the meteorological factors of temperature, water vapour pressure, and rainfall can fully extract the linear and non-linear information from the incidence data of hepatitis E and has made significant improvements in interpretability, learning ability, generalisability and prediction accuracy.
Acknowledgements
We are grateful to the staff of medical institutions at all levels and municipal and county-level Center for Disease Control and Prevention centres for their valuable assistance in coordinating data collection.
Author contributions
X.Q. C. and M.H. W. conceived and designed the study, and performed the analysis and wrote the manuscript. W.D. L., X.F. Z., T.X. W., C.J. B. contributed to the revision of the manuscript draft. All authors read and approved the final manuscript.
Financial support
This study is supported by Jiangsu Province Science & Technology Demonstration Project for Emerging Infectious Diseases Control and Prevention (No.BE2015714), Key Medical Discipline of Epidemiology (No. ZDXKA2016008), the National Natural Science Foundation of China (No. 62006040), the Project for the Doctor of Entrepreneurship and Innovation in Jiangsu Province (No. JSSCBS20210126), the Fundamental Research Funds for the Central Universities, China and ZhiShan Young Scholar Program of Southeast University.
Conflict of interest
The authors declare that they have no conflict of interests.
Consent for publication
Not applicable.
Ethical standards
This work is part of the routine duties of China's Jiangsu Provincial Center for Disease Control and Prevention. Therefore, institutional review and informed consent are not claimed. All analysed data are anonymous.
Data availability statement
Data supporting the conclusions of this article are included within the article.