Predicting incidence of hepatitis E using machine learning in Jiangsu Province, China

Xiaoqing Cheng; Wendong Liu; Xuefeng Zhang; Minghao Wang; Changjun Bao; Tianxing Wu

doi:10.1017/S0950268822001303

Published online by Cambridge University Press: 28 July 2022

Xiaoqing Cheng: Affiliation:
Jiangsu Provincial Centre for Disease Control and Prevention (Jiangsu Institution of Public health), Nanjing, Jiangsu, China; Chinese Field Epidemiology Training Program, Chinese Center for Disease Control and Prevention, Beijing, China
Wendong Liu: Affiliation:
Jiangsu Provincial Centre for Disease Control and Prevention (Jiangsu Institution of Public health), Nanjing, Jiangsu, China
Xuefeng Zhang: Affiliation:
Jiangsu Provincial Centre for Disease Control and Prevention (Jiangsu Institution of Public health), Nanjing, Jiangsu, China
Minghao Wang*: Affiliation:
School of Computer Science and Engineering, Southeast University, Nanjing, China
Changjun Bao*: Affiliation:
Jiangsu Provincial Centre for Disease Control and Prevention (Jiangsu Institution of Public health), Nanjing, Jiangsu, China
Tianxing Wu*: Affiliation:
School of Computer Science and Engineering, Southeast University, Nanjing, China
*: Authors for correspondence: Tianxing Wu, E-mail: [email protected]; Minghao Wang, E-mail: [email protected]; Changjun Bao, E-mail: [email protected]

Abstract

Hepatitis E is an increasingly serious public health problem worldwide and has attracted extensive attention. Accurate prediction of hepatitis E incidence is needed to plan future medical care. In this study, we developed a Bi-LSTM model that incorporates meteorological factors to predict the incidence of hepatitis E. The hepatitis E data used in this study were collected from January 2005 to March 2017 by Jiangsu Provincial Center for Disease Control and Prevention. ARIMA, GBDT, SVM, LSTM and Bi-LSTM models are adopted. The data from January 2005 to September 2014 are used as the training set to fit the models, and the data from October 2014 to March 2017 are used as the testing set to evaluate their prediction accuracy. Models are selected and evaluated on the basis of mean absolute per cent error (MAPE), root mean square error (RMSE) and mean absolute error (MAE). A total of 44 923 cases of hepatitis E were detected in Jiangsu Province from January 2005 to March 2017, and the average monthly incidence rate was 0.35 per 100 000 persons. Incorporating temperature, water vapour pressure and rainfall in combination into the Bi-LSTM model achieved state-of-the-art performance in predicting the monthly incidence of hepatitis E, with an RMSE of 0.044, a MAPE of 11.88% and an MAE of 0.0377. The Bi-LSTM model with these meteorological factors can fully extract the linear and non-linear information in the hepatitis E incidence data, and significantly improves interpretability, learning ability, generalisability and prediction accuracy.

Keywords

Forecast; hepatitis E; mathematical model

Type: Original Paper
Information: Epidemiology & Infection, Volume 150, 2022, e149

DOI: https://doi.org/10.1017/S0950268822001303
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright: Copyright © The Author(s), 2022. Published by Cambridge University Press

Hepatitis E is a new zoonotic disease caused by the hepatitis E virus (HEV). Its clinical manifestations, such as fatigue, anorexia and jaundice, are similar to those of hepatitis A, but its symptoms are more severe and its mortality is higher [1]. Humans are generally susceptible to HEV, but the virus mainly infects people aged 15–60 years. The mortality rate is 1% to 3% in the general population and 5% to 25% in pregnant patients. Vertical transmission can also cause neonatal hepatitis E or even death [2].

Hepatitis E infection is distributed globally but occurs mainly in India, China, Pakistan, Mexico and some other countries in Asia and Africa [3]. In the past decade, occasional outbreaks of hepatitis E have been on the rise in some high-income countries. A growing number of local sporadic cases for which the route of infection cannot be determined are threatening human health [4]. According to estimates of the global burden of hepatitis E, there are approximately 20 million hepatitis E infections each year, resulting in more than 3 million symptomatic cases and 55 000 hepatitis E-related deaths, which makes it an important public health concern [5].

At present, there are mainly four genotypes of HEV. Genotypes 1 and 2 cause person-to-person outbreaks or epidemics, whereas genotypes 3 and 4 mainly infect several species of mammals (e.g. pigs and sheep) but also infect humans under certain conditions, causing sporadic hepatitis E [6, 7]. HEV is mainly transmitted through contaminated water and food via the faecal-oral route. Meteorological factors have been shown to be related to the incidence of hepatitis E, because climate change influences the environment, which may in turn affect the quality of water and food [8–10].

In China, the areas with the highest incidence of hepatitis E are mainly in the northwest and the east, and Jiangsu Province is one of them [11]. Predicting the incidence of hepatitis E is therefore essential. However, the existing disease surveillance information management system lacks an effective prediction and early warning mechanism. Establishing a scientific, dependable and robust mathematical model can address this problem.

Currently, most researchers use the autoregressive integrated moving average (ARIMA) model to predict the incidence of hepatitis E [12, 13]. However, the results may be unsatisfactory because of its data linearity requirements. To tackle this problem, non-linear machine learning models, including support vector machine (SVM) [14], gradient boosting decision tree (GBDT) [15], back-propagation neural networks (BPNN) [16] and long short-term memory (LSTM) [14], have been applied to the prediction and early warning of hepatitis E. At present, the state-of-the-art model is the LSTM used by Guo et al. [14]. This model not only captures the features of sequential data accurately, but also avoids the vanishing and exploding gradient problems of traditional recurrent neural networks. However, Guo et al. use only the past monthly incidence of hepatitis E to predict the incidence for the next month, and their model cannot correct the current prediction with the input of the next time point. Besides, meteorological factors are not considered in their model [15]. Consequently, the LSTM model leaves much room for improvement.

To address the above-mentioned problems, we propose a new Bi-LSTM model with various meteorological factors to predict the incidence of hepatitis E. We also compare our proposed model with existing models using ARIMA, SVM, LSTM and GBDT, aiming to provide a scientific basis for more effective hepatitis E incidence prediction and to facilitate the development of an early warning system and prevention strategies for hepatitis E in Jiangsu Province.

Method

Data source

The hepatitis E data used in this paper were collected from January 2005 to March 2017 by Jiangsu Provincial Center for Disease Control and Prevention. This dataset records the monthly incidence of hepatitis E in Jiangsu Province (P. R. China). Annual demographic data are obtained from the Jiangsu Statistical Yearbook. The meteorological dataset is obtained from the Jiangsu Meteorological Service Center and contains the statistical data of 24 meteorological stations. For each month, we take the average of the values observed across the stations as the monthly meteorological value used for prediction.
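The station-averaging step can be illustrated with a minimal sketch. The station readings below are hypothetical, and skipping missing readings (`None`) is our assumption; the source does not say how gaps are handled.

```python
from statistics import mean


def average_over_stations(station_readings):
    """Average one month's readings across stations, skipping missing values (None)."""
    valid = [v for v in station_readings if v is not None]
    if not valid:
        raise ValueError("no valid station readings for this month")
    return mean(valid)


# Hypothetical monthly mean temperatures (deg C) from three stations for two months.
monthly_temperature = {
    "2005-01": [2.0, 3.0, 4.0],
    "2005-02": [4.5, None, 5.5],
}

# One province-level value per month, as used for the model input.
provincial = {
    month: average_over_stations(values)
    for month, values in monthly_temperature.items()
}
```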

The Bi-LSTM model

Model structure

As shown in Figure 1, the Bi-LSTM model consists of four layers: the Input Layer, the Bi-LSTM Layer, the Fully Connected Layer and the Output Layer. The input of the model is a sequence of past monthly feature vectors, each composed of the monthly incidence of hepatitis E, the monthly average temperature, the monthly average water vapour pressure, etc. These sequential vectors are fed into the Bi-LSTM Layer, which can optimise the prediction using both the preceding and the following inputs. This characteristic makes our model more robust and transferable. After the vector output by the Bi-LSTM Layer passes through the Fully Connected Layer, the predicted hepatitis E incidence rate of the current month is obtained.

Fig. 1. The structure of our proposed model.

Model prediction

After preprocessing, we take the monthly incidence of hepatitis E and the meteorological factors as monthly feature vectors. For each vector x_t, we use a Min-Max scaler to normalise all dimensions. After setting the timestep T, we use the feature vectors of the previous T months to predict the incidence of hepatitis E for the current month. The Bi-LSTM Layer includes two steps. In the forward step, the input sequence is fed into the LSTM cells; in the backward step, the reversed input sequence is fed into a second set of LSTM cells. $\vec{\boldsymbol h}_t$ and $\overleftarrow{\boldsymbol h}_t$ denote the outputs of the forward and backward steps, respectively, and the output of the Bi-LSTM Layer is denoted as h_t.
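The scaling and windowing described above can be sketched in plain Python. The incidence and temperature values are hypothetical, and the helper names `min_max_scale` and `make_windows` are ours, not the authors'.

```python
def min_max_scale(column):
    """Scale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]


def make_windows(features, target, timestep):
    """Pair the previous `timestep` feature vectors with the current month's target."""
    X, y = [], []
    for t in range(timestep, len(features)):
        X.append(features[t - timestep:t])
        y.append(target[t])
    return X, y


# Hypothetical values: monthly incidence (per 100 000) and mean temperature (deg C).
incidence = [0.30, 0.45, 0.50, 0.35, 0.25]
temperature = [2.0, 4.0, 9.0, 15.0, 21.0]

scaled_inc = min_max_scale(incidence)
scaled_temp = min_max_scale(temperature)

# Each monthly feature vector x_t = (incidence, temperature), both scaled.
features = [list(pair) for pair in zip(scaled_inc, scaled_temp)]
X, y = make_windows(features, scaled_inc, timestep=2)
```

With T = 2, five months of data yield three training samples. In practice the scaling bounds would usually be taken from the training months only, so that no information from the test period leaks into the inputs; the source does not state which convention was used.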

During the forward step, the input sequence is fed to the LSTM cells, each of which consists of three gates, as shown in Figure 2. The input gate generates a value i_t between 0 and 1 to determine how much new information is retained. The forget gate generates a value f_t between 0 and 1 to decide how much information from the previous memory is discarded. With the current input x_t ∈ ℝ^(N×1), where N is the number of features, and the previous state $\vec{\boldsymbol h}_{t-1}$, we obtain the candidate for new information $\tilde{\boldsymbol C}_t$ and the new cell state C_t. The output gate generates a value o_t between 0 and 1 to determine how much of the information in the cell state is passed on, and finally yields the output $\vec{\boldsymbol h}_t$ of the cell. The logic of an LSTM cell is described by the following six equations.

$$\boldsymbol{i}_t = \sigma\left(\boldsymbol{W}_i \cdot (\vec{\boldsymbol{h}}_{t-1} \parallel \boldsymbol{x}_t) + \boldsymbol{b}_i\right) \tag{1}$$

$$\boldsymbol{f}_t = \sigma\left(\boldsymbol{W}_f \cdot (\vec{\boldsymbol{h}}_{t-1} \parallel \boldsymbol{x}_t) + \boldsymbol{b}_f\right) \tag{2}$$

$$\tilde{\boldsymbol{C}}_t = \tanh\left(\boldsymbol{W}_C \cdot (\vec{\boldsymbol{h}}_{t-1} \parallel \boldsymbol{x}_t) + \boldsymbol{b}_C\right) \tag{3}$$

$$\boldsymbol{C}_t = \boldsymbol{f}_t \ast \boldsymbol{C}_{t-1} + \boldsymbol{i}_t \ast \tilde{\boldsymbol{C}}_t \tag{4}$$

$$\boldsymbol{o}_t = \sigma\left(\boldsymbol{W}_o \cdot (\vec{\boldsymbol{h}}_{t-1} \parallel \boldsymbol{x}_t) + \boldsymbol{b}_o\right) \tag{5}$$

$$\vec{\boldsymbol{h}}_t = \boldsymbol{o}_t \ast \tanh(\boldsymbol{C}_t) \tag{6}$$

W_i, W_f, W_C, W_o ∈ ℝ^(u×(u+N)) represent the weight matrices (u + N being the length of the concatenated vector), u is the hidden size of the LSTM layer, b_i, b_f, b_C, b_o ∈ ℝ^u represent the bias vectors, $\parallel$ means vector concatenation, σ is the sigmoid function, and tanh is the hyperbolic tangent function. Note that the backward equations can be derived similarly by replacing $\vec{\boldsymbol h}$ with $\overleftarrow{\boldsymbol h}$.

Fig. 2. The structure of an LSTM cell.
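Equations (1) to (6) can be written as a single forward step in NumPy. This is a minimal sketch rather than the authors' TensorFlow implementation: the random weights, the feature count N = 4 and the helper names are illustrative assumptions. Each weight matrix has u + N columns so that it can multiply the concatenation of the previous hidden state (length u) and the current input (length N).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward LSTM step following equations (1)-(6)."""
    z = np.concatenate([h_prev, x_t])                     # h_{t-1} || x_t
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # (1) input gate
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # (2) forget gate
    c_tilde = np.tanh(params["W_C"] @ z + params["b_C"])  # (3) candidate state
    c_t = f_t * c_prev + i_t * c_tilde                    # (4) new cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # (5) output gate
    h_t = o_t * np.tanh(c_t)                              # (6) hidden output
    return h_t, c_t


def init_params(n_features, hidden_size, seed=0):
    """Random illustrative parameters; real values would come from training."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("i", "f", "C", "o"):
        params[f"W_{gate}"] = rng.normal(
            scale=0.1, size=(hidden_size, hidden_size + n_features)
        )
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params


N, u = 4, 6  # N = 4 features is an assumption; u = 6 hidden units as reported
params = init_params(N, u)
h, c = np.zeros(u), np.zeros(u)

# T = 2 months of scaled feature vectors (random stand-ins).
sequence = np.random.default_rng(1).random((2, N))
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)
```

The backward step reuses `lstm_step` with its own parameter set, applied to `sequence[::-1]`.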

After these two steps, the result of the Bi-LSTM layer is calculated by the following equation:

$$\boldsymbol{h}_t = c_1 \vec{\boldsymbol{h}}_t + c_2 \overleftarrow{\boldsymbol{h}}_t \tag{7}$$

h_t represents the output of the Bi-LSTM layer, and c₁ and c₂ are the weights of the forward and backward steps, respectively.

For the Fully Connected Layer, we have:

$$\boldsymbol{a} = \sigma(\boldsymbol{W}\boldsymbol{H} + \boldsymbol{b}) \tag{8}$$

where a is the output vector, W is the weight matrix between the Bi-LSTM Layer and the Fully Connected Layer, H is the concatenation of h₁, …, h_T, b is the bias vector, and σ is the activation function, for which the sigmoid function is used in this layer. All of the neurons are fed into the Output Layer, which aggregates the information as follows:

$$\hat{Y} = \sigma(\boldsymbol{w}\boldsymbol{a} + b) \tag{9}$$

where $\hat{Y}$ is the final result, a is the output of the Fully Connected Layer, w is the weight vector between the Fully Connected Layer and the Output Layer, b is the bias value, and σ is the activation function, which is again the sigmoid function.
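Equations (7) to (9) can be sketched together in NumPy. The directional outputs here are random stand-ins rather than real LSTM outputs, and the dense width `d`, the weights c₁ = c₂ = 0.5 and the function name `bilstm_head` are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bilstm_head(h_fwd, h_bwd, c1, c2, W, b, w, b_out):
    """Combine directional outputs, then apply the dense and output layers."""
    h = c1 * h_fwd + c2 * h_bwd     # (7): weighted sum, shape (T, u)
    H = h.reshape(-1)               # concatenation of h_1, ..., h_T
    a = sigmoid(W @ H + b)          # (8): fully connected layer
    y_hat = sigmoid(w @ a + b_out)  # (9): scalar prediction in (0, 1)
    return float(y_hat)


rng = np.random.default_rng(0)
T, u, d = 2, 6, 8  # timestep and hidden size as reported; d is illustrative

h_fwd = rng.uniform(-1, 1, size=(T, u))  # stand-ins for forward outputs
h_bwd = rng.uniform(-1, 1, size=(T, u))  # stand-ins for backward outputs, aligned by t

y_hat = bilstm_head(
    h_fwd, h_bwd, 0.5, 0.5,
    rng.normal(size=(d, T * u)), np.zeros(d),
    rng.normal(size=d), 0.0,
)
```

Because the output activation is a sigmoid, `y_hat` lies in (0, 1) and is expressed in Min-Max-scaled units; it would be mapped back to an incidence rate as `y_hat * (max - min) + min` using the bounds of the scaler.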

Model evaluation

The ARIMA, GBDT, SVM, LSTM and Bi-LSTM models are adopted in this study. The data from January 2005 to September 2014 are used as the training set to fit the models, and the data from October 2014 to March 2017 are used as the testing set to evaluate the prediction accuracy of the different models. We use three metrics, mean absolute per cent error (MAPE), root mean square error (RMSE) and mean absolute error (MAE), to assess the results, compare the performance of the five models and evaluate the influence of each meteorological factor. RMSE represents the sample standard deviation of the differences between the predicted and observed values. A model is perfect when the predicted values are completely consistent with the true values, i.e. when RMSE equals 0. MAPE and MAE are also used to evaluate the models; the smaller the value, the more accurate the model. The formula for each metric is shown below.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \tag{10}$$

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left\vert\frac{\hat{y}_i - y_i}{y_i}\right\vert \tag{11}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left\vert y_i - \hat{y}_i\right\vert \tag{12}$$

where n represents the number of months, and y_i and $\hat{y}_i$ are the observed (true) incidence and the predicted incidence of the i-th month, respectively.
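The three metrics in equations (10) to (12) translate directly into code. The sketch below uses only the standard library, and the observed and predicted values are hypothetical.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error, equation (10)."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute per cent error, equation (11); undefined if any observed value is 0."""
    return 100.0 * sum(abs((p - y) / y) for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error, equation (12)."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical monthly incidence rates (per 100 000 persons).
observed = [0.40, 0.50, 0.20]
predicted = [0.30, 0.50, 0.30]

scores = {
    "RMSE": rmse(observed, predicted),
    "MAPE": mape(observed, predicted),
    "MAE": mae(observed, predicted),
}
```

For these values the relative errors are 25%, 0% and 50%, so the MAPE is 25%. The example also shows why MAPE penalises errors in low-incidence months more heavily than MAE does: the same absolute error of 0.10 counts twice as much when the observed rate is 0.20 as when it is 0.40.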

Statistical software

All statistical analyses are performed using Python version 3.5.0. The ARIMA model is built using the pmdarima library, and the LSTM and Bi-LSTM models are built using the tensorflow library.

Results

General description

A total of 44 923 cases of hepatitis E were detected in Jiangsu Province from January 2005 to March 2017. The average monthly incidence rate is 0.35 per 100 000 persons, as shown in Figure 3. The monthly incidence rate of hepatitis E varies seasonally, peaking from January to March.

Fig. 3. The incidence of hepatitis E in Jiangsu Province from 01.2005 to 03.2017.

Model fitting

Bi-LSTM model

For the hyperparameter setting, the timestep is set to 2, the test scale to 30 months, the number of hidden neurons to 6, the number of epochs to 128 and the batch size to 32. The optimizer is Adam, and the loss function is CrossEntropy. These are the optimal hyperparameters when the monthly incidence of hepatitis E from January 2005 to September 2014 is used as the training set in the Bi-LSTM model. The ARIMA, GBDT, SVM, LSTM and Bi-LSTM models are employed to predict the monthly incidence of hepatitis E from October 2014 to March 2017. The comparison of the three metrics across the models is shown in Table 1.

Table 1. Results of five models for monthly incidence of hepatitis E prediction

Bi-LSTM model + meteorological factors

Meteorological factors of temperature, atmosphere, water vapour pressure, rainfall, wind speed and humidity are included in the Bi-LSTM model. Among the 63 combinations of these factors (the 2⁶ − 1 non-empty subsets of six factors), temperature + water vapour pressure + rainfall is optimal. Table 2 shows the top 15 combinations. The comparison of the three metrics across the models is shown in Table 3, and Figure 4 shows the observed incidence curve and the prediction curves of the models.

Fig. 4. Plot of observed monthly incidence of hepatitis E and predicted values via different models.

Table 2. Combinations of meteorological factors, ascending by RMSE

Table 3. Results of six models for monthly incidence of hepatitis E prediction
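The search over factor combinations can be sketched as an exhaustive enumeration of non-empty subsets. In the sketch below, `toy_score` is a placeholder and not a real metric; in an actual run it would be replaced by a function that trains the Bi-LSTM on the chosen factors and returns its RMSE. The function names are ours.

```python
from itertools import combinations

# Factor names as listed in the text.
FACTORS = [
    "temperature", "atmosphere", "water vapour pressure",
    "rainfall", "wind speed", "humidity",
]


def all_factor_subsets(factors):
    """Yield every non-empty subset of the candidate factors."""
    for k in range(1, len(factors) + 1):
        yield from combinations(factors, k)


def rank_subsets(factors, score_fn):
    """Return (score, subset) pairs sorted ascending by score, e.g. by RMSE."""
    return sorted((score_fn(subset), subset) for subset in all_factor_subsets(factors))


def toy_score(subset):
    """Placeholder only: stands in for 'train on this subset and return RMSE'."""
    return len(subset)


subsets = list(all_factor_subsets(FACTORS))
ranked = rank_subsets(FACTORS, toy_score)
```

Six factors give 6 + 15 + 20 + 15 + 6 + 1 = 63 subsets, which matches the number of combinations reported, so an exhaustive search remains cheap for a model of this size.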

Besides, we compare the predictive intervals of all the models shown in Figure 4. For the neural network models, predictive intervals can be obtained by adding dropout layers. Figure 5 shows the 95% CI of the ARIMA model and the results of the neural network models with different dropout layers.

Fig. 5. Predictive Intervals of (a) ARIMA model (b) LSTM model (c) BiLSTM model (d) BiLSTM model with best meteorological factors.
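One common way to obtain predictive intervals from dropout is to keep dropout active at prediction time and repeat the forward pass many times (often called Monte Carlo dropout). The source does not specify the exact procedure, so the sketch below is an assumption, and it uses a toy linear model with hypothetical parameters in place of the Bi-LSTM.

```python
import random


def predict_with_dropout(x, weights, bias, drop_rate, rng):
    """One stochastic forward pass of a toy linear model with inverted dropout on inputs."""
    keep = 1.0 - drop_rate
    total = bias
    for xi, wi in zip(x, weights):
        if rng.random() < keep:
            total += wi * xi / keep  # rescale kept units so the expectation is unchanged
    return total


def predictive_interval(samples, level=0.95):
    """Empirical central interval from a list of sampled predictions."""
    ordered = sorted(samples)
    tail = (1.0 - level) / 2.0
    lo_idx = int(tail * (len(ordered) - 1))
    hi_idx = int(round((1.0 - tail) * (len(ordered) - 1)))
    return ordered[lo_idx], ordered[hi_idx]


rng = random.Random(0)
x = [0.4, 0.7, 0.2]                    # hypothetical scaled features
weights, bias = [0.5, 0.3, 0.2], 0.05  # hypothetical trained parameters

samples = [
    predict_with_dropout(x, weights, bias, drop_rate=0.2, rng=rng)
    for _ in range(1000)
]
lower, upper = predictive_interval(samples)
```

A higher `drop_rate` spreads the sampled predictions further apart and therefore widens the interval, which is one reason results can differ between dropout settings.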

Discussion

Understanding the epidemic trend accurately and in advance is essential to the prevention and control of infectious diseases. Hepatitis E is considered an infectious disease mainly confined to areas with poor sanitation and contaminated drinking water supplies. However, as it is also a zoonotic disease and some transmission modes are unknown, more cases have occurred in non-endemic areas including Jiangsu Province, China. Research on its epidemic pattern has drawn extensive attention in recent years, and researchers have proposed different prediction methods for hepatitis E. For instance, Wang et al. [17] use the ARIMA model, Ren et al. [16] explore a mixture model using ARIMA and a back-propagation artificial neural network, Guo et al. [14] adopt SVM and LSTM, and Peng et al. [15] develop ensemble learning methods, including GBDT and random forest.

This paper establishes prediction models of different types and complexity using the monthly incidence rate of hepatitis E from January 2005 to March 2017, including ARIMA, SVM, LSTM, GBDT and Bi-LSTM models (the original Bi-LSTM model and the Bi-LSTM model with meteorological factors). Experimental results show that our Bi-LSTM model with meteorological factors is significantly superior to the other models in predicting the monthly incidence of hepatitis E in Jiangsu Province. In the prospective prediction stage, its RMSE is less than 0.05, its MAPE is less than 20% and its MAE is less than 0.04, and the seasonal fluctuation of hepatitis E over the following 30 months is accurately estimated. In this study, increasing the number of layers in the FC layer does not improve the results, so we use only one FC layer. The choice of batch size depends on the device used: as the batch size increases, the amount of data that the device computes each time also increases. The number of iterations also depends on the dataset. We find that the model converges at around 220 iterations, so we set the number of iterations to 220. The results also show that the model is most accurate when the timestep hyperparameter is set to 2, which is consistent with the average incubation period of hepatitis E being near one month [18].

A large number of studies have shown that infectious diseases are sensitive to climate [19–21]. Climate factors may affect the survival and transmission of infectious disease pathogens in the environment, as well as host susceptibility and exposure opportunities. In recent years, the influence of meteorological factors such as humidity, temperature and rainfall on hepatitis E epidemics has attracted extensive attention [14, 16, 22]. However, existing models do not show satisfactory performance. In this study, we introduce Bi-LSTM, a model that can capture useful features from both directions of the sequence. When predicting the incidence rate of a certain month, it is likely that the number of patients in the previous month is small but the climate conditions of the current month favour the spread of the virus, so the number of patients increases in the current month. Based on the characteristics of hepatitis E, we expect the number of patients in the following month to increase to a certain extent as well. In this case, we can use the input at the next time point to correct the current prediction. For the common meteorological factors, we compare the influence of their combinations and find that the most influential group consists of temperature, water vapour pressure and rainfall. As a result, our model achieves state-of-the-art performance in predicting the monthly incidence of hepatitis E.

Conclusion

In this paper, we propose a new Bi-LSTM model with various meteorological factors to predict the monthly incidence of hepatitis E in Jiangsu Province, China, and compare it with existing models using ARIMA, SVM, LSTM and GBDT. The Bi-LSTM model with the meteorological factors of temperature, water vapour pressure and rainfall can fully extract the linear and non-linear information from the incidence data of hepatitis E, and achieves significant improvements in interpretability, learning ability, generalisability and prediction accuracy.

Acknowledgements

We are grateful to the staff of medical institutions at all levels and of the municipal and county-level Centers for Disease Control and Prevention for their valuable assistance in coordinating data collection.

Author contributions

X.Q. C. and M.H. W. conceived and designed the study, performed the analysis and wrote the manuscript. W.D. L., X.F. Z., T.X. W. and C.J. B. contributed to the revision of the manuscript draft. All authors read and approved the final manuscript.

Financial support

This study is supported by Jiangsu Province Science & Technology Demonstration Project for Emerging Infectious Diseases Control and Prevention (No.BE2015714), Key Medical Discipline of Epidemiology (No. ZDXKA2016008), the National Natural Science Foundation of China (No. 62006040), the Project for the Doctor of Entrepreneurship and Innovation in Jiangsu Province (No. JSSCBS20210126), the Fundamental Research Funds for the Central Universities, China and ZhiShan Young Scholar Program of Southeast University.

Conflict of interest

The authors declare that they have no conflicts of interest.

Consent for publication

Not applicable.

Ethical standards

This work is part of the routine duties of China's Jiangsu Provincial Center for Disease Control and Prevention. Therefore, institutional review and informed consent were not required. All analysed data are anonymous.

Data availability statement

Data supporting the conclusions of this article are included within the article.

Footnotes

These authors contributed equally to this work.

References

1. Goel, A and Aggarwal, R (2020) Hepatitis E: epidemiology, clinical course, prevention, and treatment. Gastroenterology Clinics of North America 49, 315–330.

2. Jin, H et al. (2016) Case-fatality risk of pregnant women with acute viral hepatitis type E: a systematic review and meta-analysis. Epidemiology and Infection 144, 2098–2106.

3. Kamar, N et al. (2017) Hepatitis E virus infection. Nature Reviews Disease Primers 3, 17086.

4. Capai, L, Charrel, R and Falchi, A (2018) Hepatitis E in high-income countries: what do we know? And what are the knowledge gaps? Viruses 10, 285.

5. Blum, HE (2016) History and global burden of viral hepatitis. Digestive Diseases (Basel, Switzerland) 34, 293–302.

6. Dalton, HR and Izopet, J (2018) Transmission and epidemiology of hepatitis E virus genotype 3 and 4 infections. Cold Spring Harbor Perspectives in Medicine 8, a032144.

7. Nelson, KE, Labrique, AB and Kmush, BL (2019) Epidemiology of genotype 1 and 2 hepatitis E virus infections. Cold Spring Harbor Perspectives in Medicine 9, a031732.

8. Chen, YJ et al. (2016) Epidemiological investigation of a tap water-mediated hepatitis E virus genotype 4 outbreak in Zhejiang Province, China. Epidemiology and Infection 144, 3387–3399.

9. Wenjing, Y, Canming, Z and Cailin, C (2018) Analysis of the association between intestinal infectious diseases and climate factors of Fujian Province in 2006–2015. Medical Theory and Practice 31, 3333–3337.

10. Lake, IR (2017) Food-borne disease and climate change in the United Kingdom. Environmental Health: A Global Access Science Source 16(suppl. 1), 117.

11. Sun, XJ et al. (2019) Epidemiological analysis of viral hepatitis E in China, 2004–2017. Zhonghua yu fang yi xue za zhi [Chinese Journal of Preventive Medicine] 53, 382–387.

12. Liu, K et al. (2016) Identification of distribution characteristics and epidemic trends of hepatitis E in Zhejiang Province, China from 2007 to 2012. Scientific Reports 6, 25407.

13. Hu, J, Zu, R and Peng, Z (2011) Application of time series analysis in the prediction of incidence trend of hepatitis E in Jiangsu province. Journal of Nanjing Medical University (Natural Sciences) 31, 1874–1878.

14. Guo, Y et al. (2020) Prediction of hepatitis E using machine learning models. PLoS One 15, e0237750.

15. Peng, T et al. (2020) The prediction of hepatitis E through ensemble learning. International Journal of Environmental Research and Public Health 18, 159.

16. Ren, H et al. (2013) The development of a combined mathematical model to forecast the incidence of hepatitis E in Shanghai, China. BMC Infectious Diseases 13, 421.

17. Wang, YS et al. (2020) Trend analysis and prediction of viral hepatitis incidence in China, 2009–2018. Zhonghua liu xing bing xue za zhi = Zhonghua liuxingbingxue zazhi 41, 1460–1464.

18. Shrestha, MP et al. (2007) Safety and efficacy of a recombinant hepatitis E vaccine. The New England Journal of Medicine 356, 895–903.

19. Xiang, J et al. (2017) Association between dengue fever incidence and meteorological factors in Guangzhou, China, 2005–2014. Environmental Research 153, 17–26.

20. Semenza, JC et al. (2012) Mapping climate change vulnerabilities to infectious diseases in Europe. Environmental Health Perspectives 120, 385–392.

21. Semenza, JC and Menne, B (2009) Climate change and infectious diseases in Europe. The Lancet Infectious Diseases 9, 365–375.

22. Johne, R et al. (2021) Stability of hepatitis E virus at high hydrostatic pressure processing. International Journal of Food Microbiology 339, 109013.

