1. Introduction
An ageing population is a major challenge that many countries face today. The problem arises because fertility rates are declining while life expectancy has been increasing over the past several decades without any sign of slowing down. The adverse financial outcome of people living longer than expected, and hence the possibility of outliving their retirement savings, is known as longevity risk. This long-term demographic risk has significant implications for societies and manifests as a systematic risk for pension plans and annuity providers. Policymakers rely on mortality projections to determine appropriate pension benefits and to understand the cost of different economic assumptions and regulations regarding the retirement age of a given population. For instance, in the United Kingdom and Australia, defined-benefit pension plans prior to the 2000s had limited exposure to the effects of longevity risk, since high equity returns on pension fund wealth management portfolios were masking its impact. Post-2000, however, declining equity returns coupled with a record-low interest rate environment have demonstrated the significance of decades of longevity improvements, posing a very real problem for pension schemes. Furthermore, by regulation, insurers who offer retirement income products are required to hold additional reserving capital to cover longevity risk. A key input in addressing longevity risk is the development of advanced mortality modelling methodology, so that human longevity can be predicted with better accuracy and uncertainties can be accounted for in mortality forecasting.
Since the introduction of the Lee–Carter (LC) model (Lee & Carter, 1992), a range of stochastic mortality models have been proposed in the literature. Renshaw & Haberman (2003) and Renshaw & Haberman (2006) add multiple period effects and a cohort effect to the LC model, capturing the change of mortality with respect to calendar year and year of birth, respectively. Cairns et al. (2006) propose a two-factor period effect mortality model, known as the Cairns–Blake–Dowd (CBD) model, for pensioner ages. A cohort extension of the CBD model is studied in Cairns et al. (2009). Plat (2009) draws on the strengths of the LC model and the CBD model to produce an age–period–cohort model that includes a term to capture young-age mortality dynamics. In these well-known cases it is common practice in actuarial settings to estimate stochastic mortality models based on a singular value decomposition (SVD) approach (Lee & Carter, 1992; Koissi et al., 2006) or via a maximum likelihood-based approach if a discrete Poisson regression setting is considered (Brouhns et al., 2002; Cairns et al., 2009).
A common feature of the estimation methods adopted in the frameworks mentioned above is that the dynamics of the period effect, the stochastic latent process, is not directly incorporated into joint parameter and state estimation, but instead forms a component of a second stage of estimation. Typically this involves specifying a model for the period effect, for forecasting purposes, only after the first-stage estimation is performed. Such approaches often suffer from a lack of statistical efficiency compared to methods that perform joint static model parameter estimation and latent process filtering. Hence, the first argument we make is that recasting different classes of mortality models in a state-space formulation can better facilitate state-space-based inference under either frequentist or Bayesian estimation. This is especially true when the inference is performed jointly on the latent process and the static model parameters, rather than in a less statistically efficient two-stage procedure.
We note that in the classical actuarial approach that utilises the two-stage procedure, a wide variety of model choices may be adopted for the latent stochastic dynamics such as the period effect, cohort effect, etc. These may be time series models such as random walks with drift, seasonal autoregressive integrated moving average (ARIMA) models, long memory processes such as generalized autoregressive moving average (GARMA) models, or conditional volatility models such as the generalized autoregressive conditional heteroscedasticity (GARCH) model. It is up to the modeller to perform the statistical estimation, model selection and model criticism stages, for instance via Box–Jenkins type procedures, to select an appropriate model. In the one-stage estimation framework of the state-space model, exactly the same flexible classes of model are also admissible for the state equation, so there is no loss in model generality when working with the state-space formulation. As we will discuss, for linear Gaussian state-space choices, such as the seasonal ARIMA (SARIMA) class of models, there are optimal state-space estimators based on Kalman filters, whereas for more general classes of non-linear time series models one can use Sequential Monte Carlo (SMC) filtering, as discussed below. Note also that one can always carry out model diagnostics in the state-space framework to decide whether the proposed model is suitable given the data. In addition, there are various formal statistical criteria designed for model ranking as well as model selection which are applicable to the one-stage procedure in the state-space setting.
Studies carried out in practice and in the literature typically consider only mortality data from the past several decades. For many countries, age-specific death rates have evolved rather smoothly, apart from a potential change of trend in the last 50 years or so in some developed countries. Besides ARIMA models, structural change models have been proposed to take into account the trend-changing behaviour of the period effect (Li et al., 2011; van Berkum et al., 2016). Despite this, the implication of including earlier periods that exhibit significant volatility of mortality, which can be attributed to life-critical events such as wars and epidemics, has not yet been investigated. The ability to incorporate such structural information into a mortality model is greatly facilitated when recasting the model in a state-space formulation. Furthermore, a state-space formulation facilitates extensions to mortality models that may better explain the stochastic dynamics of such processes. These include features such as time-varying volatility; cross-sectional volatility between different age groups; extremal dependence features; cohort effects; structural breaks in regimes; long memory or persistence in mortality features in different age groups; cointegration and non-stationarity features; as well as regression-based structures that decompose mortality according to categorical features such as official death causes, regional categories, etc.
Moreover, additional stochastic factor models such as two- and three-factor models can be easily considered. This can be particularly relevant when modelling features such as trends in excess mortality in particular age groups resulting from disease epidemics (Zucs et al., 2005; Dawood et al., 2012), cold and heat waves (Fouillet et al., 2006; Analitis et al., 2008) and other effects such as medical impairments, occupational hazards, hazardous pursuits, geographical location of residence and ethnic origin, see Eloranta et al. (2012) and England & Haberman (1993). Hence, the second argument we make is that all these different model structures can readily be encoded in state-space model structures. Furthermore, they can be consistently combined in joint estimation procedures in either frequentist or Bayesian formulations, whilst also admitting consistent joint forecasting models for predictive purposes.
A variety of state-space model approaches exist across a range of different literatures. In this paper we begin with the widely adopted frameworks typically introduced in state-space modelling settings in Harvey (1989) or West & Harrison (1997), which we develop to address some of the aforementioned issues. In contrast to the SVD and Poisson regression estimation approaches, where the period effect is treated as a parameter without any temporal structure in the first-stage estimation, the period effect is regarded as a latent process with a Markovian structure under the state-space approach. In other words, a state-space formulation permits modelling, estimation and forecasting of mortality under a unified framework. Recent progress in sampling-based techniques has allowed statistical inference to be conducted on sophisticated state-space models that can incorporate multiple latent driving factors which may exhibit non-linear and non-Gaussian stochastic dynamics. We take advantage of this development and utilise realistic models to capture the long-term volatility structure of mortality time series.
Pedroza (2006) and Kogure & Kurachi (2010) consider Bayesian estimation of the LC model in state-space form. A maximum likelihood approach is studied in De Jong & Tickle (2006). Here, we extend such frameworks to show how to adopt a combination of filtering procedures with Rao-Blackwellisation to obtain gradient-based Fisher score equation recursions that accurately and efficiently perform optimal filtering of the latent state process, in the sense of mean square error minimisation, jointly and recursively with least squares estimation of the static model parameters. Furthermore, we extend such state-space models to incorporate non-linear and non-Gaussian features in the state-space structure that no longer admit simple Kalman filter forward-backward recursions, leading us to more cutting-edge filtering techniques based on SMC methods. In this regard, we estimate and examine the LC model with heteroscedasticity using both gradient-based maximum likelihood and Bayesian analysis. Alternative models that have tried to include such features include, for example, the Poisson regression models of Brouhns et al. (2002) and Czado et al. (2005), which replace the homogeneous additive error term in the LC model by a Poisson error structure. We also note a recently developed framework for modelling death counts with common latent risk factors via the credit risk plus methodology, with model estimation via Markov chain Monte Carlo (MCMC), in Hirz et al. (2017). However, we argue that the state-space formulation allows heteroscedasticity to be accounted for in a more straightforward manner.
Note that Pedroza (2006) also considers an LC model with a heteroscedasticity structure. Our approach in this paper, however, is more general than that of Pedroza (2006), as we explain in detail the machinery behind the gradient-based maximum likelihood estimation and sampling-based inference, without confining ourselves to the restricted class of linear and Gaussian state-space models that previous studies focus on. Such an effort is important especially for researchers and practitioners who are interested in more realistic and sophisticated models that fall into the class of non-linear and non-Gaussian models, while not wanting to be handicapped by the complications arising from model estimation.
Through reformulation and extensions of the LC type mortality models in a state-space model structure, we investigate several key properties observed in mortality data. First, the cross-sectional variance–covariance matrix between age group structures is non-homogeneous. Second, examination of mortality data over a long period indicates that the volatility of the evolution of death rates is not constant, that is, heteroscedasticity is present. We show that the incorporation of a second stochastic volatility latent factor allows us to identify the periods in which mortality demonstrates heightened volatility. This aids interpretation and forecasting from such models. Specifically, we introduce a stochastic volatility model for the period effect, aiming to capture long-term mortality dynamics. The state-space framework provides a natural platform to analyse stochastic volatility models (Kim et al., 1998; Chib et al., 2002). In this paper we develop a particle Markov chain Monte Carlo (PMCMC) (Andrieu et al., 2010) Bayesian model formulation in order to estimate the resulting stochastic volatility model of mortality jointly with the other latent stochastic factors and the static model parameters.
We introduce to mortality modelling an estimation framework based around the PMCMC algorithm, which utilises SMC (Doucet et al., 2001; Peters et al., 2012) to obtain the required quantities in Metropolis–Hastings algorithms. This framework has found many applications in a variety of areas, for example, finance (Peters et al., 2013), economics (Flury & Shephard, 2011), non-life insurance (Peters et al., 2010b), risk management (Targino et al., 2015) and computational biology (Golightly & Wilkinson, 2011). We apply this powerful tool to mortality modelling, where it allows us to develop efficient algorithms to estimate a stochastic volatility extension of the LC model.
Recently, there has been growing interest in applying a state-space framework to the pricing and hedging of longevity risk. For example, Kogure & Kurachi (2010) consider the LC model in state-space form and show that a pricing risk premium for longevity instruments can be derived based on the maximum entropy principle. As another example, Liu & Li (2016a) investigate longevity hedging strategies accounting for population basis risk by formulating multi-population mortality models in the state-space framework. Their approach relies crucially on the determination of the sensitivities of the hedge portfolio and the liability with respect to the hidden states of a state-space multi-population mortality model. In another paper, Liu & Li (2016b) address the issue of trend-changing behaviour of mortality rates by considering a locally linear CBD model in state-space representation. In addition, they consider a state-space hedging method including drift risk where the hidden states play an important role. We believe that our general presentation of the state-space framework in this paper, including the machinery such as PMCMC developed in the statistics literature and the extended models we propose, will further facilitate the use of state-space methods in modelling human mortality and in the risk management of longevity risk.
The paper is organised as follows. In section 2 we give an overview of the conventional mortality modelling and estimation methodology in the literature. A state-space approach for mortality modelling is formulated and discussed in section 3. Section 4 is devoted to state-space inference for stochastic mortality models under a frequentist approach. Section 5 focusses on Bayesian inference for dynamic mortality models in the state-space framework. In section 6 we analyse Danish mortality data based on the enhanced models and methodologies proposed in the paper. Section 7 provides concluding remarks.
2. Classical Bayesian and Frequentist Approaches
In this section we first briefly recall some important definitions on mortality modelling. We then review stochastic mortality models that are commonly found in the literature. Standard estimation procedures under frequentist and Bayesian approaches are discussed.
2.1. Definitions and notation
Hereafter, we use the following standard definitions from the actuarial literature on mortality modelling (Dickson et al., 2009; Pitacco et al., 2009). Let T x be a random variable representing the remaining lifetime of a person aged x. The cumulative distribution function and survival function of T x are written as $$_{\tau}q_{x} = P(T_{x} \leq \tau)$$ and $$_{\tau}p_{x} = P(T_{x} > \tau)$$, respectively. For a person aged x, the force of mortality at age x+τ is defined as
Let f x (t) be the density function of T x , then from (1) we have $$_{\tau}q_{x} = \int_0^\tau f_{x}(s)\,ds = \int_0^\tau {}_{s}p_{x}\,\mu_{x+s}\,ds$$. The central death rate for an x-year-old, where $$x\in\mathbb{N}$$, is defined as
which is a weighted average of the force of mortality (here $$q_{x} := {}_{1}q_{x}$$). Under the so-called piecewise constant force of mortality assumption, that is $$\mu_{x+s} = \mu_{x}$$, where 0≤s<1 and $$x\in\mathbb{N}$$, we have, from (2), m x =μ x . Moreover, if a Poisson assumption is made for the actual number of deaths, then the resulting maximum likelihood estimate of the force of mortality $$\hat{\mu}_{x}$$ (and hence $$\hat{m}_{x}$$) is given by $$\hat{\mu}_{x} = D_{x}/E_{x} = \hat{m}_{x}$$, where D x is the number of deaths recorded at age x last birthday and the exposure-to-risk E x is the average number of people aged x last birthday during the observation year. Note that E x is approximated by an estimate of the population aged x last birthday in the middle of the observation year. We refer to $$\hat{m}_{x}$$ as the crude death rate.
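As a small numerical illustration of these definitions, the following Python sketch computes crude death rates from death counts and mid-year exposure estimates (the arrays shown are illustrative values, not real data):

```python
import numpy as np

# Illustrative death counts D_x and mid-year exposures E_x for four ages.
deaths = np.array([412.0, 389.0, 455.0, 530.0])
exposures = np.array([52000.0, 51000.0, 49500.0, 47000.0])

# Crude death rate m_hat_x = D_x / E_x, which under the piecewise-constant
# force of mortality and Poisson assumptions is also the MLE of mu_x.
m_hat = deaths / exposures
log_m_hat = np.log(m_hat)   # log crude death rates y_x used by the LC model
```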
In the above setup it is assumed that the force of mortality μ is deterministic. The stochastic case can be handled by the intensity-based framework where the death time is modelled as the first jump time of a doubly stochastic process (Biffis, 2005). Hereafter, we treat the force of mortality μ x+t (t), the central death rate m x,t and the crude death rate $$\hat{m}_{x,t}$$ as stochastic processes. For a detailed discussion of the background of stochastic mortality modelling in discrete time and continuous time, see Cairns et al. (2008).
2.2. Stochastic mortality models
One of the most widely considered examples of stochastic factor models in the context of mortality modelling is the approach first presented in Lee & Carter (1992), who proposed a stochastic mortality model for the age-specific crude death rate $$\hat{m}_{x,t}$$, where x=x 1, … , x p and t=1, … , T represent age (or age group) and year (time), respectively. Under the model, the dynamics of the log crude death rates, $$y_{x,t} = \ln \hat{m}_{x,t}$$, is given by
where $$N(0,\sigma_{\varepsilon}^{2})$$ denotes a Gaussian distribution with zero mean and variance $$\sigma_{\varepsilon}^{2}$$. The vector $$\boldsymbol{\alpha} = \alpha_{x_1:x_p} := [\alpha_{x_1},\ldots,\alpha_{x_p}]$$ represents the age profile of the log death rates and $$\boldsymbol{\beta} = \beta_{x_1:x_p}$$ measures the sensitivity of death rates for different age groups to a change of the time series κ t . The period effect, κ t , for forecasting purposes, is assumed to satisfy the equation
where ε x,t and ω t are independent.
Under this specification, it is clear that the LC model is not identifiable, since (3) is invariant under the following linear transformations of the parameters:
where $$\tilde{\boldsymbol{\alpha}} = \boldsymbol{\alpha} + \boldsymbol{\beta}c$$, $$\tilde{\boldsymbol{\beta}} = \boldsymbol{\beta}/d$$ and $$\tilde{\kappa}_{t} = (\kappa_{t} - c)d$$.
To overcome this identification issue when estimating the LC model, one has to impose a non-unique choice of constraints to restrict the model to an identifiable class. It is standard practice in actuarial literature to consider the following two constraints:
as suggested in Lee & Carter (1992) to remedy the identifiability issue. This choice of constraints is equivalent to fixing $$c = (1/T)\sum_{t=1}^T \kappa_{t}$$ and $$d = \sum_{x=x_1}^{x_p} \beta_{x}$$. Consequently we have $$\sum_{t=1}^T \tilde{\kappa}_{t} = 0$$ and $$\sum_{x=x_1}^{x_p} \tilde{\beta}_{x} = 1$$. The reason for this particular form of identification constraints is that the constraint on the path space of the stochastic factor κ 1, … , κ T has the effect of centring the κ t values over the range t∈{1, … , T}, so that the structure captures age–period effects with the α x terms incorporating the main age effects, averaged over time, and the bilinear terms β x κ t incorporating the age-specific period trends (relative to the main age effects).
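To make the role of these constraints concrete, the following minimal Python sketch rescales an unconstrained LC parameterisation so that the constraints are satisfied while leaving the fitted log death rates unchanged (the function name and array inputs are illustrative, not part of the original specification):

```python
import numpy as np

def apply_lc_constraints(alpha, beta, kappa):
    """Rescale an unconstrained LC parameterisation (alpha_x, beta_x, kappa_t)
    so that sum_t kappa_t = 0 and sum_x beta_x = 1.  This is the transformation
    alpha~ = alpha + beta*c, beta~ = beta/d, kappa~ = (kappa - c)*d with
    c = mean(kappa) and d = sum(beta); the fit alpha + beta*kappa is unchanged."""
    c = kappa.mean()
    d = beta.sum()
    return alpha + beta * c, beta / d, (kappa - c) * d
```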
Since the introduction of the LC model, this class of factor model has found widespread uptake both in practice, where the LC model is now used as a benchmark methodology by the US Bureau of the Census, and in academia, where a range of stochastic mortality model extensions have been proposed in the literature; see Table 1.
We note here that Renshaw & Haberman (2003) and Renshaw & Haberman (2006) introduce a multi-period term $$\big(\sum_{i=1}^k \beta_{x}^{(i)} \kappa_{t}^{(i)}\big)$$ and a cohort factor $$(\zeta_{t-x})$$, respectively, to the LC model. Currie (2009) considers a simplified version of the model in Renshaw & Haberman (2006). Cairns et al. (2006) propose to model $${\rm logit}(q_{x,t}) := \ln\big(q_{x,t}/(1-q_{x,t})\big)$$ instead of log death rates, where $$\bar{x}$$ is the average age in the sample range. The addition of a cohort factor is studied in Cairns et al. (2009). Plat (2009) introduces a model which combines the desirable features of the previous models and includes a term $$(\bar{x}-x)^{+} := {\rm max}(\bar{x}-x,0)$$ to better capture young-age mortality. The specification of identification constraints for the LC type models, that is for those models in Table 1 where the log death rate is being modelled, is discussed in Hunt & Villegas (2015).
2.3. Two-stage estimation approaches: frequentist view
Several “classical” approaches to LC model estimation have been proposed in the literature, though they typically involve a two-stage procedure, looking first at the observation equation as a regression (ignoring the latent factor structure explicitly) and then, in the second stage, fitting time series models to the latent factor structures. A good overview of such methods is given in Pitacco et al. (2009). This two-stage procedure is at odds with modern state-space modelling procedures, which have been progressively moving towards joint parameter estimation and latent state estimation in frequentist and Bayesian formulations, as will be discussed in subsequent sections. This is reflected in the first attempt to improve the calibration approaches in Cairns et al. (2011), where the authors highlight that the “… key element of the proposed framework is our single-stage approach to model fitting and process parameter estimation”. Such sentiments, relating to consistent single-stage joint estimation, are also echoed in the work of Czado et al. (2005).
2.3.1. Multi-factor LC SVD-based two-stage calibration
One of the most commonly adopted approaches to estimate stochastic mortality models is via SVD. We use the multi-period (k-factor) LC model (Renshaw & Haberman, 2003) with identification constraints given by
where i=1, … , k, as an example to illustrate the methodology below (Koissi et al., 2006).
Stage 1a – Observation Equation Estimation Stage: We first notice that the constraint $$\sum_{t=1}^T \kappa_{t}^{(i)} = 0$$ will lead to an estimator for the level α given by
Stage 1b – Observation Equation Estimation Stage: The next stage is to de-trend the observations { y 1:T } by the level estimate $$\hat{\boldsymbol{\alpha}}$$ and then to perform an SVD on the resulting (p×T) matrix of residual observations to obtain the decomposition:
where ⊤ denotes transposition and ρ i , for i∈{1, … , h}, are the descending singular values, h being the rank of the data matrix. Here μ i and ν i are the corresponding left and right singular vectors of the singular value ρ i with dimension p and T, respectively. For a rank-k approximation of the matrix, where k≤h, we have
where $$\boldsymbol{\varphi}_{1:T} = \sum_{i=k+1}^h \rho_{i} \boldsymbol{\mu}_{i} \boldsymbol{\nu}_{i}^{\top}$$ is the residual of the rank-k approximation. We then identify $$\tilde{\boldsymbol{\beta}}^{(i)} = \boldsymbol{\mu}_{i}$$ and $$\tilde{\boldsymbol{\kappa}}^{(i)} = \rho_{i} \boldsymbol{\nu}_{i}$$, for i=1, … , k. One then performs the transformation
to ensure the constraints $$\sum_x \beta_{x}^{(i)} = 1$$, for i=1, … , k, are satisfied.
Stage 2 – Latent Process Factor Estimation Stage: At this stage, the estimation of the latent factors can be performed by specifying a time series model structure, such as an ARIMA model, for each of the factors:
or alternatively one could fit an equivalent vector autoregressive model structure, not treating each factor as independent in the time series specification. One would typically perform this stage of estimation via the Yule–Walker equations; see for instance the discussion in Tsay & Tiao (1984). Under such specifications, one then obtains closed-form distributions and estimators for the period effect latent factor forecasts, which can be substituted into the observation model to forecast mortality by age over future horizons and used to construct life tables.
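To illustrate Stages 1a, 1b and the simplest instance of Stage 2 (a random walk with drift for a single period effect), a minimal Python sketch is given below; the function names and the assumption that Y is a (p×T) matrix of log crude death rates are illustrative only:

```python
import numpy as np

def fit_lc_svd(Y):
    """Two-stage SVD calibration of the single-factor LC model.
    Y is a (p x T) matrix of log crude death rates y_{x,t}."""
    # Stage 1a: the constraint sum_t kappa_t = 0 gives alpha_x as the
    # time-average of the log death rates.
    alpha = Y.mean(axis=1)
    # Stage 1b: SVD of the de-trended observations; keep the leading term.
    Z = Y - alpha[:, None]
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    beta, kappa = U[:, 0], s[0] * Vt[0, :]
    # Rescale so that sum_x beta_x = 1 (identification constraint).
    d = beta.sum()
    return alpha, beta / d, kappa * d

def fit_rw_drift(kappa):
    """Stage 2 (simplest case): fit a random walk with drift to kappa."""
    dk = np.diff(kappa)
    return dk.mean(), dk.var(ddof=1)   # drift and innovation variance estimates
```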
2.3.2. Regression-based approaches
It is important to note that the SVD approach assumes homoscedasticity in the error structure. Therefore, to account for heteroscedasticity in mortality data for different ages, Brouhns et al. (2002) propose to model death counts, instead of death rates, via Poisson regression, where the additive error term in the LC approach is replaced by Poisson random variation. Specifically, the number of deaths D x,t is modelled as
where E x,t is the death exposure, m x,t (Φ) is a model of the central death rate and Φ is the parameter vector of the model being used, including time dynamic factors such as period and cohort effects, see, for example, Table 1. The parameter vector is then estimated by maximising the log-likelihood function, which is given by
where D x,t ! denotes the factorial of D x,t . Time series models are then used to model the time dynamic factors, forming a second-stage estimation procedure for forecasting purposes. Note that the CBD type models can be estimated under this approach since we have q x,t =1−exp{−m x,t } (Cairns et al., 2009).
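As a concrete illustration of this likelihood for the single-factor LC structure, the following Python sketch evaluates the Poisson log-likelihood for given parameters; the matrices D and E of death counts and exposures, and the function name, are illustrative assumptions:

```python
import numpy as np
from scipy.special import gammaln

def poisson_lc_loglik(alpha, beta, kappa, D, E):
    """Log-likelihood of the Poisson LC model
       D_{x,t} ~ Poisson(E_{x,t} * m_{x,t}),  m_{x,t} = exp(alpha_x + beta_x * kappa_t),
    where D and E are (p x T) matrices of death counts and exposures."""
    log_m = alpha[:, None] + np.outer(beta, kappa)
    mu = E * np.exp(log_m)                              # Poisson means
    # Poisson log pmf: D*log(mu) - mu - log(D!)
    return np.sum(D * np.log(mu) - mu - gammaln(D + 1.0))
```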
Remark 2.1 In all the cases discussed above, the general idea is that the two-stage estimation approaches (SVD and regression) treat the unobserved factors corresponding to, for instance, a period effect κt and a cohort effect ζ t−x as parameters. For forecasting purposes, these dynamic factors are then modelled as time series, typically under the ARIMA framework. In this paper we argue that a more consistent approach involves embedding the specification of the model formally within a state-space model structure and performing the estimation via a joint combination of filtering and static parameter estimation, which can be achieved either in Bayesian (posterior-based) or frequentist (likelihood-based) settings. We will demonstrate both in this paper.
2.4. Estimation approaches: Bayesian view
From the Bayesian modelling perspective there are few papers that study stochastic mortality models; the main works in this area are Czado et al. (2005), Pedroza (2006), Kogure et al. (2009) and Cairns et al. (2011). As observed in these studies, there are many possible advantages to adopting a Bayesian approach for mortality modelling, especially in the context of small populations which may also have substantial quantities of missing data.
An important point to note is that all Bayesian model formulations to date in the mortality modelling literature, that we are aware of, have utilised what would, in modern statistical terms, be considered rudimentary sampling-based approaches to performing Bayesian estimation of the LC type models. The criticism here can be levelled in two ways:
-
1. The first relates to the fact that in these Bayesian formulations the latent dynamic process states are still treated in the MCMC sampling procedures as if they were a set of static model parameters. The issues with doing this have been mentioned in numerous places, see, for example, Carter & Kohn (1994). Recently, new approaches to inference in such Bayesian models have been developed to avoid having to make univariate conjugate Gibbs or Metropolis-within-Gibbs steps for the latent processes. The reason is that this is known in general to be very inefficient and can produce misleading posterior inference results due to poor mixing of the Markov chain for a finite computational budget. Detailed discussions of such problems are provided in Andrieu et al. (2010) and subsequently in work such as Chopin et al. (2013), and for the specific case of population-based state-space models in ecology in Peters et al. (2010a). Pedroza (2006) also follows this more efficient sampling approach for mortality modelling based on Carter & Kohn (1994).
-
2. Second, all existing MCMC sampling-based approaches we are aware of for Bayesian inference in the mortality modelling literature tend to neglect the issue of model identification in the likelihood, which can cause issues in the Bayesian formulation. In fact, some approaches implement identification constraints in the Bayesian model and develop an MCMC sampler that tries to impose the identification constraints in such a manner that the resulting Markov chain may not preserve the correct invariant stationary distribution if the constraints are applied inappropriately. We investigate this issue in a separate paper (Peters et al., 2017).
These two considerations need to be resolved, updating the approaches to more efficient sampling schemes with enhanced specifications of the model formulation that deal with such issues directly. In particular, modern approaches to such model estimation treat the latent unobserved process not as static parameters but as part of a state-space model in which filtering-based methods (Kalman filter variants, SMC) can be utilised for the latent process estimation jointly with consistent estimation of the “static” model parameters. We will detail such estimation procedures, which are also consistent with imposing specific identification constraints of relevance to the LC model formulations, developed to ensure the correct invariant Bayesian posterior model is preserved by the Markov chain sampler and filters developed.
Remark 2.2 (Likelihood Identification Issues and Bayesian Modelling): We note that model parameters that are not identified in the likelihood pose no formal problem in a Bayesian analysis. Identification is a property of the likelihood function, whereas Bayesian inference simply uses the likelihood function to map prior beliefs to posterior beliefs through the data. However, working with unidentified likelihood functions is often unsatisfactory from a practical perspective, as it may lead to partial identification issues or problematic multimodality in the posterior. In general, if one utilises a proper prior distribution it may act to provide “near-identification”: if one considers parameter restrictions as limiting forms of prior densities, then there is at least a functional equivalence between introducing prior information about parameters and imposing identifying restrictions.
3. State-Space Formulations of Mortality Models
We are now in a position to present an alternative representation of stochastic mortality modelling based on state-space methodology (Harvey, 1989; West & Harrison, 1997). A key advantage of this approach is that the two-stage estimation and forecasting procedure under the SVD or Poisson regression maximum likelihood approaches can be combined in a single setting. The improved statistical consistency of a single-stage approach is recognised in Cairns et al. (2011). Another key advantage comes from the recent progress in sampling-based techniques for the estimation of state-space models, which allows statistical inference to be conducted on sophisticated state-space models. We take advantage of this development and utilise realistic models aiming to capture long-term mortality dynamics.
A general state-space model consists of a state equation
and an observation equation
where the states ϕ t form a hidden/latent Markov process with disturbance ν t , and the observed time series data z t depends only on ϕ t and an observation disturbance term. Here a(.) and b(.) are possibly non-linear functions, and the states ϕ t and observations z t can be multi-dimensional.
It is clear that the models in Table 1 specify the observation equation of different state-space models that can be considered. For example, for the multi-period LC model (Renshaw & Haberman, 2003), the observed data are $$z_{x,t} = \ln(\hat{m}_{x,t})$$ for different ages x and the latent states are the period effects $$\phi_{t} = \big(\kappa_{t}^{(1)},\ldots,\kappa_{t}^{(k)}\big)$$. We also note that multi-population (i.e. multi-curve) structures can be incorporated in the following state-space models in a number of different ways, and the approaches we develop for estimation will accommodate such settings. In the following sub-sections we discuss a few different classes of mortality models that are difficult to deal with using the approaches mentioned in section 2, but can be handled straightforwardly in the state-space framework.
3.1. LC model with heteroscedasticity (LC-H model)
We present here a state-space formulation of the LC model with a heteroscedasticity structure. In this context, the heteroscedasticity refers to a relaxation of the constant, single degree of freedom, diagonal covariance assumption typically made on the observation vector for each year t across the panel of age group stratifications $$\mathbf{y}_{t} = (y_{x_1,t}, y_{x_2,t},\ldots,y_{x_p,t})$$. Within this state-space model structure we propose an alternative identification constraint which is tailored for estimation under the state-space approach.
The LC model with heteroscedasticity structure can be written in state-space form by combining the processes $$\mathbf{y}_{t} = (y_{x_1,t},\ldots,y_{x_p,t})$$ and κ t into one dynamical system
where $$\boldsymbol{\alpha} = \alpha_{x_1:x_p}$$, $$\boldsymbol{\beta} = \beta_{x_1:x_p}$$ and κ t is the latent state of the resulting linear Gaussian state-space model. Here, Σ is a p by p diagonal matrix with $$\sigma_{\varepsilon,x_1:x_p}^{2}$$ on the diagonal. We refer to this model as the LC-H model, and to the special case with $$\sigma_{\varepsilon,x_i}^{2} = \sigma_{\varepsilon}^{2}$$, i∈{1, … , p}, as the LC model.
Instead of the identification constraint (6), we suggest an alternative constraint which is simpler and more readily applicable to Monte Carlo-based procedures such as MCMC and SMC. Our formulation of the identification constraint is given by setting
Such a choice is a valid identification constraint since, if one element of each of α and β is known (here we have arbitrarily chosen $$\alpha_{x_1}$$ and $$\beta_{x_1}$$; in general one can choose $$\alpha_{x_i}$$ and $$\beta_{x_j}$$, where i, j∈{1, … , p}), then a non-trivial linear transformation in (5) is no longer possible; that is, we must have c=0 and d=1. Note that implementing the proposed constraint is straightforward in both the maximum likelihood and Bayesian settings compared to the constraint (6).
Remark 3.1 As discussed in section 2.4, one needs to be careful about how the constraint (6) is implemented in an MCMC-based sampling setting. An advantage of the proposed constraint (18) is that it avoids this pitfall in such a setup. Note that regardless of which constraint one chooses, a natural interpretation of α and β is that α represents the general pattern of the age-specific log death rates while β captures the sensitivity of log death rates with respect to a change of the period effect. Therefore, one can naturally set the value of $$\alpha_{x_1}$$ to the average of the log death rate at age x 1. On the other hand, there is no restriction on what value $$\beta_{x_1}$$ can take. However, we recommend it to be in the range [0.01, 1], as too small or too large a value may lead to numerical issues such as overflow/underflow problems. We also emphasise here that the resulting fit and forecasting ability of the model will not be affected by the choice of constraint, since an identification constraint serves only to identify the model in a unique way.
3.2. Two-factor LC model with age-based heteroscedasticity (LC2-H model)
A natural extension of the LC-H model is to include a second stochastic factor for the cohort effect. We denote this model by the LC2-H model. The cohort effect (Renshaw & Haberman, 2006) can be modelled under the state-space framework as follows:
where $$\zeta_{t}^{x} := \zeta_{t-x}$$. The state equation can be expressed as
Here, we assume κ t is a random walk with drift process and an autoregressive model of order 1 (AR(1)) is assumed for the cohort effect, that is $$\zeta_{t}^{x_1} = \vartheta \zeta_{t-1}^{x_1} + \omega_{t}^{\zeta_1}$$, where |ϑ|<1. Note that, from (20), we have $$\zeta_{t}^{x_i} = \zeta_{t-1}^{x_{i-1}}$$ for i=2, … , p, which is the defining property of the cohort effect; consequently we are only required to model the dynamics of $$\zeta_{t}^{x_1}$$.
where B is the p by p+1 matrix in (19), C the corresponding p by p sub-matrix in (20) and D a zero p by p matrix except for the (1, 1) element with value 1.
For further details on the formulation, estimation and forecasting of cohort models in the state-space framework, see Fung et al. (2017).
3.3. Three-factor LC model with dynamic time-based heteroscedasticity (LC3-H2 model)
We can further extend the LC2-H model by adding a third factor for the dynamics of volatility in the observation vector over time. Specifically, a time-dependent stochastic factor $$\sqrt{\gamma_{t}^{y}}$$, which is independent of age (or age group), is incorporated into the observation noise term as follows:
where $$\gamma _{t}^{y} $$ is a process obtained via an Euler discretisation of a square Bessel process corresponding to the Cox–Ingersoll–Ross process given by
where 2ab≥σ 2 to ensure $$\gamma _{t}^{y} $$ is strictly positive. Such a dynamic volatility factor can be used to explain time-varying periods of heightened observation variance, which potentially occur in some populations over time. These may be attributed to disease, war, famine, environmental factors or shocks as well as changes in migration and immigration patterns that could influence the volatility of the observed death counts in different age groups. Note that this is one possible way to introduce stochastic volatility to mortality models where the stochastic volatility factor enters via the observation noise term. Another possible choice is to consider stochastic volatility through the state equation described in section 3.4 which will be the focus of this paper.
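To illustrate the volatility factor dynamics, a minimal Python sketch of an Euler discretisation of such a CIR-type process is given below; the exact discretisation and parameterisation used for $$\gamma_{t}^{y}$$ are not reproduced here, so the scheme, the flooring at zero and the parameter names are illustrative assumptions only:

```python
import numpy as np

def simulate_cir_euler(gamma0, a, b, sigma, T, dt=1.0, rng=None):
    """Euler discretisation of a CIR-type volatility factor
        d gamma = a*(b - gamma) dt + sigma*sqrt(gamma) dW,
    where 2*a*b >= sigma**2 keeps the continuous-time process strictly positive.
    The discretised path is floored at zero as a simple safeguard against
    negative excursions of the Euler scheme."""
    assert 2 * a * b >= sigma ** 2, "parameters violate 2ab >= sigma^2"
    rng = np.random.default_rng() if rng is None else rng
    gamma = np.empty(T + 1)
    gamma[0] = gamma0
    for t in range(1, T + 1):
        drift = a * (b - gamma[t - 1]) * dt
        diffusion = sigma * np.sqrt(gamma[t - 1] * dt) * rng.standard_normal()
        gamma[t] = max(gamma[t - 1] + drift + diffusion, 0.0)
    return gamma
```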
3.4. Multi-factor model with stochastic volatility in the latent process (LCSV and LCSV-H models)
A common assumption in mortality modelling is that the period effect is derived from a discretisation of a random walk with drift process. Such a process may be sufficient for modelling simple dynamics, but can be insufficient if time-varying periods of volatility are present in the time series. In fact, much of the literature focusses mainly on capturing the trend of the period effect κ t for the past several decades where mortality time series for many countries are reasonably smooth.
Here, we extend the LC framework to incorporate stochastic volatility in the latent process. As a result, the impact of epidemics, natural disasters, medical breakthroughs or wars on the evolution of mortality can be taken into account. This will produce different structural effects on the calibration and, importantly, on the forecasting when compared to the previously developed LC3-H2 model. We refer to (23a)–(23c) as the LC stochastic volatility model, which we denote as the LCSV model:
The log-volatility process γ t is introduced in the state equation for κ t via the error term ω t . The process γ t is an AR(1) with |λ 1|<1 and its mean-reverting level is given by λ 2/(1−λ 1). A heteroscedasticity structure can be introduced in (23a), and the resulting model will be referred to as the LCSV-H model. Specifically, the LCSV-H model is given by
A cohort effect can also be incorporated into the LCSV model as follows:
Compared to the LC3-H2 model where stochastic volatility is included in the observation noise term, introducing stochastic volatility in the latent period process has the advantage, in terms of simplicity and ease of interpretation, that the variability of mortality data in the time dimension is captured purely by the latent process.
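To fix ideas, the following Python sketch simulates from an LCSV-type structure consistent with the description above (an AR(1) log-volatility process driving the variance of the period-effect innovations); the exact parameterisation of (23a)–(23c) is not reproduced here, so the function and its arguments are illustrative assumptions:

```python
import numpy as np

def simulate_lcsv(alpha, beta, theta, lam1, lam2, sigma_gamma, Sigma_eps, T, rng=None):
    """Simulate log death rates from an LCSV-type model (sketch):
        gamma_t = lam2 + lam1 * gamma_{t-1} + sigma_gamma * u_t   (log-volatility, AR(1))
        kappa_t = kappa_{t-1} + theta + exp(gamma_t / 2) * w_t    (period effect)
        y_{x,t} = alpha_x + beta_x * kappa_t + eps_{x,t}          (observations)
    with u_t, w_t standard normal and eps_t ~ N(0, Sigma_eps)."""
    rng = np.random.default_rng() if rng is None else rng
    p = len(alpha)
    gamma = lam2 / (1.0 - lam1)      # start at the mean-reverting level
    kappa = 0.0
    Y = np.empty((p, T))
    for t in range(T):
        gamma = lam2 + lam1 * gamma + sigma_gamma * rng.standard_normal()
        kappa = kappa + theta + np.exp(gamma / 2.0) * rng.standard_normal()
        Y[:, t] = alpha + beta * kappa + rng.multivariate_normal(np.zeros(p), Sigma_eps)
    return Y
```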
4. Frequentist State-Space Inference
Given these different state-space model structures, the next task is joint single-stage state and parameter estimation. In this section, we consider full likelihood-based joint inference procedures based on filtering and gradient estimation. To achieve this we must describe both filtering in linear Gaussian models and non-linear/non-Gaussian filtering via SMC methods (particle filters), and their application to gradient-based estimation of the marginal likelihood, having integrated out the latent state processes. We do this in a general way and then present particular examples of relevance to this paper.
Under the classical maximum likelihood approach, parameters are estimated by maximising a model’s log-likelihood function. In the case of state-space models of the form (15)–(16), the likelihood comes in two forms: the complete data likelihood, assuming ϕ 0 fixed, is given by
and the marginal likelihood, typically used for the static model-based inference, is given by
where $$\boldsymbol{\psi}$$ denotes the d-dimensional parameter vector of the model. Two challenges now arise. The first is that typically the integral in (27) cannot be evaluated in closed form, except for linear Gaussian state-space systems. The second is that the gradient equations for such a state-space model marginal likelihood, even if they can be calculated in closed form, require a non-linear multiple equation solver. For this reason it is common to adopt a solution based on recursive estimation using gradient and Hessian information from the marginal likelihood. Under the gradient-based approach, the optimal parameter vector can be found iteratively, where the (m+1)th estimate is obtained by
where $$\ell(\boldsymbol{\psi})$$, $$\nabla_{\boldsymbol{\psi}} \ell(\boldsymbol{\psi})$$ and $$-\nabla_{\boldsymbol{\psi}}^{2} \ell(\boldsymbol{\psi})$$ denote the log-likelihood function, the gradient (or score) vector and the information matrix of the log-likelihood function, respectively, defined with respect to the gradient and Hessian differential operators given by
The iterating scheme stops once a certain criterion is met, for example, when the magnitude of the score vector is small enough. This will be illustrated using the LC-H model as an example in section 4.2.
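A generic Python sketch of this iterative score-based update is given below; the functions score and info supplying the gradient and information matrix are assumed to be available (for the LC-H model they follow from the recursions of section 4.2):

```python
import numpy as np

def newton_score_iteration(psi0, score, info, tol=1e-6, max_iter=100):
    """Iterate psi_{m+1} = psi_m + info(psi_m)^{-1} * score(psi_m), where
    score(psi) returns the gradient of the log-likelihood and info(psi) the
    information matrix -d^2 l / d psi^2.  Stops when the score norm drops below tol."""
    psi = np.asarray(psi0, dtype=float)
    for _ in range(max_iter):
        s = score(psi)
        if np.linalg.norm(s) < tol:
            break
        psi = psi + np.linalg.solve(info(psi), s)
    return psi
```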
The results developed are based on the marginal likelihood of the state-space model, with generic static model parameters $$\boldsymbol{\psi}$$, for observations z 1:T = z 1, … , z T , having integrated out the latent states ϕ 1, … , ϕ T ; we denote this marginal likelihood by $$p_{\boldsymbol{\psi}}(\mathbf{z}_{1:T})$$. We are then interested in recursive filtering to integrate the complete data likelihood and obtain the marginalised likelihood, and then in recursive gradient-based estimation to update the static model parameters in a Newton-descent type algorithm, or, for linear Gaussian systems, a recursive least squares-based approach.
As observed in Poyiadjis et al. (2005) and Poyiadjis et al. (2011), it is useful to consider two classes of identities for the gradient and Hessian of the marginalised likelihood, given by Fisher’s identity and Louis’ identity, respectively, according to
where
An important point about these recursions is that the integrals for the gradient vector and Hessian matrix are expressed in terms of the path-space distribution $$p_{\boldsymbol{\psi}}(\boldsymbol{\phi}_{1:T} \mid \mathbf{z}_{1:T})$$. In the case of linear Gaussian dynamics this distribution can be obtained from variations of the Kalman filter recursion; however, when the state-space model is non-linear or non-Gaussian this distribution must be estimated via Monte Carlo methods. The most efficient of these methods for state-space modelling purposes is the class of SMC methods (particle filters). In this case it will be more accurate, from the perspective of the variance of the estimated gradient and Hessian matrices, to utilise filter distribution-based estimators in a recursive fashion based on the local estimates of the distributions $$p_{\boldsymbol{\psi}}(\boldsymbol{\phi}_{t} \mid \mathbf{z}_{1:t})$$, for each t∈{1, … , T}, rather than the path-space estimator based on the distribution $$p_{\boldsymbol{\psi}}(\boldsymbol{\phi}_{1:T} \mid \mathbf{z}_{1:T})$$ at the final time T. The result explaining the difference in estimation precision for the gradient and Hessian, from the perspective of the variance of the solution on the path-space distribution versus the filter distributions, is provided in theorem 1 of Poyiadjis et al. (2011). This motivates the need to work with the filter recursions.
To achieve this one can replace in the Fisher and Louis’ identities the path-space quantities $$p_{\boldsymbol{\psi}}(\boldsymbol{\phi}_{1:T}, \mathbf{z}_{1:T})$$ and $$p_{\boldsymbol{\psi}}(\boldsymbol{\phi}_{1:T} \mid \mathbf{z}_{1:T})$$ by the filter quantities $$p_{\boldsymbol{\psi}}(\boldsymbol{\phi}_{t}, \mathbf{z}_{1:t})$$ and $$p_{\boldsymbol{\psi}}(\boldsymbol{\phi}_{t} \mid \mathbf{z}_{1:t})$$. After this substitution, one may utilise the following recursive formulations to evaluate the gradient and Hessian, see Poyiadjis et al. (2005) and Poyiadjis et al. (2011). In this case Fisher’s identity is recursively given by
The recursive form of Louis’ identity is given by
In general, the solution to these recursions can be achieved via SMC as detailed in Poyiadjis et al. (2005) and Poyiadjis et al. (2011).
In the following sections, we will illustrate the use of these recursive identities for the special case of the LC-H model, where the state-space model takes a linear Gaussian form. In this case the integrals and recursive evaluation of the gradient and Hessian can be written in closed form. To proceed we first introduce the optimal filter recursion, with respect to minimisation of mean-squared error, in the case of linear Gaussian state-space models, known as the Kalman filter (Kalman, 1960; Harvey, 1989).
4.1. Closed form filter recursions for LC-H model
The aim of filtering is to obtain the distribution of the latest state given observations. For a general state-space model, (15)–(16), the filtering density π( ϕ t | z 1:t ) at time t can be calculated sequentially by first assuming the filtering density π( ϕ t−1| z 1:t−1) at time t−1 is known. Then the one-step ahead predictive density for the state is given by
From Bayes’ formula and the structure of the conditional dependency of the state-space model, one can obtain the filtering density as
For non-linear and non-Gaussian state-space models, numerical techniques such as SMC methods are required to estimate the filtering density, see Doucet et al. (2001), Doucet et al. (2000) and Liu (2008).
In the case of the LC-H model, since it is a linear and Gaussian state-space model, the filtering distribution can be obtained analytically via Kalman filtering. In particular, we find that the conditional distributions of the key quantities in the filtering recursions are all Gaussian, as follows:
where the recursive nature of these distributions arises from the recursions of the sufficient statistics:
That is, given the filtering distribution at t−1, (34a), the filtering distribution at t is given by (34d) using (35a)–(35c).
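The following Python sketch implements these standard Kalman filter recursions for the LC-H model, returning the filtered moments (m t , C t ) and the marginal log-likelihood accumulated from the Gaussian one-step-ahead predictive densities; the function signature and the choice of prior moments (m 0, C 0) are illustrative assumptions:

```python
import numpy as np

def kalman_filter_lch(Y, alpha, beta, Sigma, theta, sigma2_w, m0=0.0, C0=10.0):
    """Kalman filter for the LC-H model
        y_t = alpha + beta * kappa_t + eps_t,  eps_t ~ N(0, Sigma) (diagonal),
        kappa_t = kappa_{t-1} + theta + w_t,   w_t ~ N(0, sigma2_w),
    where Y is the (p x T) matrix of log crude death rates."""
    p, T = Y.shape
    m, C = m0, C0
    ms, Cs, loglik = np.empty(T), np.empty(T), 0.0
    for t in range(T):
        # Prediction step for the latent period effect.
        a, R = m + theta, C + sigma2_w
        # One-step-ahead predictive moments of the observation vector.
        f = alpha + beta * a
        Q = R * np.outer(beta, beta) + Sigma
        v = Y[:, t] - f
        Qinv = np.linalg.inv(Q)
        _, logdetQ = np.linalg.slogdet(Q)
        loglik += -0.5 * (p * np.log(2 * np.pi) + logdetQ + v @ Qinv @ v)
        # Update (correction) step; the state is scalar so the gain is a p-vector.
        K = R * (Qinv @ beta)
        m = a + K @ v
        C = R - R * (beta @ Qinv @ beta) * R
        ms[t], Cs[t] = m, C
    return ms, Cs, loglik
```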
4.2. Closed form gradient-based estimation via score and Hessian recursions for LC-H model
For the LC-H model, the log-likelihood function $$\ell(\boldsymbol{\psi}) := \ln \pi(\mathbf{y}_{1:T} \mid \boldsymbol{\psi})$$ is given by
where ν t := y t − f t , and $$\boldsymbol{\psi} = \big(\alpha_{x_2:x_p}, \beta_{x_2:x_p}, \theta, \sigma_{\varepsilon,x_1:x_p}^{2}, \sigma_{\omega}^{2}\big)$$ is an n-dimensional parameter vector. The log-likelihood function can be derived directly from (34c).
It can be shown (Harvey, 1989) that the elements of the score vector and the information matrix are given in closed form for the LC-H model according to the expressions:
where tr[·] denotes the trace operator and
and the expectation operator E[·] on the second term in (38) can be dropped (since the expressions are asymptotically equivalent). In order to evaluate the score vector and the information matrix, we need
and
The expressions (39) and (40) require, for t=1, … , T and i=1, … , n
The expressions in (41) in turn require, for t=1, … , T−1 and i=1, … , n
and
Note that $$\frac{\partial m_{0}}{\partial \psi_{i}} = \frac{\partial C_{0}}{\partial \psi_{i}} = 0$$ for i=1, … , n, and the required differentiation matrices $$\frac{\partial \boldsymbol{\alpha}}{\partial \psi_{i}}$$, $$\frac{\partial \boldsymbol{\beta}}{\partial \psi_{i}}$$, $$\frac{\partial \boldsymbol{\Sigma}}{\partial \psi_{i}}$$, $$\frac{\partial \theta}{\partial \psi_{i}}$$ and $$\frac{\partial \sigma_{\omega}^{2}}{\partial \psi_{i}}$$ are displayed in Appendix A. The gradient-based estimation for the LC-H model is described in Algorithm 1.
5. Bayesian State-Space Inference
In contrast to the classical maximum likelihood approach where parameters are deterministic but unknown, in a Bayesian view the parameters are treated as random variables. In this way the Bayesian paradigm can take parameter uncertainty into account and incorporate in a consistent manner a priori beliefs on the important model parameters, as encoded through the prior.
In this section, we aim to develop modern approaches to Bayesian inference for state-space modelling that do not rely on potentially inefficient sampling approaches based on Gibbs or Metropolis-within-Gibbs for the latent state process. Instead we will introduce to stochastic mortality modelling two classes of Bayesian inference for state-space models:
-
∙ Linear Gaussian stochastic mortality models: a Rao-Blackwellised Gibbs sampler approach based on a combination of Metropolis-within-Gibbs and Gibbs sampling steps for the static model parameters, combined with a forward-backward Kalman filter recursion for the state process. We assume the proposed constraint (18) throughout, avoiding the constraint issue when performing MCMC discussed in section 2.4.
-
∙ Non-linear/non-Gaussian stochastic mortality models: in the case of non-linear and/or non-Gaussian state-space model dynamics, such as the stochastic volatility models LC3-H2 and LCSV, the sampler we develop is based on novel developments of the particle Metropolis–Hastings samplers of Andrieu et al. (2010) adapted to the stochastic mortality models. In particular, we consider a combination of Rao-Blackwellised Kalman filter and particle filter for the latent state process full posterior conditionals, combined with a combination of Metropolis-within-Gibbs and Gibbs sampling steps for the static model parameters.
In general under all the Bayesian approaches we consider, we aim to obtain the joint posterior density:
of the states κ 0:T as well as the parameters $$\boldsymbol{\psi}$$, given the observations y 1:T . We begin with the first case of the linear Gaussian state-space stochastic mortality models and use the LC-H model as an example, where the parameter vector is $$\boldsymbol{\psi} := \big(\alpha_{x_2:x_p}, \beta_{x_2:x_p}, \theta, \sigma_{\varepsilon,x_1:x_p}^{2}, \sigma_{\omega}^{2}\big)$$ since we use the constraint proposed in (18).
5.1. Linear Gaussian state-space inference
We develop an efficient approach involving a combined Gibbs sampling conjugate model sampler for the marginal target distributions of the static model parameters along with a forward-backward Kalman filter sampler for the latent process κ 1:T . A sample of the targeted density is obtained via Gibbs sampling where M is the number of MCMC iterations (Algorithm 2).
The general block Gibbs sampling algorithm requires sampling from the full conditional densities π(κ 0:T | $$\boldsymbol{\psi}$$ , y 1:T ) and π( $$\boldsymbol{\psi}$$ |κ 0:T , y 1:T ), which are shown in the following.
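An illustrative skeleton of this two-block scheme (not Algorithm 2 itself; all function names are hypothetical) is the following; the two conditional samplers it calls are developed in sections 5.1.1 and 5.1.2.

```python
def block_gibbs(y, psi_init, M, sample_kappa, sample_psi):
    """Skeleton of a two-block Gibbs sampler (hedged sketch).

    `sample_kappa(psi, y)` draws kappa_{0:T} from its full conditional (e.g. via FFBS),
    `sample_psi(kappa, y, psi)` draws the static parameters from their full conditionals.
    M is the total number of MCMC iterations.
    """
    psi = psi_init
    draws = []
    for _ in range(M):
        kappa = sample_kappa(psi, y)       # block 1: latent period effect
        psi = sample_psi(kappa, y, psi)    # block 2: static model parameters
        draws.append((kappa, psi))
    return draws
```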
5.1.1. Sampling from the full conditional density π(κ 0:T | $$\bipsi $$ , y 1:T )
Samples of the full conditional density π(κ 0:T | $$\bipsi $$ , y 1:T ) can be obtained via the so-called forward-filtering-backward sampling (FFBS) procedure (Carter & Kohn, Reference Carter and Kohn1994). This sampling approach is also utilised in Pedroza (Reference Pedroza2006). We can write
where the last term in the product, π(κ T | $$\bipsi $$ , y 1:T ), is distributed as N(m T ,C T ) which is obtained from the last iteration of the Kalman filtering procedure.
Once we draw a sample κ T from N(m T , C T ), (45) suggests that we can draw κ t recursively backwards from π(κ t |κ t+1, $$\boldsymbol{\psi}$$ , y 1:t ), for t=T−1, T−2, … , 1, 0. Moreover, we have
where
which can be derived using the Kalman smoother (Petris et al., Reference Petris, Petrone and Campagnoli2009).
The FFBS procedure is displayed in Algorithm 3. Note that the prior distribution for κ 0 can be set to be vague to run the Kalman filter; the output of the algorithm includes the posterior distribution of κ 0.
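A minimal FFBS sketch for the scalar period effect, assuming the LC-H observation equation and a random-walk-with-drift state equation with known static parameters, is given below; the variable names are illustrative and this is not Algorithm 3 verbatim.

```python
import numpy as np

def ffbs(y, alpha, beta, theta, Sigma_eps, sigma2_omega, m0=0.0, C0=10.0, rng=None):
    """Forward-filtering backward-sampling for the scalar period effect kappa_t.

    Observation: y_t = alpha + beta * kappa_t + eps_t,  eps_t ~ N(0, Sigma_eps),
    State:       kappa_t = kappa_{t-1} + theta + omega_t, omega_t ~ N(0, sigma2_omega),
    with y an array of shape (T, p); returns one draw of kappa_0, ..., kappa_T.
    """
    rng = np.random.default_rng() if rng is None else rng
    T = y.shape[0]
    m, C = np.empty(T + 1), np.empty(T + 1)      # filtered moments of kappa_0, ..., kappa_T
    m[0], C[0] = m0, C0
    for t in range(1, T + 1):
        a_t = m[t - 1] + theta                   # prediction step
        R_t = C[t - 1] + sigma2_omega
        f_t = alpha + beta * a_t                 # predicted observation
        S_t = R_t * np.outer(beta, beta) + Sigma_eps
        K_t = R_t * np.linalg.solve(S_t, beta)   # Kalman gain (length p)
        m[t] = a_t + K_t @ (y[t - 1] - f_t)      # update step
        C[t] = R_t - R_t * (K_t @ beta)
    kappa = np.empty(T + 1)                      # backward sampling pass
    kappa[T] = rng.normal(m[T], np.sqrt(C[T]))
    for t in range(T - 1, -1, -1):
        B_t = C[t] / (C[t] + sigma2_omega)
        h_t = m[t] + B_t * (kappa[t + 1] - theta - m[t])
        H_t = C[t] * (1.0 - B_t)
        kappa[t] = rng.normal(h_t, np.sqrt(H_t))
    return kappa
```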
5.1.2. Sampling from the full conditional density π( $$\bipsi $$ |κ 0:T , y 1:T )
The first thing to observe is that under the reparameterisation of the identification constraints (18), the following Gibbs sampling stages can be performed exactly.
We assume that the priors for $$\big(\alpha_{x_{2}:x_{p}}, \beta_{x_{2}:x_{p}}, \theta, \sigma^{2}_{\varepsilon, x_{1}:x_{p}}, \sigma^{2}_{\omega}\big)$$ are given by
where $${\rm IG}\big(\tilde{a}_{\omega}, \tilde{b}_{\omega}\big)$$ denotes an inverse-gamma distribution with mean $$\tilde{b}_{\omega}/(\tilde{a}_{\omega}-1)$$ and variance $$\tilde{b}_{\omega}^{2}/\big((\tilde{a}_{\omega}-1)^{2}(\tilde{a}_{\omega}-2)\big)$$ for $$\tilde{a}_{\omega}>2$$. We assume that the priors for all parameters are independent. In this case the posterior density of each parameter is of the same type as its prior density, a so-called conjugate prior. The posterior distribution for each parameter is given by (for ease of notation we write y = y 1:T , κ = κ 0:T , and $$\boldsymbol{\psi}_{-\lambda}$$ means “the parameter vector $$\boldsymbol{\psi}$$ without the parameter λ”):
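As one hedged illustration of these conjugate updates (a sketch under the random-walk-with-drift state equation for κ; it is not a reproduction of the displayed posteriors), the Gibbs draw for σ²_ω can be written as follows.

```python
import numpy as np

def sample_sigma2_omega(kappa, theta, a_tilde=2.001, b_tilde=0.001, rng=None):
    """One Gibbs step: draw sigma2_omega from its inverse-gamma full conditional.

    Assumed state equation: kappa_t = kappa_{t-1} + theta + omega_t,
    omega_t ~ N(0, sigma2_omega), with prior sigma2_omega ~ IG(a_tilde, b_tilde).
    """
    rng = np.random.default_rng() if rng is None else rng
    resid = np.diff(kappa) - theta                 # kappa_t - kappa_{t-1} - theta, t = 1..T
    a_post = a_tilde + 0.5 * resid.size
    b_post = b_tilde + 0.5 * np.sum(resid ** 2)
    # an IG(a, b) draw is the reciprocal of a Gamma(a, scale=1/b) draw
    return 1.0 / rng.gamma(a_post, 1.0 / b_post)
```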
5.2. Non-linear/non-Gaussian state-space inference
In the case of non-linear/non-Gaussian state-space model dynamics such as the stochastic volatility models of LC3-H2 and the LCSV models, the sampler we develop is based on novel developments of the particle Metropolis–Hastings samplers of Andrieu et al. (Reference Andrieu, Doucet and Holenstein2010) adapted to the stochastic mortality models. In particular, we consider a combination of Rao-Blackwellised Kalman filter and particle filter for the latent state process full posterior conditionals, combined with a combination of Metropolis-within-Gibbs and Gibbs sampling steps for the static model parameters, both embedded within a PMCMC framework. We will illustrate the idea of this methodology for the LCSV model where a stochastic volatility dynamics is included in the latent process for the period effect.
5.2.1. Estimation for the LCSV mortality model
The static parameter vector is denoted by $$\boldsymbol{\psi} = \big(\alpha_{x_{2}:x_{p}}, \beta_{x_{2}:x_{p}}, \theta, \sigma^{2}_{\varepsilon}, \sigma^{2}_{\gamma}, \lambda_{1}, \lambda_{2}, \gamma_{0}\big)$$. Note that we treat γ 0 as a static parameter, and our task is to obtain samples from the joint posterior distribution:
In this setting one can try a number of different approaches: the first is to sample jointly from the full posterior distribution (54) via the PMCMC methods described below; the second is to combine PMCMC methods within a block-Gibbs-based sampler such as the following sampling scheme, where we apply Gibbs sampling to sample from the full conditional densities:
Note that sampling from (55a) can be achieved by the FFBS procedure described in Algorithm 3, as one can apply Kalman filtering since γ 1:T is assumed to be given. The only difference compared to section 5.1.1 is that the term $$\sigma _{\omega }^{2} $$ is replaced by exp{γ t } in Kalman filtering.
In the following, we provide details on how to sample from either the full posterior (54) or from full conditionals such as the density in (55c) via PMCMC methods. Sampling from the posteriors of the static parameters (55b) is detailed in section 5.2.4.
5.2.2. PMCMC for mortality models
In this section, we explain the generic form of the PMCMC methodology that can be applied in a range of approaches for state-space stochastic mortality models. In general, a PMCMC sampler is an MCMC method in which an SMC algorithm is used as a proposal distribution within the MCMC algorithm. Though this may seem a simple device, it rests on a key observation: by using such a filter within the MCMC, the dimension of the acceptance probability in the Metropolis–Hastings acceptance–rejection stage is significantly reduced, which facilitates much better mixing of the resulting Markov chain and reduces estimation variance; see the detailed discussion in Andrieu et al. (Reference Andrieu, Doucet and Holenstein2010).
To bring out the essence of PMCMC, we first discuss a generic approach to sample from a target distribution:
where ϕ 1:T and $$\bipsi $$ are the latent state and static parameters of a general state-space model. Note, the state processes in this context are generally non-linear and potentially non-Gaussian.
From the perspective of obtaining the most efficiently mixing Markov chain to sample from this posterior, the ideal proposal distribution for constructing the Markov chain for $$\left( {{\mib \biphi } '_{{1\: \colon\: T}} ,{\mib \bipsi '} } \right)$$ is easily seen to be given by
where q( $$\boldsymbol{\psi}$$ ′| $$\boldsymbol{\psi}$$ ) is a proposal for the parameters and the proposal for the latent state is $$p_{\boldsymbol{\psi}'}\big(\boldsymbol{\phi}'_{1:T} \mid \boldsymbol{z}_{1:T}\big)$$, the conditional density of the state path given the observations (under $$\boldsymbol{\psi}$$ ′). Here ( ϕ 1:T , $$\boldsymbol{\psi}$$ ) is the current state at MCMC iteration j−1 and ( ϕ ′ 1:T , $$\boldsymbol{\psi}$$ ′) is the proposed next move at MCMC iteration j.
In this case, the acceptance probability of the ideal proposal is given by
where r 1∧r 2:=min(r 1, r 2). A desirable property of the ideal proposal is that the acceptance probability depends only on the marginal likelihood, together with the prior and proposal for the static parameters. This is optimal in the sense that the dimension of the numerator and denominator is reduced to that of the static model parameters, no longer involving the path-space latent process explicitly, a reduction of d×T dimensions for a d-dimensional state vector ϕ t . However, this ideal can never be achieved in practice, as it requires both perfect knowledge of $$p_{\boldsymbol{\psi}'}\big(\boldsymbol{\phi}'_{1:T} \mid \boldsymbol{z}_{1:T}\big)$$ and the ability to sample from this distribution, neither of which is available except in the linear Gaussian case explained in section 5.1.1.
To circumvent this problem, the particle marginal Metropolis–Hastings (PMMH) sampler (Andrieu et al., Reference Andrieu, Doucet and Holenstein2010) applies an SMC method to obtain an approximation of this conditional state density, which then serves as the state proposal:
where $$w_{T}^{(i)}$$ is the importance weight, $$\delta_{x}(\cdot)$$ is a Dirac mass centred at x, and a proposed next move of the latent state path is drawn from this discrete approximating distribution. Moreover, a by-product of an SMC algorithm is the marginal likelihood estimate, $$\hat{p}_{\boldsymbol{\psi}}(\boldsymbol{z}_{1:T})$$, which has the following important property:
Lemma 5.1 An SMC proposal admits as a by-product an unbiased estimator of the marginal likelihood $$p_{\boldsymbol{\psi}}(\boldsymbol{z}_{1:T})$$ given by
where an SMC approximation with N particles produces, for all t
which is an unbiased particle estimate of $$p_{\boldsymbol{\psi}}(z_{t}\mid z_{1:t-1})$$. This non-trivial unbiasedness result was first presented in Del Moral (Reference Del Moral2004) and has since been utilised to great advantage, as explained in Chopin et al. (Reference Chopin, Jacob and Papaspiliopoulos2013). In addition, the variance of this estimator typically grows only linearly with T.
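In practice the estimate is accumulated on the log scale from the particle filter's unnormalised incremental weights; a minimal sketch (assuming the filter has stored these weights in a (T, N) array, a bookkeeping convention that varies across implementations) is:

```python
import numpy as np

def smc_loglik(unnorm_weights):
    """Log of the SMC marginal-likelihood estimate of Lemma 5.1.

    unnorm_weights: array of shape (T, N) holding the unnormalised incremental
    importance weights w~_t^(i) produced by the particle filter at each time step.
    Working on the log scale avoids numerical underflow for long series.
    """
    T, N = unnorm_weights.shape
    return np.sum(np.log(unnorm_weights.sum(axis=1) / N))
```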
The unbiased approximate marginal likelihoods are then used in the acceptance probability (61):
Due to the unbiasedness of the estimated marginal likelihood, Andrieu et al. (Reference Andrieu, Doucet and Holenstein2010) show that, even though only SMC approximations are used (with a finite number of particles N), the invariant distribution of PMMH is the target distribution π( ϕ 1:T , $$\boldsymbol{\psi}$$ | z 1:T ).
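A hedged sketch of a single PMMH iteration follows; the helper functions are assumptions for illustration, with `run_smc` standing for a particle filter that returns a sampled state path and the log of the unbiased likelihood estimate, and `propose_psi` returning a proposed parameter vector together with the log proposal correction.

```python
import numpy as np

def pmmh_step(psi, path, loglik, log_prior, propose_psi, run_smc, rng=None):
    """One particle marginal Metropolis-Hastings iteration (hedged sketch).

    `run_smc(psi)` returns (sampled state path, log unbiased marginal-likelihood estimate);
    `propose_psi(psi)` returns (psi', log q(psi|psi') - log q(psi'|psi)).
    """
    rng = np.random.default_rng() if rng is None else rng
    psi_prop, log_q_corr = propose_psi(psi)
    path_prop, loglik_prop = run_smc(psi_prop)
    log_accept = (loglik_prop - loglik
                  + log_prior(psi_prop) - log_prior(psi)
                  + log_q_corr)
    if np.log(rng.uniform()) < log_accept:
        return psi_prop, path_prop, loglik_prop    # accept the proposed move
    return psi, path, loglik                       # reject and keep the current state
```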
To apply PMCMC for efficient estimation of the LCSV model, we first notice that the posteriors of the static parameters can be obtained explicitly via conjugate priors. As a result we are only required to sample from the density π(γ 1:T | $$\boldsymbol{\psi}$$ , κ 0:n , y 1:T ), instead of the joint density π(γ 1:T , $$\boldsymbol{\psi}$$ |κ 0:n , y 1:T ). It turns out that there is a class of PMCMC algorithms, the particle-independent Metropolis–Hastings (PIMH) sampler, which provides a mechanism to sample from π(γ 1:T | $$\boldsymbol{\psi}$$ , κ 0:n , y 1:T ) with this density as its exact invariant distribution.
Our approach to sampling from the joint posterior distribution π(κ 0:T , γ 1:T , $$\boldsymbol{\psi}$$ | y 1:T ) of the LCSV model is summarised in Algorithm 4.
5.2.3. PIMH: sampling from π(γ 1:T | $$\bipsi $$ , κ 0:n , y 1:T )
We first note that π(γ 1:T | $$\boldsymbol{\psi}$$ , κ 0:n , y 1:T )=π $$_{\boldsymbol{\psi}}$$ (γ 1:T |κ 0:n ) given the structure of the LCSV model. Using an independent proposal density, q $$_{\boldsymbol{\psi}}$$ (γ 1:T |κ 0:n ), in the Metropolis–Hastings algorithm, the acceptance probability is given by
Ideally, one may take q $$_\bipsi $$ (γ 1:T |κ 0:n )=π $$_\bipsi $$ (γ 1:T |κ 0:n ). However, in most cases such an ideal choice is impossible to sample from and to evaluate. The PIMH sampler proposes instead to use the SMC approximation $$\hat{\pi }_{\psi } (\gamma _{{1\: \colon\: T}} \!\mid\!\kappa _{{0\: \colon\: n}} )$$ as the proposal density and calculate the acceptance probability as
where $$\hat{\pi}_{\boldsymbol{\psi}}(\kappa_{0:n})'$$ and $$\hat{\pi}_{\boldsymbol{\psi}}(\kappa_{0:n})[j-1]$$ are the unbiased marginal likelihood estimates obtained by SMC (see Lemma 5.1) at the current MCMC iteration j and the previous iteration j−1, respectively. It can be shown that the invariant distribution of the PIMH sampler is the target distribution π $$_{\boldsymbol{\psi}}$$ (γ 1:T |κ 0:n ) (Andrieu et al., Reference Andrieu, Doucet and Holenstein2010).
It remains to specify an SMC approximation $$\hat{\pi }_{{\mib \bipsi } } \left( {\gamma _{{1\: \colon\: T}} \!\mid\!\kappa _{{0\: \colon\: n}} } \right)$$ (Appendix B). We use the so-called bootstrap filter, that is, the proposal distribution in the SMC algorithm to draw γ t is given by the state equation (23c):
Consequently, the importance weight is evaluated as
where π(κ t |γ t , κ t−1) is the incremental importance weight. Our approach for sampling from π $$_\bipsi $$ (γ 1:T |κ 0:T ) is summarised in Algorithm 5 (together with Algorithm 6).
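A hedged sketch of the resulting PIMH step is given below; `run_smc` is assumed to run the bootstrap filter of Algorithms 5–6 for fixed (ψ, κ) and to return a sampled volatility path together with the log of the unbiased marginal-likelihood estimate. The acceptance ratio depends only on the two likelihood estimates, as in (67).

```python
import numpy as np

def pimh_step(gamma_curr, logml_curr, run_smc, rng=None):
    """One particle-independent Metropolis-Hastings step for the volatility path.

    `run_smc()` returns (proposed path gamma'_{1:T}, log unbiased marginal-likelihood
    estimate); all names are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    gamma_prop, logml_prop = run_smc()
    # acceptance ratio depends only on the two marginal-likelihood estimates
    if np.log(rng.uniform()) < logml_prop - logml_curr:
        return gamma_prop, logml_prop
    return gamma_curr, logml_curr
```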
5.2.4. Sampling from π( $$\bipsi $$ |κ 0:T , γ 1:T , y 1:T )
We assume the priors for $$\big(\alpha_{x_{2}:x_{p}}, \beta_{x_{2}:x_{p}}, \theta, \sigma^{2}_{\varepsilon}, \sigma^{2}_{\gamma}, \lambda_{1}, \lambda_{2}, \gamma_{0}\big)$$ are given by
where x=x 2, … , x p and N [−1, 1] denotes a truncated Gaussian with support [−1, 1]. It is assumed that the priors for all parameters are independent.
Samples from the density π( $$\boldsymbol{\psi}$$ |κ 0:T , γ 1:T , y 1:T ) are obtained by sampling from the following posteriors:
where the posterior distributions are obtained similarly as in section 5.1.2.
6. Empirical Analysis: Danish Male Population
In this section, a comprehensive empirical studyFootnote 4 is conducted on Danish mortality data using the models summarised in Table 2. The LC, LC-H, LCSV and LCSV-H models are described in section 3. While the LC-H model addresses heteroscedasticity in the observation equation, the LCSV model attempts to incorporate stochastic volatility in the state dynamics. The LCSV-H model includes both features of the LC-H model and the LCSV model, thus allowing for a full consideration of variability in long-term mortality dynamics.
Note that we have omitted the LC2-H model (a cohort model) in this analysis as a detailed account of the formulation, Bayesian estimation and forecasting of state-space cohort models is considered in a separate paper, see Fung et al. (Reference Fung, Peters and Shevchenko2017). We are also aware that there is growing interest in the literature in more realistic models for the latent factor; for example, Li et al. (Reference Li, Chan and Cheung2011), van Berkum et al. (Reference van Berkum, Antonio and Vellekoop2014) and Liu & Li (Reference Liu and Li2016b) all study trend-changing stochastic behaviour for the latent period effects. For this reason we focus in our empirical studies on the LCSV and LCSV-H models, where stochastic volatility is introduced in the latent dynamics, instead of the LC3-H2 model, where stochastic volatility enters via the observation equation.
The Human Mortality DatabaseFootnote 5 provides a particularly long time series of mortality data, from 1835 to 2011, for the Danish population, supplemented with a detailed document analysing the data (Andreev, Reference Andreev2002). The provision of a long time series is important to our analysis concerning stochastic volatility. In the past several decades, mortality trends for developed countries have generally exhibited a rather smooth pattern; the inclusion of periods involving wars, epidemics or other life-critical events is crucial for observing significant volatility in mortality time series. In the following, we analyse Danish population mortality based on the models in Table 2 and the Bayesian methodologies studied in this paper. We then examine the models in terms of the forecasting properties of death rates and life expectancies. We also comment on the linear trend assumption and jump-off bias in mortality forecasting.
6.1. Data description
The data set consists of Danish male population death rates for 21 age groups (0, 1–4, 5–9, … , 95–99) from 1835 to 2010, where we fix 2010 as the end year. Figure 1 displays some of the time series of log death rates for the Danish male population. It is clear that the multi-dimensional time series exhibits different volatility for different age groups, which justifies the introduction of heteroscedasticity into the observation equation as discussed in section 3.1. We also observe, mainly before 1950, periods in which the volatility of death rates for some age groups is markedly different. Such a change of volatility in the temporal dimension suggests that stochastic volatility may be present in the underlying time period effect.
6.2. Estimation results
In our empirical study we focus on Bayesian inference and forecasting. We assume vague priors so that all inferences are mainly based on the data and the impact of the prior is not material. Taking the LCSV model as an example, we assume κ 0 ~ N(0, 10), α x ~ N(0, 10), β x ~ N(0, 10), θ ~ N(0, 10), $$\sigma^{2}_{\varepsilon} \sim {\rm IG}(2.001, 0.001)$$, $$\sigma^{2}_{\gamma} \sim {\rm IG}(2.001, 0.001)$$, λ 1 ~ N(0, 10), λ 2 ~ N(0, 10) and γ 0 ~ N(0, 10), where x∈{x 2, … , x p }. The number of iterations of the Markov chain is 15,000 with 5,000 burn-in. We fix $$\alpha_{x_{1}} = \frac{1}{T}\sum_{t=1}^{T} y_{x_{1},t}$$ and $$\beta_{x_{1}} = 0.2$$ as an identification constraint; see Remark 3.1 for a discussion of our choice of values here.
Estimated values of the static parameters (except α and β ) for the Danish mortality data (1835–2010) are shown in Table 3. The rest of the estimated parameters and states are displayed in Figure 2. Here, we only show the plots for the LCSV-H model since the corresponding figures obtained from the LC, LC-H and LCSV model are visually similar to the case of the LCSV-H model.
The ranges in parentheses represent 95% credible intervals.
LC, Lee–Carter; LC-H, LC model with heteroscedasticity; LCSV, LC stochastic volatility; LCSV-H, LCSV model with heteroscedasticity.
It is evident from Figure 2 that there are periods, namely 1850–1870, 1910–1920 and 1930–1950, in which the time effect κ exhibits higher volatility compared with other periods. We also observe that κ accelerates markedly downward after 1990 and is relatively smooth in the recent period 1950–2010. The filtering of the log-volatility process γ 1835:2010 (Figure 2) quantifies the volatility level $$e^{\gamma_{t}}$$ of the time effect and gives further evidence of the stochastic volatility nature of mortality. To see the phenomenon of changing volatility more clearly, we plot the first difference $$\Delta\bar{\kappa}_{t} = \bar{\kappa}_{t} - \bar{\kappa}_{t-1}$$ in Figure 2 for the LCSV-H model, where $$\bar{\kappa}_{t}$$ denotes the posterior mean of κ t , t=1836, … , 2010. It clearly shows the change of volatility level in the latent process κ t . The patterns of the estimated log-volatility γ 1835:2010 and the first difference $$\Delta\bar{\kappa}_{t}$$ clearly suggest that it is not appropriate to assume constant volatility $$\sigma^{2}_{\omega}$$ for the time effect.
The state-space modelling approach is able to uncover the age-specific heteroscedasticity structure hidden in the Danish mortality time series. Figure 2 reveals that variability is particularly high for the very young and very old age groups. Implications of the heteroscedastic structure for forecasting will be discussed in section 6.4.
To investigate the forecasting properties of the stochastic volatility model, we also estimate the models based on calibration periods 1835–1990 and 1950–1990. Figures 3 and 4 show the estimated parameters and states for the LCSV-H model in those periods.
6.3. Model assessment
To compare the fit of the models to the data, we apply the deviance information criterion (DIC) as a Bayesian measure of model complexity and fit (Spiegelhalter et al., Reference Spiegelhalter, Best, Carlin and van der Linde2002). It is common to assess and compare models with latent variables using the conditional DIC (Berg et al., Reference Berg, Meyer and Yu2004; Celeux et al., 2006). Specifically, we use the so-called conditional log-likelihood, which is calculated as
Note that the likelihood is conditional on parameters that include both static parameters and the latent process κ. Using the conditional log-likelihood function, the deviance is defined as
where $${\bf \Psi } $$ =( $$\bipsi $$ , κ 1:T ) and we assume h( y 1:T )=1 since in the models we consider it plays the role of a constant which is the same for competing models. The effective dimension, p D , is evaluated as
where $$\bar{D}({\bf \Psi })$$ and $${\bf \bar{\Psi }}$$ denote, respectively, the mean of D( $${\bf \Psi } $$ ) and the mean of the posterior distribution of $${\bf \Psi } $$ . The conditional DIC is then given by
which can be evaluated straightforwardly using MCMC samples.
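As a hedged illustration of this computation from MCMC output (the helper names and the form of `loglik_fn` are assumptions, not the paper's code), the conditional DIC can be sketched as:

```python
import numpy as np

def conditional_dic(loglik_fn, psi_samples, kappa_samples, y):
    """Conditional DIC from MCMC output (hedged sketch).

    `loglik_fn(psi, kappa, y)` is assumed to return the conditional log-likelihood
    log p(y_{1:T} | psi, kappa_{1:T}); psi_samples and kappa_samples are arrays
    stacking the post-burn-in MCMC draws along axis 0.
    """
    deviances = np.array([-2.0 * loglik_fn(p, k, y)
                          for p, k in zip(psi_samples, kappa_samples)])
    d_bar = deviances.mean()                              # posterior mean deviance
    psi_bar = np.mean(psi_samples, axis=0)                # posterior means of parameters
    kappa_bar = np.mean(kappa_samples, axis=0)
    d_at_mean = -2.0 * loglik_fn(psi_bar, kappa_bar, y)   # deviance at the posterior mean
    p_d = d_bar - d_at_mean                               # effective dimension
    return d_bar + p_d                                    # DIC = D_bar + p_D
```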
The DIC values for the models with different calibration periods are shown in Table 4Footnote 6 . The LCSV-H model clearly outperforms the other models for all calibration periods considered. The LC-H model is the second best; in fact it even outperforms the LCSV model for the long calibration periods 1835–2010 and 1835–1990. These results strongly suggest that a heteroscedasticity structure should be included when modelling Danish mortality data. Note that for long calibration periods, the incorporation of stochastic volatility can markedly improve model fit, as can be seen by comparing the LCSV-H model with the LC-H model, as well as the LCSV model with the LC model.
LC, Lee–Carter; LC-H, LC model with heteroscedasticity; LCSV, LC stochastic volatility; LCSV-H, LCSV model with heteroscedasticity.
6.4. Forecasting
In this section, we investigate the forecasting properties of the mortality models summarised in Table 2, where heteroscedasticity as well as stochastic volatility structures are incorporated. Our analysis is based on the forecasting distributions of (log) death rates and life expectancy. The Bayesian state-space framework allows us to obtain the forecasting distributions using MCMC samples, as shown below.
6.4.1. Death rates
For the LC (LC-H) model, the k-step ahead forecasting distribution of y T+k , given y 1:T , is given by
where $$\bipsi $$ is the parameter vector for the LC (LC-H) model; (88) suggests that we can sample recursively to obtain the forecasting distribution, for k≥1, as follows:
where $$\ell \, {\equals} \, 1,\,\ldots\:,L$$ and L is the number of MCMC iterations after burn-in. Here Σ is a diagonal matrix with $$\sigma _{{{\varepsilon},x}}^{2} $$ on the diagonal for the LC-H model and $$\sigma _{{\varepsilon}}^{2} $$ for the LC model. This procedure generates an estimate of the forecasting distribution.
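A minimal sketch of this recursive simulation, under the random-walk-with-drift dynamics for κ and with illustrative array shapes, might look as follows (not the paper's implementation).

```python
import numpy as np

def forecast_lc(alpha, beta, theta, sigma2_omega, sigma2_eps, kappa_T, horizon, rng=None):
    """Recursive simulation of the k-step-ahead forecasting distribution (LC/LC-H sketch).

    Each argument stacks L post-burn-in MCMC draws along axis 0: alpha, beta and
    sigma2_eps have shape (L, p); theta, sigma2_omega and kappa_T have shape (L,).
    Returns simulated log death rates of shape (horizon, L, p).
    """
    rng = np.random.default_rng() if rng is None else rng
    L, p = alpha.shape
    y_fore = np.empty((horizon, L, p))
    kappa = kappa_T.copy()
    for k in range(horizon):
        # propagate the period effect one year ahead for every MCMC draw
        kappa = kappa + theta + rng.normal(0.0, np.sqrt(sigma2_omega))
        # observation equation with (possibly age-specific) measurement noise
        eps = rng.normal(0.0, np.sqrt(sigma2_eps))
        y_fore[k] = alpha + beta * kappa[:, None] + eps
    return y_fore
```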
Similarly, the forecasting distribution of y T+k , given y T , for the LCSV (LCSV-H) model is given by
For k≥1, the forecasting distribution can be obtained by sampling recursively
where $$\ell \, {\equals} \, 1,\,\ldots\,,L$$ , and Σ is a diagonal matrix with $$\sigma _{{{\varepsilon},x}}^{2} $$ on the diagonal for the LCSV-H model and $$\sigma _{{\varepsilon}}^{2} $$ for the LCSV model.
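The corresponding sketch for the LCSV/LCSV-H recursion, assuming an AR(1) form for the log-volatility state equation, differs from the previous one only in how the period-effect innovation variance is propagated.

```python
import numpy as np

def forecast_lcsv(alpha, beta, theta, lam1, lam2, sigma2_gamma, sigma2_eps,
                  kappa_T, gamma_T, horizon, rng=None):
    """Recursive forecasting for the LCSV/LCSV-H models (hedged sketch).

    Compared with forecast_lc above, the period-effect innovation variance is now
    exp(gamma_{T+k}), with gamma propagated through the assumed AR(1) state equation.
    Arrays stack L MCMC draws along axis 0, as before; names are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    L, p = alpha.shape
    y_fore = np.empty((horizon, L, p))
    kappa, gamma = kappa_T.copy(), gamma_T.copy()
    for k in range(horizon):
        gamma = lam1 + lam2 * gamma + rng.normal(0.0, np.sqrt(sigma2_gamma))
        kappa = kappa + theta + rng.normal(0.0, np.exp(0.5 * gamma))
        y_fore[k] = alpha + beta * kappa[:, None] + rng.normal(0.0, np.sqrt(sigma2_eps))
    return y_fore
```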
Figure 5 shows the forecasted log death rates based on the LC-H, LCSV and LCSV-H model, using the LC model as a benchmark. We show age groups 5–9, 35–39, 65–69 and 95–99 as representatives of young, adult, old and very old age. The models are estimated using data for the period 1835–2010 and forecast for 30 years.
The heteroscedasticity structure of the LC-H model gives rise to materially larger forecasting intervals for the young and very old age groups, while the forecasting interval for the age group 35–39 is narrower than that predicted by the LC model. The LCSV model, on the other hand, produces wider forecasting intervals than the LC model except for the very old age group. The observed wider forecasting intervals are due to the fact that the volatility level is increasing towards the end of the estimation period and is larger than $$\sigma^{2}_{\omega}$$ estimated in the LC model. Moreover, as the estimated β x is close to zero at older ages (Figure 2), the impact of the forecasted κ on the prediction of death rates diminishes significantly as older ages are considered. The LCSV-H model exhibits features of both the LC-H and the LCSV model. It is interesting to note that the forecasted means obtained from the different models are very similar; their differences mainly lie in the forecasting intervals.
To illustrate further the forecasting properties of the LCSV model, we estimate the models for the period 1835–1990 and plot 20-year out-of-sample forecasted log death rates in Figure 6. It turns out that the forecasting intervals predicted by the LCSV model tend to be narrower than those of the LC model, as the estimated $$\sigma^{2}_{\omega}$$ in the LC model is larger than the volatility level at the end of the estimation period for the LCSV model in this case. Note that the forecasted distributions produced by the LC-H model are biased relative to the benchmark LC model since the fitted rates at the end of the estimation period, that is year 1990, differ between the LC and LC-H models. This feature is known as jump-off error (Lee & Miller, Reference Lee and Miller2001). One may remove this jump-off bias by forcing the forecasted death rates to start at the actual rates instead of the fitted rates (Bell, Reference Bell1997; Shang et al., Reference Shang, Booth and Hyndman2011); in this paper, however, we do not perform this adjustment.
Figure 7 shows the forecasting distributions of log death rates when we assume the shorter calibration period from 1950 to 1990. For all the models, the estimated β x for all age groups, except age groups 0, 1–4 and 5–9, are very close to zero. This is in fact expected since, apart from the first few age groups, there is no clear downward trend in the observed mortality data during the period 1950–1990. Therefore, there is only a small difference between the forecasting distributions produced by the LC model and the LCSV model, except for the young age groups. Note that there is a clear change of the downward trend for some of the middle age groups of the Danish male mortality data, as shown in Figure 7. It results in the out-of-sample data falling outside the lower bound of the 95% credible intervals; its consequences for the forecasting of life expectancy are discussed in section 6.4.2 and more generally in section 6.4.3.
By comparing the forecast performance using in-sample data from 1835 to 1990 and from 1950 to 1990, displayed in Figures 6 and 7, we expose the influence that leaving out important historical events, which may affect the mortality rates of a population markedly, can have on the ability to accurately model trend and volatility structures in population dynamics. In particular, we observe that one must be cautious, as forecast performance can degrade markedly when important historical events are excluded from the sample: the forecast using data from 1835 to 1990 clearly outperforms the forecast using only the shorter calibration data from 1950 to 1990.
6.4.2. Life expectancy
Using the samples of the forecasted log death rates $$y_{x,t}^{(\ell)} = \ln \hat{m}_{x,t}^{(\ell)}$$, where ℓ=1, … , L and L is the number of MCMC samples, we can obtain the so-called period life expectancy at different ages by constructing an abridged life table (since we use age-group data) as follows (Koissi et al., Reference Koissi, Shapiro and Hognas2006; Yusuf et al., Reference Yusuf, Martins and Swanson2014). We consider age groups x∈{0, 1–4, 5–9, … , 95–99} and define $$\tilde{x}$$ as the initial age of age group x, that is $$\tilde{x}\in\{0, 1, 5, \ldots, 90, 95\}$$. Define $$n_{\tilde{x}}$$ as the length of the interval of age group x (corresponding to $$\tilde{x}$$), so that n 0=1, n 1=4, n 5=5, … , n 95=5. We then calculate the (crude) death probabilityFootnote 7 that a person aged $$\tilde{x}$$ in year t will die within the next $$n_{\tilde{x}}$$ years as
where $$a(\tilde{x}, n_{\tilde{x}})$$ is the average fraction of the $$n_{\tilde{x}}$$ years lived in that interval by people initially aged $$\tilde{x}$$. Using the assumption that deaths are distributed uniformly within the interval, we set $$a(\tilde{x}, n_{\tilde{x}}) = 0.5$$ for every $$\tilde{x}$$ Footnote 8 . The hypothetical number of people alive at age $$\tilde{x}+n_{\tilde{x}}$$, $$l_{\tilde{x}+n_{\tilde{x}},t}^{(\ell)}$$, is determined by $$l_{\tilde{x}+n_{\tilde{x}},t}^{(\ell)} = l_{\tilde{x},t}^{(\ell)}\big(1 - {}_{n_{\tilde{x}}}q_{\tilde{x},t}^{(\ell)}\big)$$, where $$l_{0,t}^{(\ell)}$$ is assumed to be 100,000. We can then calculate the number of deaths $${}_{n_{\tilde{x}}}d_{\tilde{x},t}^{(\ell)} = l_{\tilde{x},t}^{(\ell)} - l_{\tilde{x}+n_{\tilde{x}},t}^{(\ell)}$$ and the person-years lived $${}_{n_{\tilde{x}}}L_{\tilde{x},t}^{(\ell)} = n_{\tilde{x}}\big(l_{\tilde{x}+n_{\tilde{x}},t}^{(\ell)} + a(\tilde{x}, n_{\tilde{x}})\times{}_{n_{\tilde{x}}}d_{\tilde{x},t}^{(\ell)}\big)$$. The total future lifetime of the $$l_{\tilde{x},t}^{(\ell)}$$ persons who attain age $$\tilde{x}$$ is $$T_{\tilde{x},t}^{(\ell)} = \sum_{i\geq\tilde{x}} {}_{n_{i}}L_{i,t}^{(\ell)}$$, where i∈{0, 1, 5, … , 90, 95}. Finally, a sample of the period life expectancy at age $$\tilde{x}$$ is obtained as
and the distributions are obtained for different forecasting years t=T+k, where k≥1.
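A hedged sketch of this abridged life-table construction follows; the crude death probability uses the standard actuarial approximation $${}_{n}q_{\tilde{x}} \approx n m_{\tilde{x}}/(1+(1-a)n m_{\tilde{x}})$$, which is assumed here rather than taken from the paper, and the last group is closed by setting its death probability to one.

```python
import numpy as np

def period_life_expectancy(log_m, n_x, a=0.5, radix=100_000.0):
    """Period life expectancy at each initial age from an abridged life table (sketch).

    log_m : forecasted log death rates for the age groups (0, 1-4, 5-9, ..., 95-99);
    n_x   : interval length of each age group, e.g. [1, 4, 5, ..., 5];
    a     : average fraction of the interval lived by those dying in it (0.5 here).
    """
    m = np.exp(np.asarray(log_m))
    n_x = np.asarray(n_x, dtype=float)
    # crude probability of dying within each interval (assumed approximation)
    q = n_x * m / (1.0 + (1.0 - a) * n_x * m)
    q[-1] = 1.0                                   # close the table at the last group
    l = np.empty(q.size + 1)                      # survivors at the start of each interval
    l[0] = radix
    for i in range(q.size):
        l[i + 1] = l[i] * (1.0 - q[i])
    d = l[:-1] - l[1:]                            # deaths in each interval
    Lx = n_x * (l[1:] + a * d)                    # person-years lived in each interval
    Tx = np.cumsum(Lx[::-1])[::-1]                # total future lifetime from each age
    return Tx / l[:-1]                            # life expectancy at each initial age
```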
Remark 6.1 (Period and cohort life expectancy): Period life expectancy assumes there is no trend in future death rates (it is evaluated based on the age-specific death rates in a fixed year t), while cohort life expectancy assumes death rates follow the lifetime of a cohort and hence takes the mortality trend into account. For example, to evaluate the period life expectancy at age 65 in year t, one needs $$\{{}_{5}q_{65,t}, {}_{5}q_{70,t}, \ldots, {}_{5}q_{95,t}\}$$, while for the cohort life expectancy $$\{{}_{5}q_{65,t}, {}_{5}q_{70,t+5}, \ldots, {}_{5}q_{95,t+30}\}$$ is used instead. However, the cohort life expectancy for people born in recent years cannot be evaluated using data alone, since some of the death rates are yet to be observed. As a result we focus on period life expectancy so that our forecasts can be compared with the observed data.
Figure 8 shows the forecasted life expectancy at birth, age 65 and age 85 for all the models estimated using data from 1835 to 2010. Interestingly, the forecasted life expectancy at birth is very similar for the LC model and the LC-H model for the calibration period 1835–2010. This reflects the fact that the forecasting intervals of death rates produced by the LC-H model are wider for some age groups and narrower for others, compared to the LC model, as can be seen from Figure 5. It turns out that these effects almost cancel each other out in this case as death rates are aggregated over all age groups to form the life expectancy at birth distribution. This explanation does not apply to life expectancy at ages 65 and 85, however, since only forecasted death rates for age groups above 65 and 85, respectively, are used to obtain the corresponding life expectancy distribution. As the forecasting intervals of death rates generated by the LC-H model tend to be narrower for old age groups compared with the LC model, the intervals for the forecasted life expectancy at ages 65 and 85 produced by the LC-H model are observably narrower than those of the LC model. Note that we only show four different age groups in Figure 5; projected death rates for other age groups are also relevant but are not displayed due to limited space. In contrast to the LC-H model, we observe from Figure 5 that the LCSV model produces generally wider forecasting intervals for death rates across age groups compared to the LC model. Consequently, the forecasting intervals for life expectancy at various ages generated by the LCSV model are wider than those of the LC model, as shown in Figure 8. Similarly to the case of death rate forecasting, the LCSV-H model has the features of both the LC-H and LCSV models in terms of life expectancy prediction.
Results for forecasted life expectancy using data from 1835 to 1990 are shown in Figure 9. It is clear that the forecasting intervals for life expectancy at different ages produced by the LCSV model are narrower than those of the LC model, which can be traced back to the fact that the LCSV model predicts narrower forecasting intervals for death rates at various age groups than the LC model, see Figure 6. The jump-off biases are apparently quite different for the LC-H (LCSV-H) model and the LC model. Note also that the “cancel out” effect of aggregated death rates (Figure 6) for the life expectancy at birth under the LC-H model is less prominent for the calibration period 1835–1990 than for the period 1835–2010.
As we use the fitted death rates instead of the observed death rates in the jump-off year (i.e. year 2010), there is a jump-off bias in the forecasted death rates. The forecasted life expectancy at age 65 is particularly sensitive to this jump-off bias. It arises from a sudden decline of death rates for age groups above 65 beginning in year 1990, hence a significant increase in life expectancy at age 65 is observed. The jump-off bias is significantly smaller when the calibration period 1835–1990 is considered, see Figure 9. As expected, the forecasted distributions of life expectancy at birth and at age 65 are similar for all the models estimated using mortality data from 1950 to 1990 (Figure 10). Note that the 95% credible intervals capture the out-of-sample data poorly in this case, except for the life expectancy at age 85. This is a consequence of the sudden change to a significant downward trend in the death rates observed for the middle age groups of the Danish mortality data starting around 1990, see Figure 7, as well as of the jump-off bias. We discuss the linear trend assumption and jump-off bias in the next section.
6.4.3. Linear trend assumption and jump-off bias
In performing the forecasting of death rates and life expectancies, we use the models summarised in Table 2, where the LC-H, LCSV and LCSV-H models are variants of the LC model in which a linear trend of the period effect is assumed. However, for the Danish male mortality data that we use, the overall trend is reasonably linear for the whole period 1835–2010, but the same may not be said for the shorter period 1950–2010, as Figures 6 and 7 indicate.
In particular, we observe in Figure 7 that there is a clear change of trend in the death rates of the middle age groups. Such a change of trend is difficult, if not impossible, to predict in terms of timing and magnitude. We also performed the analysis on French male mortality data and found similar patterns. For forecasting purposes, one may therefore argue that expert opinion will be an important factor in predicting mortality. Even though any change of mortality trend in the short run cannot be predicted with reasonable accuracy using data alone, it can be detected if the instantaneous volatility of mortality is quantified. For example, using the LCSV-H model, the log-volatility γ is quantified and we observe from Figure 2 that γ started to increase around 1990 after several decades of decline. The change of volatility level not only affects the prediction intervals, as discussed in previous sections, but also indicates that the variability of mortality is heightened and that one should be cautious as to whether a change of trend is taking place.
We also find that jump-off bias, discussed in sections 6.4.1–6.4.2, is an important factor in predicting death rates and life expectancies. We note that it is straightforward to remove the jump-off bias by adjusting (89)–(91) so that the actual death rates, instead of the fitted death rates, are used at the beginning of the forecasting period. We do not provide the corresponding plots in the paper, but one can envisage the results simply by shifting the forecasted distribution so that the forecasted mean is attached to the (in-sample) data at the end of the estimation period. Removing the jump-off bias will have a significant impact on the accuracy of mortality forecasting, especially when data exhibit clear trending.
7. Concluding Remarks
We developed and presented a comprehensive state-space framework for stochastic mortality modelling. The state-space approach has two key advantages. First, it puts modelling, estimation and forecasting of mortality in a unified framework, in contrast to common practice in this area. Second, the methodology permits realistic and sophisticated mortality models to be estimated and forecasted, which could be difficult to handle using other approaches.
We show that many of the popular mortality models existing in the literature can be cast in state-space form. We then suggest several classes of mortality models that can be classified as linear Gaussian state-space models and non-linear/non-Gaussian state-space models. Our proposals are not exhaustive but aim to illustrate the flexibility of the methodology. In particular, we incorporate heteroscedasticity and stochastic volatility in mortality modelling, as an examination of mortality data suggests that the volatility of death rates is not constant in the age and time dimensions over a long time period. Moreover, we propose an alternative identification constraint for LC-type models, which is tailored for the state-space approach.
Frequentist state-space inference for stochastic mortality models is carried out and explained based on the gradient and Hessian of the marginalised likelihood developed recently in the statistics literature. We also utilise a modern approach to Bayesian inference for state-space modelling based on the PMCMC framework. In particular, we develop a sampler using a combination of a Rao-Blackwellised Kalman filter and a particle filter for the full posterior conditionals of the latent state processes, combined with Gibbs sampling steps for the static model parameters, to estimate the stochastic volatility model for mortality proposed in this paper.
Using mortality data for the Danish male population, we assess the extended models based on the conditional deviance information criterion. It is found that incorporating heteroscedasticity is a crucial improvement factor in model fitting, even after model complexity is accounted for. The incorporation of stochastic volatility clearly enhances model performance when fitting long-term mortality time series, and estimation results for long calibration periods support the assumption of stochastic volatility. We show that forecasting can be carried out straightforwardly in the state-space framework under a Bayesian setting, and we examine the forecasting properties of the models using different calibration periods. The inclusion of heteroscedasticity and stochastic volatility substantially affects the prediction intervals of death rate and life expectancy distributions. The linear trend assumption commonly found in mortality modelling and jump-off bias are discussed in light of the Danish mortality data.
The state-space framework provides attractive features that are of importance to mortality modelling. The methods and results developed and presented in this paper have significant implications for longevity risk management in actuarial applications. In particular, we anticipate that the models and methods introduced here can be employed in longevity risk applications such as the state-space longevity hedging methods proposed in Liu & Li (Reference Liu and Li2016a) and Liu & Li (Reference Liu and Li2016b), which is a topic of future research.
Acknowledgements
This research was supported by the CSIRO-Monash Superannuation Research Cluster, a collaboration among CSIRO, Monash University, Griffith University, the University of Western Australia, the University of Warwick and stakeholders of the retirement system in the interest of better outcomes for all. This research was also partially supported under the Australian Research Council’s Discovery Projects funding scheme (project number: DP160103489). The authors also acknowledge project “Research on Urban Intelligence” from Research Organization of Information and Systems in Japan.
Appendix A: Differentiation Matrices in Gradient-Based Estimation
For the LC-H model, the parameter vector is denoted by $$\boldsymbol{\psi} = \big(\alpha_{x_{2}:x_{p}}, \beta_{x_{2}:x_{p}}, \sigma^{2}_{\varepsilon, x_{1}:x_{p}}, \theta, \sigma^{2}_{\omega}\big)$$ with dimension n=3p, where p is the number of age groups considered. We are required to evaluate $$\frac{\partial \boldsymbol{\alpha}}{\partial \psi_{i}}$$, $$\frac{\partial \boldsymbol{\beta}}{\partial \psi_{i}}$$, $$\frac{\partial \boldsymbol{\Sigma}}{\partial \psi_{i}}$$, $$\frac{\partial \theta}{\partial \psi_{i}}$$ and $$\frac{\partial \sigma^{2}_{\omega}}{\partial \psi_{i}}$$ in the gradient-based estimation (section 4.2). Define
Then we have
where δ ij =1 if j=i and zero otherwise; Σ is a diagonal matrix with diagonal $$\sigma^{2}_{\varepsilon, x_{1}:x_{p}}$$. Note that $$\frac{\partial (\boldsymbol{\alpha})_{1}}{\partial (\boldsymbol{\psi}_{\alpha})_{i}} = \frac{\partial (\boldsymbol{\beta})_{1}}{\partial (\boldsymbol{\psi}_{\beta})_{i}} = 0$$ for i=1, … , p−1, where $$\boldsymbol{\alpha} = \alpha_{x_{1}:x_{p}}$$ and $$\boldsymbol{\beta} = \beta_{x_{1}:x_{p}}$$.
Appendix B: A Review of SMC Method
SMC, also known as particle filtering, can be viewed as a generalisation of Kalman filtering in the state-space modelling context. The method is based on importance sampling and has become an essential sampling-based tool in many domains (Doucet et al., Reference Doucet, De Freitas and Gordon2001). In the following, we give a brief review of the method using the LCSV model, (23b)–(23c), as an example to derive a basic particle filtering algorithm. Our target density is the joint posterior distribution of the stochastic volatility states:
where the parameters of the model are assumed to be known and are suppressed here for ease of notation. To apply importance sampling, we first calculate
The importance density is assumed to satisfy
and the importance weight is given by
where $$\hat{w}_{t}$$ is called the incremental importance weight. The normalised importance weights are then obtained as $$w_{t}^{(i)} := \tilde{w}_{t}^{(i)} \big/ \sum_{j=1}^{N} \tilde{w}_{t}^{(j)}$$. To summarise, suppose we have N particle paths $$\big(\gamma_{1:t-1}^{(i)}, w_{t-1}^{(i)}\big)_{i=1}^{N}$$ approximating the density $$\pi(\gamma_{1:t-1}\mid\kappa_{0:t-1})$$ at time t−1. Then, from (96), the ith particle path at time t is given by $$\gamma_{1:t}^{(i)} = \big(\gamma_{1:t-1}^{(i)}, \gamma_{t}^{(i)}\big)$$, where $$\gamma_{t}^{(i)}$$ is sampled from $$g_{t}\big(\gamma_{t}\mid\gamma_{1:t-1}^{(i)}, \kappa_{0:t}\big)$$. The target density π(γ 1:t |κ 0:t ) is approximated by $$\big(\gamma_{1:t}^{(i)}, w_{t}^{(i)}\big)_{i=1}^{N}$$, where the normalised weight $$w_{t}^{(i)}$$ is obtained from (97) after normalisation.
The problem of degeneracy, that is, that a majority of the particle paths may have negligible weight, can be handled by resampling. Specifically, we define the so-called effective sample size
At each time t, if N eff is smaller than some threshold (e.g. 80% of N), we draw N samples (denoted by N(i), i=1, … , N) from a multinomial distribution with probability weights $$w_{t}^{(i)}$$, i=1, … , N, replace the particle paths $$\gamma_{1:t}^{(i)}$$ by $$\gamma_{1:t}^{(N(i))}$$, and set $$w_{t}^{(i)} = 1/N$$. The resampling step keeps particle paths in proportion to their weights and tends to discard those with negligible weights.
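A hedged sketch combining these steps, a bootstrap filter with adaptive multinomial resampling for the LCSV log-volatility path, is given below; the AR(1) form of the state equation and all names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def bootstrap_filter_lcsv(kappa, theta, lam1, lam2, sigma2_gamma, gamma0,
                          N=1000, ess_frac=0.8, rng=None):
    """Bootstrap particle filter for the log-volatility path of the LCSV model (sketch).

    Assumed state equation: gamma_t = lam1 + lam2 * gamma_{t-1} + eta_t,
    eta_t ~ N(0, sigma2_gamma); "observation" density:
    kappa_t | kappa_{t-1}, gamma_t ~ N(kappa_{t-1} + theta, exp(gamma_t)).
    kappa holds kappa_0, ..., kappa_T; returns particle paths, normalised weights
    and the log of the marginal-likelihood estimate of Lemma 5.1.
    """
    rng = np.random.default_rng() if rng is None else rng
    T = len(kappa) - 1
    paths = np.empty((N, T))
    w = np.full(N, 1.0 / N)
    gamma_prev = np.full(N, float(gamma0))
    logml = 0.0
    for t in range(T):
        # propagate particles through the (assumed) AR(1) state equation
        gamma_t = lam1 + lam2 * gamma_prev + rng.normal(0.0, np.sqrt(sigma2_gamma), N)
        paths[:, t] = gamma_t
        # incremental weight: Gaussian density of kappa_{t+1} given kappa_t and gamma_{t+1}
        var_t = np.exp(gamma_t)
        incr = np.exp(-0.5 * (kappa[t + 1] - kappa[t] - theta) ** 2 / var_t) \
               / np.sqrt(2.0 * np.pi * var_t)
        w_unnorm = w * incr
        logml += np.log(w_unnorm.sum())          # running marginal-likelihood estimate
        w = w_unnorm / w_unnorm.sum()
        ess = 1.0 / np.sum(w ** 2)               # effective sample size
        if ess < ess_frac * N:                   # adaptive multinomial resampling
            idx = rng.choice(N, size=N, p=w)
            paths[:, :t + 1] = paths[idx, :t + 1]
            gamma_t = gamma_t[idx]
            w = np.full(N, 1.0 / N)
        gamma_prev = gamma_t
    return paths, w, logml
```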