1. Introduction
Covid-19 (The Novel Coronavirus Pneumonia Emergency Response Epidemiology Team, 2020) caused mortality shocks in many countries – Figures 1(b) and (d) show the sharp increase in death counts in England and Wales in 2020 compared to 2019. These shocks were also present in insurer data sets; Richards (2022a) demonstrated Covid-related mortality shocks in annuity portfolios in France, the UK and the USA. The other strong feature of Figures 1(b) and (d) is the inflection point of 2011, but this paper is concerned with outliers rather than trend changes.
Actuaries typically split mortality bases into two components: (i) current levels and (ii) future improvements. For current mortality levels actuaries use a portfolio’s own recent experience, often using individual lifetimes and multiple covariates; Richards (2022b) presents a methodology allowing for Covid-19 when conducting such analysis. However, to obtain a long enough time series for forecasting future improvements, actuaries typically use population data. Such data have a very different structure, namely grouped counts at individual ages with no covariate information beyond separate data for males and females (Macdonald et al., 2018, Chapter 10).
Many insurers use stochastic mortality models calibrated to population data for future-improvement modelling, and thus risk management, solvency and reporting. There is a broad selection of available stochastic mortality models, each with different quantitative and qualitative properties (Cairns et al., 2009). However, few of these models are designed to cope with outliers, such as pandemic shock mortality. This is a problem for actuaries, as “the presence of even a few anomalous data points can lead to model misspecification, biased parameter estimation and poor forecasts” (Galeano et al., 2006, p. 654). The 2020 mortality experience constitutes just such a distorting anomaly.
One easy approach is just to ignore the affected data. This is only a short-term solution, however, as the 2020 and 2021 experience will over time move from the trailing edge towards the middle of the exposure period. This “deletion approach” is subjective, and two parties may not always agree on what counts as an outlier; Figures 6 and 7 give an example for M9, where it is by no means obvious from visual inspection that 2020 is an outlier. Indeed, outliers can sometimes be masked (Hadi, 1992), and specific examples of this are given in Section 5 and Figure 8. Finally, there is an insurance-specific aspect, whereby actuaries repeatedly refit models to measure recalibration risk. An important example is value-at-risk calculations for longevity trend risk under Solvency II (Richards et al., 2014). For such tasks an automated, objective outlier-detection procedure is required.
The layout of the rest of this paper is as follows: Section 2 describes the data set used; Section 3 outlines some other approaches to dealing with mortality outliers; Section 4 discusses a methodology for robustifying forecasts of univariate time indices, such as those in the Lee-Carter and APC models; Section 5 describes approaches for multivariate random walks, such as those in the Cairns-Blake-Dowd and Tang-Li-Tickle model families; Section 6 presents results for the 2D $P$-spline model; Section 7 provides conclusions.
2. Data Description
The data consist of observed deaths in England and Wales, ${d_{x,y}}$ , at age $x$ last birthday in calendar year $y$ , together with corresponding mid-year population estimates, $E_{x,y}^c$ . The data thus lend themselves to modelling the central mortality rate, ${m_{x,y}}$ , without adjustment. Alternatively, the data can be viewed as suitable for modelling the mortality hazard at age $\left( {x + 1/2} \right)$ at time $\left( {y + 1/2} \right)$ .
The data are sourced from the Human Mortality Database (HMD) and we use the subset $x \in \left\{ {50,51, \ldots, 105} \right\}$ and $y \in \left\{ {1971,1972, \ldots, 2020} \right\}$. We therefore have ${n_x} = 56$ ages and ${n_y} = 50$ years. The age category $x = 105$ technically contains all ages 105 and over, but as the numbers are very low we treat all those aged 105 and over as being age 105 for simplicity. We alternate our illustrations between males and females for variety, but both sexes exhibit similar forecasting problems caused by pandemic-affected mortality. Figure 1 shows the marginal death counts, with panels (b) and (d) showing the unusually large number of deaths in 2020 due to the Covid-19 pandemic.
Models in this paper assume that the number of deaths at age $x$ in year $y$ follows a Poisson distribution. With the exception of the 2D penalised-spline model, no allowance is made for over-dispersion in the Poisson counts. With the exception of the Lee-Carter model, models are fitted as penalised constrained generalised linear models using the algorithm of Currie (2013).
3. Some Other Approaches to Outliers
3.1. Ignoring Affected Data
The immediate response by most actuaries was to ignore the 2020 data and to continue using models calibrated up to 2019. This was a practical, short-term approach to a highly unusual situation. However, ignoring the mortality experience of 1 or 2 years is not a permanent solution, and more rigorous approaches are required. This is particularly true once the affected years move towards the middle of the data series, as they have at the time of writing.
3.2. Robust Likelihood Estimation
Maximum-likelihood methods can be very sensitive to the presence of outliers in the data. To illustrate, consider an observation, $z$, supposedly from a N(0,1) distribution. The contribution of this observation to the negative log-likelihood is quadratic (ignoring an additive constant):

$$- \ell \left( z \right) = {1 \over 2}{z^2} \qquad \left( 1 \right)$$
which means that an outlier has an outsized – and unbounded – influence on estimation. As a result, much early work on time-series robustification focused on robustifying the likelihood through the use of $\rho $ functions. A $\rho $ function is designed to replace the usual quadratic contribution in Equation (1) with a function that is non-quadratic for extreme values. An ideal $\rho $ function should behave approximately quadratically for observations a modest distance from the mean, but it should limit the contribution of extreme values (unlike Equation (1), where an outlier can make an unlimited contribution, thus causing bias). There are many options for such functions, such as in Hampel (1974) and Maronna et al. (2006, Section 2.2.4). One early example is the $\rho $ function from Huber (1964):

$$\rho \left( z \right) = \left\{ {\begin{array}{ll} {{1 \over 2}{z^2},} & {\left| z \right| \le k} \\ {k\left| z \right| - {1 \over 2}{k^2},} & {\left| z \right| > k} \end{array}} \right.$$
which is shown in Figure 2 for $k = 1$. The Huber $\rho $ function is identical to the quadratic $-\ell$ function of Equation (1) within one standardised deviation of the mean, but extreme observations make a linearly increasing contribution to the robustified log-likelihood instead of a quadratically increasing one. The extent to which an outlier contributes to the robustified log-likelihood depends on the value of $k$, which requires some judgement.
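As a concrete illustration, the following minimal R sketch implements Huber’s $\rho$ function and plots it against the unbounded quadratic contribution of Equation (1); the function name huber_rho is ours, not from any package.

```r
# Huber's rho function: quadratic within k standardised deviations of the
# mean, linearly increasing beyond.
huber_rho <- function(z, k = 1) {
  ifelse(abs(z) <= k, z^2 / 2, k * abs(z) - k^2 / 2)
}

z <- seq(-4, 4, by = 0.01)
plot(z, z^2 / 2, type = "l", lty = 2,
     ylab = "Contribution to negative log-likelihood")
lines(z, huber_rho(z))  # bounded tail growth limits the influence of outliers
```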
ARIMA models are a particularly useful representation of stochastic univariate mortality processes, and Martin et al. (1983) examined in detail how to robustify the conditional log-likelihood function for such models. One benefit of this approach is the ability to calculate a “clean” version of an affected time series, thus allowing (i) the estimation of an outlier effect, and (ii) the calculation of a sensible forecast start point where the most recent observations contain outliers.
However, robustifying the log-likelihood involves trade-offs. One is reduced estimation efficiency: in Figure 2, observations around 1.5–2 standardised deviations away from the mean make less of a contribution to the log-likelihood than they should in theory, thus making less than full use of all available data. Maronna et al. (2006, Section 5.9.1) presented an “optimal” function that balances robustness and efficiency, but it still does not achieve full efficiency. This is a particular issue for mortality-forecasting work, where actuaries often only use data for the past 50 years or so. A specific concern is efficient estimation of the variance of the innovation process, as this plays a large role in determining value-at-risk capital requirements (Kleinow & Richards, 2016, Section 7).
3.3. Weighting
Daneel et al. (2022) introduced a weighting system for the likelihood, initially with zero weights for the pandemic-affected years 2020 and 2021. This was later extended to fractional weights for post-pandemic years in Daneel et al. (2023). This approach has at least four problems. First, it requires the analyst to decide which years are outliers, as in Section 3.1. While this can be relatively straightforward for univariate time indices, it is far from simple in multivariate cases; see Section 5, in particular Figures 6 and 7.
Second, weights are manually chosen and therefore arbitrary. This is a potential issue for value-at-risk assessments, which involve repeated simulation and refitting of mortality models. Such repeated model recalibrations require objective criteria for dealing with outliers automatically.
Third, weighting the log-likelihood does not give an estimate of the size of the outlier effect. This means weighting cannot provide a clean starting point for a forecast if recent observations are outliers.
The final problem with weighting the log-likelihood is that it does not fully address the bias problem. Consider weighting the log-likelihood contribution by $w \in \left( {0,1} \right)$. The weighted contribution to the negative log-likelihood function is then $\rho \left( z \right) = {w \over 2}{z^2}$, which is still an unbounded, quadratically increasing function of the outlier, $z$. Thus, weighting observations is not only arbitrary, but it also leaves intact the distorting potential of severe outliers.
If a model cannot handle a feature of real-world data, then a new model is required. In the following three sections we consider three major classes of stochastic model in common actuarial use and how they can be modified to cope with outliers such as those caused by Covid-19.
4. Univariate Time Indices
4.1. Univariate Mortality Models
Univariate mortality indices are central to several important stochastic projection models, including the model of Lee & Carter (1992):

$$\log {m_{x,y}} = {\alpha _x} + {\beta _x}{\kappa _y}$$
and the Age-Period-Cohort (APC) model:

$$\log {m_{x,y}} = {\alpha _x} + {\kappa _y} + {\gamma _{y - x}}$$
Both models require identifiability constraints in order to be fitted, but both fit and forecast for these models are independent of the choice of linear constraints (Currie, 2020). When fitting such models it is useful to smooth ${\alpha _x}$ and ${\beta _x}$, as this reduces the dimensionality of the models and improves forecasting performance by reducing the risk of crossover at adjacent ages in the forecast (Delwarde et al., 2007).
To forecast mortality under the Lee-Carter and APC models we need to forecast ${\kappa _y}$. We can either use a simple random walk with drift, or a full regression ARIMA model; see Appendix A for the structure and operation of both within an ARIMA framework. Note that a random walk with drift is just an ARIMA(0, 1, 0) model, and that a full ARIMA( $p$, 1, $q$ ) model often fits ${\kappa _y}$ better (Kleinow & Richards, 2016, Table 2). Appendix B.1 considers fitting a random walk with drift in R, while Appendix B.2 considers fitting an ARIMA model around a linear trend.
4.2. Outlier Types
For robust estimation of ARIMA models for univariate forecasting we consider the approach of Chen & Liu (1993), which contains two elements. First, Chen & Liu (1993) proposed objective tests to identify where outliers occur. Second, they proposed tests to identify the type of outlier. The ARIMA-robustifying methodology of Chen & Liu (1993) works for two kinds of forecasting model for ${\kappa _t}$: either a simple random walk with drift, or a regression ARIMA model. Appendix B.3 shows how an outlier effect is co-estimated with the model parameters once an outlier location has been decided.
Chen & Liu (1993) presented tests for four types of outlier: additive outlier (AO), innovation outlier (IO), temporary change (TC) and level shift (LS). Stylised illustrations of each of these are given in Figure 3 for a simple moving-average process, ${Y_t}$, defined as follows:

$${Y_t} = {\varepsilon _t} + {\theta _1}{\varepsilon _{t - 1}}$$
where ${\theta _1}$ is a moving-average coefficient, ${\varepsilon _t}\sim {\rm{N}}\left( {0,0.1} \right)$ and ${\rm{Cov}}\left( {{\varepsilon _i},{\varepsilon _j}} \right) = 0,\forall i \ne j$. Figure 3(a) shows the first 50 values of one simulation of such a process using the R code in Appendix A.
In Figure 3(b) an additive outlier of 1 has been added to the uncontaminated process at $t = 30$, with all other observations unchanged. The AO represents an external contamination at a specific point without further downstream consequences. In Figure 3(c) an innovation outlier of 0.5 has been added at $t = 29$ – since an IO is an integral part of the process, it also affects the series value at $t = 30$. In Figure 3(d) a series of consecutive additive outliers of $0.9 \times {0.7^{t - 30}}$ has been added at $t \ge 30$ to form a temporary change whose impact diminishes over time. Finally, in Figure 3(e) the series has a permanent shift in level of +0.9 for $t \ge 30$.
Chen & Liu (1993) developed statistical tests to identify outliers and classify them according to type. However, while it is possible to identify an outlier anywhere in the series, “it is impossible to empirically distinguish the type of an outlier occurring at the very end of a series” (Chen & Liu, 1993, p. 286). This has direct relevance to present-day modelling – whilst Covid-19 is a detectable outlier, approaches like that of Chen & Liu (1993) cannot tell us the nature of the outlier until we have observations after the end of the pandemic.
In mortality work we seek to robustify by identifying and measuring AO, TC and LS outliers. In contrast, IOs are left in, as they are part of the underlying process. The rationale for this is that occasional winters of heavy mortality are a recurring feature, so we do not want to exclude them. Furthermore, “least squares estimates are less affected by innovation outliers than by additive outliers” (Martin et al., 1983, p. 6). Also, excluding IOs would lead to underestimation of the variance of the ARIMA innovation process, ${\sigma ^2}$; this would be undesirable in actuarial work, since this parameter is a major driver of value-at-risk capital requirements for longevity trend risk (Kleinow & Richards, 2016, Section 7).
4.3. Choice of Critical Value
Outliers are determined by Chen & Liu (1993) with reference to a critical value. This varies by researcher, but many have opted for a critical threshold of 3 standardised deviations (Maronna et al., 2006, p. 6); with a normally distributed sample, only around 0.3% of observations should be that far away from the mean.
Chen & Liu (1993, Section 3) carried out extensive tests of critical values between 2.25 and 3.5 using simulated series with 100 observations. However, the mortality patterns of the 1920s and 1930s are not generally pertinent to modern actuarial work, so actuaries typically use much shorter time series of around 50 years. As a result, it is often best to use a critical threshold greater than 3. Consider the example of a Lee-Carter model fitted to the mortality of females in England and Wales over 1971–2020. The best-fitting ARIMA model for ${\kappa _y}$ using the Chen & Liu (1993) outlier-identification approach leads to the results shown in Table 1.
Using a critical threshold of 3, the methodology of Chen & Liu (1993) finds outliers in 10 out of 50 years, which is excessive. This suggests that a higher critical threshold is required for univariate mortality models. In contrast, using a critical threshold of 3.5 for females in England and Wales identifies the one expected outlier in 2020. This is supported by the test results for simple AR(1) and MA(1) models in Chang et al. (1988, Tables 1 and 2), where a critical value of 3.5 also worked well for a series length of 50. Once again, there is a specific actuarial reason to use a higher critical threshold – Table 1 shows that excluding too many observations leads to an artificially low estimate of ${\sigma ^2}$. Since ${\sigma ^2}$ is a main driver of value-at-risk capital requirements for longevity trend risk (Kleinow & Richards, 2016, Section 7), too low a critical value would lead to understated capital requirements.
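For readers wishing to experiment, the tsoutliers package for R provides a publicly available implementation of the Chen & Liu (1993) procedure. The sketch below is a minimal example rather than the code behind Table 1; the vector kappa (the fitted period index for 1971–2020) is an assumption and is not supplied here.

```r
# Minimal sketch using the 'tsoutliers' implementation of Chen & Liu (1993).
# 'kappa' is assumed to hold the fitted Lee-Carter period index, 1971-2020.
library(tsoutliers)
kappa_ts <- ts(kappa, start = 1971)

# Search only for AO, TC and LS outliers (IOs are deliberately left in; see
# Section 4.2), with the higher critical value of 3.5 suited to ~50 years.
fit <- tso(kappa_ts, types = c("AO", "TC", "LS"), cval = 3.5)
fit$outliers  # detected outliers: type, time and t-statistic
fit$yadj      # cleaned series, giving a robust starting point for forecasts
```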
A critical value of 3.5 performed well in identifying an AO of five standardised deviations in Chang et al. (1988, Table 4), and the $t$-values in Table 1 suggest that the Covid-19 mortality in 2020 was also a five-sigma event, i.e. an event five standard deviations away from the mean. However, a true five-sigma event should occur only about once every few million years, and yet the two mortality spikes due to Covid-19 in 2020 and 2021 were no worse than the two mortality spikes due to influenza in 1918 and 1919 (Richards, 2022b, Figure 2). Two “five-sigma” events in 100 years suggest that the model is wrong (either the Lee-Carter structure or the ARIMA assumption is incorrect) or that ${\hat \sigma ^2}$ underestimates the true variance of the innovation process. Either way, it is a reminder that value-at-risk methodologies calibrated to 50 years of data can only set a lower bound for the capital required.
Figure 4 shows the impact of robustifying the regression ARIMA model for forecasting. According to Martin et al. (1983, p. 2), “outliers can (lead to) incorrect model identification (…) and seriously impede the construction of forecasts based upon these historical data.” We see an example of this in Figure 4(b), where the unrobustified forecast for males is rendered nonsensical due to the bias in the likelihood caused by Covid-19 mortality. However, the robustified forecast has a more sensible starting point and direction.
5. Multivariate Time Indices
In this section we look at stochastic mortality models that forecast using a multivariate random walk with drift. There are two broad families: (i) Cairns-Blake-Dowd (CBD) models, which assume a Gompertz mortality pattern with age, and (ii) Tang-Li-Tickle (TLT) models, which use Hermite splines to allow for slower mortality increases at advanced ages. Both families are well suited to actuarial work because they naturally extrapolate mortality rates to ages above the maximum age of the calibrating data set.
5.1. Cairns-Blake-Dowd Models
Table 2 gives an overview of four CBD models. The missing model, M8, is excluded as it tends to produce unstable forecasts (Cairns et al., 2011, Section 6), although as a three-dimensional random walk with drift it could be robustified in the same way as M7 and M9.
The parameter naming convention in Table 2 is kept consistent with Cairns et al. (2006) and Cairns et al. (2009), where the various $\kappa $ terms form the bivariate and trivariate random walks with drift for forecasting. However, it is worth emphasising that these are dependent parameters, whose values and roles depend on the other parameters in the model. This is particularly the case for M9, where the age term, ${\alpha _x}$, changes the shape, scale and nature of the $\kappa $ terms compared to the other three models.
The models in Table 2 are all fitted as Generalised Linear Models (GLMs) with a Poisson assumption for the number of deaths at each combination of age and year (Currie, 2016). We ignore the over-dispersed nature of the death counts, as over-dispersion does not bias estimated means. Djeundje & Currie (2011) discuss over-dispersed mortality counts in more detail.
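To make the GLM formulation concrete, the sketch below fits M5 from Table 2 using R’s built-in glm() function. The data frame df and its column names are assumptions for illustration; a production implementation would follow Currie (2016).

```r
# Minimal sketch: M5 as a Poisson GLM with log link and log-exposure offset.
# Assumes a data frame 'df' with columns d (deaths), Ec (central exposure),
# x (age) and y (year) - these names are illustrative, not from the paper.
df$xc <- df$x - mean(unique(df$x))   # age centred at the mean age

fit <- glm(d ~ factor(y) + factor(y):xc - 1,
           family = poisson(link = "log"),
           offset = log(Ec), data = df)
# coef(fit) then contains kappa_0 (the factor(y) terms) and kappa_1
# (the factor(y):xc interaction terms) for each year.
```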
Many stochastic mortality models require identifiability constraints (Currie, 2020), but M5 does not require any. M6, M7 and M9 contain a cohort term, ${\gamma _{y - x}}$, which ordinarily requires two or more identifiability constraints. However, following Richards et al. (2019, Appendix B) we do not estimate cohort terms with fewer than four observations, which means no identifiability constraints are required for M6 or M7. This contrasts with the implementation in Cairns et al. (2009), where two identifiability constraints were used for M6 and three for M7. If we drop cohort terms with fewer than four observations, the only CBD model requiring identifiability constraints is M9, which needs three linear constraints. Here we use the following for M9:

$$\sum\limits_c {{\gamma _c}} = 0,\qquad \sum\limits_c {c{\gamma _c}} = 0,\qquad \sum\limits_c {{c^2}{\gamma _c}} = 0$$

where $c = y - x$ indexes year of birth.
There are nevertheless circumstances in which we might want to impose more constraints than are mathematically necessary for identifiability. One such reason is to obtain parameter estimates that are forecastable. Figure 5 shows some parameter estimates under minimal identifiability constraints, and these patterns are not consistent with a multivariate random walk with drift. In this paper we therefore over-constrain our M9 models – Figure 6 shows that additional constraints on ${\gamma _{y - x}}$, although mathematically unnecessary for identifiability, nevertheless produce parameter estimates more consistent with the forecasting assumption of a multivariate random walk with drift. Currie (2020) provides an extensive treatment of identifiability constraints in linear models, including tests for determining the minimum number of identifiability constraints required.
For forecasting we define $\kappa $ as an ${n_y} \times p$ matrix composed of the relevant parameters in Table 2. For example, for the M9 model using data for 1971–2020 we would have the following $50 \times 3$ matrix:

$$\hat \kappa = \begin{pmatrix} {{\hat \kappa }_{0,1971}} & {{\hat \kappa }_{1,1971}} & {{\hat \kappa }_{2,1971}} \\ \vdots & \vdots & \vdots \\ {{\hat \kappa }_{0,2020}} & {{\hat \kappa }_{1,2020}} & {{\hat \kappa }_{2,2020}} \end{pmatrix}$$
We will confine ourselves to robustifying the multivariate random-walk element of models M5, M6, M7 and M9, which allows the forecasting of existing cohorts. The projection of ${\gamma _{y - x}}$ under M6, M7 and M9 does not require robustification. We note that such outlier cohorts as have been identified in raw data (Cairns et al., 2015) are largely dealt with by the methods protocol used by the HMD. See Boumezoued (2021) for details on how outlier cohorts arise from sudden shifts in the birth distribution, and how fertility data can be used to improve exposure estimates.
5.2. Tang-Li-Tickle Models
A new class of stochastic mortality models was introduced by Tang et al. (2022). Instead of the linear Gompertz assumption in CBD models, TLT models use Hermite splines. With $u = \left( {x - {x_0}} \right)/\left( {{x_1} - {x_0}} \right)$, the four cubic Hermite basis functions are:

$${h_{00}}\left( u \right) = \left( {1 + 2u} \right){\left( {1 - u} \right)^2},\quad {h_{10}}\left( u \right) = u{\left( {1 - u} \right)^2},\quad {h_{01}}\left( u \right) = {u^2}\left( {3 - 2u} \right),\quad {h_{11}}\left( u \right) = {u^2}\left( {u - 1} \right)$$
where ${x_0}$ is the minimum modelled age and ${x_1}$ is the maximum modelled age. Often these are set to the minimum and maximum age of the calibrating data set, but they can lie outside this range. For example, although the maximum age in the England and Wales data set in this paper is 105, we can set ${x_1} = 120$ to extrapolate mortality rates to this age. With the Hermite basis splines we can define the Tang-Li-Tickle family of models in Table 3.
The models in Table 3 are all fitted as GLMs and do not require identifiability constraints. The TLT family is new, and so has not yet been extended to include cohort terms. Like CBD models, TLT models are useful actuarially because they can extrapolate fitted mortality rates beyond the highest age of the calibrating data set (Richards, 2023, Figure 1).
As with the CBD models in Table 2, the models in Table 3 are also forecast using multivariate random walks with drift: HS1 and HS5 are bivariate, HS2 and HS3 are trivariate and HS4 is tetravariate. In practice there is often such a weak or non-existent trend in some of these variables that it makes sense to impose a common value across time. This is done for HS5, which is a version of HS2 with ${\omega _y} = \omega $ , i.e. turning a trivariate forecast into a bivariate one. HS5 is essentially the Hermite-spline analogue of M5 in Table 2.
For forecasting we define $\kappa $ as an ${n_y} \times p$ matrix composed of the relevant parameters in Table 3. For example, for the HS2 model using data for 1971–2020, $\kappa $ would be a $50 \times 3$ matrix analogous to the M9 example in Section 5.1.
5.3. Outlier Detection for Multivariate Random Walks
Forecasting for the CBD and TLT families is performed as a $p$-dimensional random walk with drift. This means that the row differences in $\hat \kappa $ between any two consecutive years are assumed to have a multivariate normal distribution with mean vector $\mu $ and covariance matrix ${\boldsymbol{\Sigma }}$, neither of which varies with the years in question, i.e. ${\rm{\Delta }}\hat \kappa \sim {\rm{MVN}}\left( {\mu, {\boldsymbol{\Sigma }}} \right)$.
The detection of outliers in multivariate data can be tricky – “it is quite possible for data to be outliers in multivariate space, but not outliers in any of the original univariate dimensions” (Hadi et al., 2009, p. 57). A possible example of this is shown for M9 in Figure 6 – there is no trace of an outlier in ${\hat \kappa _{1,2020}}$ or ${\hat \kappa _{2,2020}}$, with only a weak suggestion of a possible outlier in ${\hat \kappa _{0,2020}}$. This is an interesting contrast to the univariate example in Figure 17, where the outlier in 2020 is quite clear.
Since model parameters are dependent on each other, the real question is whether the triplet $\left( {{{\hat \kappa }_{0,2020}},{{\hat \kappa }_{1,2020}},{{\hat \kappa }_{2,2020}}} \right)$ is an outlier. Figure 7 shows a scatterplot of ${\rm{\Delta }}\hat \kappa $, which further illustrates the difficulty of visual inspection – is 2020 the outlier, or is 1977? Or are there any outliers at all?
One approach to multivariate random walks is to use the Mahalanobis distance, ${D_j}$, for a $p$-dimensional observation, ${z_j} = {\rm{\Delta }}{\kappa _j}$:

$${D_j} = {\left( {{z_j} - \hat \mu } \right)^{\rm{T}}}{{\boldsymbol{\widehat \Sigma }}^{ - 1}}\left( {{z_j} - \hat \mu } \right) \qquad \left( {14} \right)$$

where $\hat \mu $ and ${\boldsymbol{\widehat \Sigma }}$ are the estimated mean vector and covariance matrix of ${\rm{\Delta }}\hat \kappa $.
Figure 8 plots the distance measures for ${\rm{\Delta }}\hat \kappa $ . If there are no outliers, and ${z_j}$ has a normal distribution, then ${D_j}\sim \chi _p^2$ , where $p$ is the number of columns in $\hat \kappa $ . The upper 5% quantile of $\chi _3^2$ is 7.815, so at face value it looks like 2020 is not an outlier for females in England and Wales under M9.
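The distances in Figure 8 can be computed with base R’s mahalanobis() function. In the sketch below, kappa_hat is an assumed ${n_y} \times p$ matrix of fitted period parameters; it is not supplied by this paper.

```r
# Squared Mahalanobis distances (Equation (14)) for the year-on-year
# differences of the fitted period parameters.
dk <- diff(kappa_hat)                       # (n_y - 1) x p matrix of differences
D  <- mahalanobis(dk, center = colMeans(dk), cov = cov(dk))

# Compare against the upper 5% quantile of chi-squared with p degrees of
# freedom (7.815 for p = 3) to flag candidate outlier years.
years <- seq_len(nrow(dk)) + 1971
data.frame(year = years, D = D)[D > qchisq(0.95, df = ncol(dk)), ]
```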
However, Hadi et al. (2009, p. 60) noted that the “Mahalanobis distance is not robust, as it is affected by masking and swamping.” Masking is the phenomenon whereby an outlier is hidden because it inflates the estimate of variance used to detect outliers. Swamping is the phenomenon whereby non-outliers have a large Mahalanobis distance because an outlier has distorted the mean of the process. We therefore have an example of masking in Figure 8, because the estimate ${\boldsymbol{\widehat \Sigma }}$ used in Equation (14) has been distorted by the presence of the very outlier we suspect is there. This is a particular issue for insurers, since a distorted ${\boldsymbol{\widehat \Sigma }}$ is likely to lead to excessive capital requirements. Hadi (1992) presented an approach to identifying multivariate outliers, later updated in Hadi (1994), which used robust measures of the mean and covariance matrix to minimise the risk of masking and swamping.
More recently, Galeano et al. (2006) presented a methodology for outlier detection in multivariate data using projection pursuit; see Huber (1985) for an early introduction to this topic. Like the Mahalanobis distance, projection pursuit reduces a multidimensional problem to a univariate one. Galeano et al. (2006) sought mappings from a vector ARMA (VARMA) model to a univariate ARMA one, in particular the mappings that maximise and minimise the kurtosis coefficient. This is done because the kurtosis coefficient, being based on the fourth moment, is even more sensitive to outliers than the quadratic function of Equation (1). Illustrations of applying Galeano et al. (2006) to some M9 models are given in Figure 9.
Galeano et al. (2006) considered far more complicated VARMA and VARIMA models, with more dimensions and longer time series than actuaries face with CBD and TLT models; Appendix C examines some of the limitations of this methodology when applied to the shorter time series with fewer dimensions that are typical of stochastic mortality work.
6. 2D Penalty Projections
The last of our three classes of projection is the penalty method of the 2D age-period (2DAP) penalised-spline model of Currie et al. (2004). This 2DAP model was extended to include period shocks by Kirkby & Currie (2010). In the models fitted in this section we further extend Kirkby & Currie (2010) to allow for over-dispersion.
Kirkby & Currie (2010, Equation 2.9) defined the 2DAP model as a generalised linear array model (GLAM):

$${\boldsymbol{D}}\sim{\rm{Poisson}}\left( {{{\boldsymbol{E}}^c} \circ {\boldsymbol{M}}} \right),\qquad \log {\boldsymbol{M}} = {{\boldsymbol{B}}_a}{\bf{\Theta }}{\boldsymbol{B}}_y^{\rm{T}} \qquad \left( {15} \right)$$
where ${\boldsymbol{D}}$ is the matrix of death counts, ${\boldsymbol{M}}$ is the matrix of mortality rates indexed by age in the rows and by calendar year in the columns, ${{\boldsymbol{E}}^c}$ is the corresponding matrix of population exposures, $ \circ $ denotes element-wise multiplication, ${{\boldsymbol{B}}_a}$ is a $B$-spline basis matrix for age, ${{\boldsymbol{B}}_y}$ is a $B$-spline basis matrix for time, and ${\bf{\Theta }}$ is a matrix of regression coefficients.
Fitting the GLAM is done by expressing Equation (15) in vector form:

$$\log \,{\rm{vec}}\left( {\boldsymbol{M}} \right) = \left( {{{\boldsymbol{B}}_y} \otimes {{\boldsymbol{B}}_a}} \right)\boldsymbol{\theta} \qquad \left( {16} \right)$$
where the ${\rm{vec}}()$ function stacks the columns of a matrix into a single vector, $ \otimes $ is the Kronecker product and $\boldsymbol \theta = {\rm{vec}}\left( {\bf{\Theta }} \right)$. In the fitting procedure, the coefficients in ${\bf{\Theta }}$ are simultaneously smoothed in the age and time dimensions by suitable penalty functions. The problem with a mortality shock is that smoothing in the time dimension is no longer sensible. Kirkby & Currie (2010) therefore extended the model in Equation (16) to include period shocks as follows:

$$\log \,{\rm{vec}}\left( {\boldsymbol{M}} \right) = \left( {{{\boldsymbol{B}}_y} \otimes {{\boldsymbol{B}}_a}} \right)\boldsymbol{\theta} + \left( {{{\boldsymbol{I}}_{{n_y}}} \otimes {{\boldsymbol{\breve B}}_s}} \right)\boldsymbol{\breve \theta} \qquad \left( {17} \right)$$
where ${{\boldsymbol I}_{{n_y}}}$ is the identity matrix for ${n_y}$ years, ${{\boldsymbol {\breve B}}_s}$ is a $B$ -spline basis in age and $\boldsymbol {\breve \theta}$ is a vector of shock coefficients. ${{\boldsymbol {\breve B}}_s}$ and ${\boldsymbol B}_a$ are both basis matrices in age, but ${{\boldsymbol {\breve B}}_s}$ typically has fewer knots than ${{\boldsymbol{B}}_a}$ . When applied to the mortality of females in England and Wales, Equation (17) reveals the period shocks shown in Figure 10.
Figure 10 shows that the model of Kirkby & Currie (2010) not only picks up the Covid-19 mortality shock of 2020, but also various minor period shocks since 1971. However, for robustification we are arguably less interested in the minor shocks and would prefer to smooth the most trivial ones towards zero. Kirkby & Currie (2010) therefore proposed an extension whereby the amount of smoothing varies by a scaling factor defined by a given year’s shock size relative to the largest shock. Thus, if the basic smoothing parameter is ${\lambda _s}$, the scaled smoothing parameter for year $i$, ${\lambda _i}$, is then:

$${\lambda _i} = {\lambda _s}{\left( {{{\mathop {\max }\limits_j \,{s_j}} \over {{s_i}}}} \right)^\alpha } \qquad \left( {18} \right)$$

where ${s_i}$ denotes the size of the estimated shock in year $i$.
Figure 11 shows the inverse relative scaling factors ( ${\lambda _s}/{\lambda _i}$ ) for optimised scaling with $\alpha = 3.24$. The factor ${\lambda _s}/{\lambda _i}$ is small for all years relative to 2020, showing that all years bar 2020 have heavy smoothing applied.
Equation (18) gives rise to four smoothing options: (i) no scaling ( $\alpha = 0$ ), as used in the model behind Figure 10; (ii) simple scaling ( $\alpha = 1$ ); (iii) fixed scaling, where $\alpha $ is set to a given value; and (iv) optimised scaling, where $\alpha $ is set by minimising an information criterion such as the BIC. Figure 12 shows the period shocks under optimised scaling, which shows that minor period effects are smoothed towards zero, leaving a focus on the most significant (and otherwise most distorting) shocks. One result is that the magnitude of the 2020 shock is larger in Figure 12 than in Figure 10. Another feature of Figure 12 is that the 2020 shock reduces in size with increasing age. This has parallels with the age-dependent shocks shown in Kirkby & Currie (2010, Figure 5), where the excess mortality of the 1919 influenza pandemic was dramatically higher at ages 20–40 than at post-retirement ages. The reducing shock mortality with increasing age in 2020 contrasts with the pattern of increasing excess mortality with age for the minor shocks in Figure 12. The minor shocks are therefore not just smaller in magnitude, but also qualitatively different.
The use of the 2D period-shock model removes the distorting influence of the Covid-19 pandemic. This produces more sensible forecasts without the undue influence of outliers, as shown in Figure 13.
7. Conclusions
The Covid-19 mortality shock of 2020 created outliers in the population mortality data of many countries. These outliers bias parameter estimates in mortality projection models, thus affecting central forecasts and value-at-risk assessments of insurer capital requirements for longevity risk. The short-term solution of ignoring the 2020 and 2021 experience works only as long as those years are at the trailing edge of the data set. As the Covid-affected years move towards the middle of the time series, other approaches are needed, especially objective procedures that can be automated for value-at-risk assessments.
Determination of outliers by visual inspection is unreliable, as how an outlier affects parameter estimation is model-dependent. Simplistic solutions like weighting the log-likelihood are similarly subjective, while also failing to insulate parameter estimates from the bias caused by extreme values. We therefore require objective procedures to (i) identify outliers and – where possible – classify them, (ii) estimate outlier effects alongside forecasting parameters to reduce bias, and (iii) calculate a robust starting point for forecasts where recent observations contain outliers. We outline such objective methods for three classes of stochastic mortality-forecasting models: (i) the approach of Chen & Liu (1993) for univariate mortality indices, (ii) the approach of Galeano et al. (2006) for multivariate mortality indices, and (iii) the approach of Kirkby & Currie (2010) for 2D smoothed models. In each case we illustrate how these methods co-estimate parameters and outlier effects, resulting in robust forecasts.
Acknowledgements
The author thanks Gavin Ritchie and Torsten Kleinow for helpful comments on earlier drafts of this paper. Any errors or omissions remain the responsibility of the author. Calculations were performed using the Projections Toolkit (Longevitas Development Team, 2024) and bespoke programs written in R (R Core Team, 2021). Graphs were produced in tikz (Tantau, 2024), pgfplots (Feuersänger, 2015) and R, while typesetting was done in LaTeX.
Disclosure
The author is a director of Longevitas Ltd, and has a financial interest in the Projections Toolkit.
Appendices
Appendix A. R Code to Simulate Outlier Types
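A minimal sketch of such a simulation follows, consistent with the description of Figure 3 in Section 4.2. The white-noise variance of 0.1 is as stated there; the MA(1) coefficient of 0.5 is an assumption for illustration only.

```r
# Sketch of the simulation behind Figure 3. The white-noise variance of 0.1
# follows Section 4.2; the MA(1) coefficient of 0.5 is an assumption.
set.seed(1)
n   <- 50
eps <- rnorm(n + 1, mean = 0, sd = sqrt(0.1))
Y   <- eps[2:(n + 1)] + 0.5 * eps[1:n]     # uncontaminated MA(1) series, Y_t

t <- 1:n
Y_AO <- Y; Y_AO[30] <- Y_AO[30] + 1        # additive outlier at t = 30 only

eps_IO <- eps; eps_IO[30] <- eps_IO[30] + 0.5  # innovation outlier at t = 29...
Y_IO <- eps_IO[2:(n + 1)] + 0.5 * eps_IO[1:n]  # ...which also affects t = 30

Y_TC <- Y + ifelse(t >= 30, 0.9 * 0.7^(t - 30), 0)  # decaying temporary change
Y_LS <- Y + ifelse(t >= 30, 0.9, 0)                 # permanent level shift
```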
Appendix B. AR(I)MA Models for Univariate Mortality Indices
B.1. ARMA Models for Differences
From an outset year, $y$ , we denote by ${\kappa _{y + t}}$ the value of a univariate mortality index in year $y + t,t \ge 0$ . Let $\left\{ {{X_t}} \right\}$ be the stochastic process of the first differences of $\left\{ {{\kappa _{y + t}}} \right\}$ , i.e. ${X_t} = {\rm{\Delta }}{\kappa _{y + t}}$ .
We define an autoregressive moving-average (ARMA) model for ${X_t}$ as follows:

$${X_t} = {\phi _1}{X_{t - 1}} + \cdots + {\phi _p}{X_{t - p}} + {\varepsilon _t} + {\theta _1}{\varepsilon _{t - 1}} + \cdots + {\theta _q}{\varepsilon _{t - q}} \qquad \left( {{\rm{B}}1} \right)$$
with autoregressive parameters $\left\{ {{\phi _i},i = 1, \ldots, p} \right\}$ and moving-average parameters $\left\{ {{\theta _i},i = 1, \ldots, q} \right\}$. $\left\{ {{\varepsilon _t}} \right\}$ is a white-noise process where each ${\varepsilon _t}$ is i.i.d. N(0, ${\sigma ^2}$). Equation (B1) is for an ARMA( $p$, $q$ ) model without a mean. An ARMA( $p$, $q$ ) model for a process $\left\{ {{X_t}} \right\}$ with a mean $\mu $ is defined as follows:

$${X_t} - \mu = {\phi _1}\left( {{X_{t - 1}} - \mu } \right) + \cdots + {\phi _p}\left( {{X_{t - p}} - \mu } \right) + {\varepsilon _t} + {\theta _1}{\varepsilon _{t - 1}} + \cdots + {\theta _q}{\varepsilon _{t - q}} \qquad \left( {{\rm{B}}2} \right)$$
$\mu $ in Equation (B2) is a constant drift term and represents the long-term tendency for ${\kappa _{y + t}}$ to change roughly linearly, albeit with potentially long meanders around the linear trend caused by the autoregressive and moving-average parameters.
We further define the backshift operator, $B$, such that ${B^i}{X_t} = {X_{t - i}}$ and ${B^i}{\varepsilon _t} = {\varepsilon _{t - i}}$ for $i \ge 1$. ( $B$ is sometimes described as the lag operator and denoted $L$ (Harvey, 1981, p. 26).) For conciseness we can define polynomials in $B$ as $\phi \left( B \right) = 1 - {\phi _1}B - \cdots - {\phi _p}{B^p}$ and $\theta \left( B \right) = 1 + {\theta _1}B + \cdots + {\theta _q}{B^q}$. Different authors use different signing conventions – see for example Martin et al. (1983, p. 5) – but the signing used here is the same as that used in R’s arima() function. We can then rewrite Equation (B2) more compactly:

$$\phi \left( B \right)\left( {{X_t} - \mu } \right) = \theta \left( B \right){\varepsilon _t} \qquad \left( {{\rm{B}}3} \right)$$
We will concern ourselves only with stationary processes, which impose bounds on the permissible parameter values (see Harvey, 1981, pp. 28–35). We illustrate the estimation of the parameters in Equation (B3) using the univariate mortality index in Table B.1, illustrated in Figure B.2. The R command in Figure B.1 fits an ARMA(1, 2) model with a mean to the first differences of a variable kappa by maximum likelihood, where the diff() function calculates the first differences and the arima() function fits the model. The output stage in Figure B.1 shows ${\hat \phi _1}$ as ar1, ${\hat \theta _1}$ as ma1, ${\hat \theta _2}$ as ma2 and ${\hat \sigma ^2}$ as sigma2. Somewhat confusingly, the mean, $\hat \mu $, is labelled intercept in R’s output, a topic we return to in Section B.2.
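The command in Figure B.1 is of the following form (a sketch, assuming kappa is a numeric vector of the index in Table B.1):

```r
# Fit an ARMA(1, 2) model with a mean to the first differences of kappa.
fit <- arima(diff(kappa), order = c(1, 0, 2))
fit  # shows ar1, ma1, ma2, the mean (labelled 'intercept') and sigma^2
```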
B.2. Regression ARIMA Models
Equation (B3) is a model for $\left\{ {{X_t}} \right\}$, the first differences of $\left\{ {{\kappa _{y + t}}} \right\}$. This is equivalent to a regression ARIMA model for ${\kappa _{y + t}}$. Such models are termed REGARIMA models by Maronna et al. (2006, p. 300). The regression ARIMA model can be fitted to the $\kappa $ variable directly by specifying an external regressor vector containing $1,2, \ldots, {n_y}$ as shown in Figure B.3.
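The command in Figure B.3 is of the following form (again a sketch):

```r
# A regression ARIMA(1, 1, 2) model for kappa itself, with the linear trend
# supplied as an external regressor.
ny  <- length(kappa)
fit <- arima(kappa, order = c(1, 1, 2), xreg = 1:ny)
fit  # the drift estimate appears as the coefficient of the regressor '1:ny'
```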
A comparison of Figures B.1 and B.3 shows that an ARMA( $p$, $q$ ) model with a mean for the first differences of $\left\{ {{\kappa _{y + t}}} \right\}$ is equivalent to a linear regression model for $\left\{ {{\kappa _{y + t}}} \right\}$ with ARIMA( $p$, 1, $q$ ) errors. The drift estimate, $\hat \mu $, is labelled intercept in Figure B.1, but in Figure B.3 it is the coefficient of the external regressor vector 1:ny, i.e. the slope of the assumed linear trend in ${\kappa _{y + t}}$. It is somewhat less than helpful that the same drift term, $\hat \mu $, variously goes by the labels “mean”, “intercept” and “slope”, but the labels stem from the model context. For example, we can re-arrange Equation (B3) as follows:

$${X_t} = \mu + {{\theta \left( B \right)} \over {\phi \left( B \right)}}{\varepsilon _t}$$
If we start at ${\kappa _y}$, the $t$-steps-ahead forecast, ${\kappa _{y + t}}$, is then ${\kappa _y}$ plus the cumulative sum of the next $t$ differences:

$${\kappa _{y + t}} = {\kappa _y} + \sum\limits_{i = 1}^t {{X_i}} = {\kappa _y} + t\mu + \sum\limits_{i = 1}^t {{{\theta \left( B \right)} \over {\phi \left( B \right)}}{\varepsilon _i}} $$
from which we can see that ${\kappa _{y + t}}$ is composed of a linear trend, ${\kappa _y} + t\mu $, plus a complex ARMA error process. Thus, the mean (intercept) of the ARMA process for $\left\{ {{X_t}} \right\}$ becomes the slope of the linear trend for $\left\{ {{\kappa _{y + t}}} \right\}$ in the regression ARIMA model. For a regression ARIMA( $p$, 1, $q$ ) model, Equation (B3) can therefore be written directly in terms of ${\kappa _{y + t}}$ as:

$$\phi \left( B \right)\left( {1 - B} \right)\left( {{\kappa _{y + t}} - {\mu _t}} \right) = \theta \left( B \right){\varepsilon _t}$$
where ${\mu _t} = t\mu $. Note that it is common to fit ARIMA models by recasting them in state-space form and applying a Kalman filter; see Martin et al. (1983, Section 5.1) or Harvey (1981, Chapter 3). Indeed, once the model is in state-space form it would be possible to allow ${\mu _t}$ to vary more flexibly in time as a full dynamic linear model (Petris et al., 2009), but this is beyond the scope of this paper.
B.3. Regression ARIMA Models with Outliers
We now consider the fitting of a regression ARIMA model for $\left\{ {{\kappa _{y + t}}} \right\}$ where there is contamination with one or more outliers. Table B.2 is equivalent to Table B.1 but with the addition of an extra year’s data. Figure B.4 shows the outlier in 2020 caused by the first Covid-19 shock in April and May 2020.
The outlier in 2020 has distorted the model fit in Figure B.5 considerably, as all parameter estimates have taken very different values compared to Figure B.3. To fit a regression ARIMA model for $\left\{ {{\kappa _{y + t}}} \right\}$ with allowance for an outlier in 2020, we define a matrix of external regressors, XREG. The first column of XREG is the linear trend 1:ny as before, whereas the second column is an indicator variable taking the value 1 in 2020 and 0 in all other years. Further outliers in other years can be added similarly as additional columns of XREG. The R commands to fit this outlier-aware model are shown in Figure B.6, along with the output.
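The commands in Figure B.6 are of the following form (a sketch; the years of Table B.2 are assumed to run 1971–2020):

```r
# Co-estimate the 2020 outlier effect alongside the ARIMA parameters via a
# second external regressor.
years <- 1971:2020                       # assumed span of Table B.2
ny    <- length(years)
XREG  <- cbind(1:ny, as.numeric(years == 2020))
fit   <- arima(kappa, order = c(1, 1, 2), xreg = XREG)
fit   # drift appears as 'XREG1', the 2020 outlier effect as 'XREG2'
```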
In Figure B.6 the estimated drift term, $\hat \mu $, now appears as XREG1, while the estimate of the outlier effect of the 2020 Covid-19 mortality appears as XREG2. Note that the estimates ${\hat \phi _1}$, ${\hat \theta _1}$, ${\hat \theta _2}$, $\hat \mu $ and ${\hat \sigma ^2}$ are little changed between Figures B.3 and B.6, showing that co-estimation of the outlier effect along with the ARIMA parameters eliminates the distortion demonstrated in Figure B.5.
Finally, for forecasting we need to robustify the starting point. In the case of the data in Table B.2, the cleaned value is ${\hat \kappa _{2020}}$ minus XREG2, i.e. $ - 0.1669 - 0.0631 = - 0.2300$ .
Appendix C. Multivariate Outlier Detection in R
R offers an implementation of Galeano et al. (2006) in the outliers.hdts() function in the SLBDD package. This is intended for multivariate data generated by a variety of vector ARMA (VARMA) and vector ARIMA (VARIMA) processes, potentially with a large number of dimensions. In this appendix we look at the performance of the outliers.hdts() function when applied to the multivariate random walks in the CBD and TLT families of stochastic mortality models.
Galeano et al. (2006, Table 2) provide the empirical critical thresholds for eight specimen models for time series of various lengths. One issue for actuaries is that the shortest time series tested has 50 observations, whereas this would typically be close to the maximum relevant length for actuarial work. A second issue is that none of the eight models tested corresponds to a multivariate random walk with drift. Finally, the outliers.hdts() function uses a hard-coded critical value derived from the square root of a power of the upper 5% value of a $\chi _1^2$ distribution. We are therefore interested in the performance of the outliers.hdts() function with the shorter series of fewer dimensions that are likely to be encountered in actuarial work.
We fit M5 and M7 models to the mortality experience of males aged 50–105 in England & Wales over the period 1971–2019. The estimated parameters of the bivariate random walk with drift for ${\hat \kappa _{0,y}}$ and ${\hat \kappa _{1,y}}$ in the M5 model are:
and the estimated parameters of the trivariate random walk with drift for ${\hat \kappa _{0,y}}$ , ${\hat \kappa _{1,y}}$ and ${\hat \kappa _{2,y}}$ in the M7 model are:
We simulate the random walks with drift for a term of 50 years, first by simulating the differenced series (the VARMA process) then calculating the cumulative sum (the VARIMA process). We then call outliers.hdts() for both the differenced and undifferenced series using the full 50-observation series, but also for the first 25, 30, 35, 40 and 45 values to check the function performance with shorter series. The proportions of simulated series erroneously identified as containing outliers (the false positives) are shown in Table C.1.
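The simulation step can be sketched as follows; MASS::mvrnorm() is our choice of multivariate normal generator, not necessarily the one used for Table C.1.

```r
# Draw multivariate normal increments with fitted drift mu and covariance
# Sigma, then cumulate to recover the random walk.
library(MASS)  # for mvrnorm()

simulate_rw <- function(n, mu, Sigma) {
  dk <- mvrnorm(n, mu = mu, Sigma = Sigma)    # differenced (VARMA) series
  list(differenced   = dk,
       undifferenced = apply(dk, 2, cumsum))  # cumulated (VARIMA) series
}
```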
Table C.1 shows that the outliers.hdts() function performs radically differently depending on whether the differenced or undifferenced series is used. The false-positive rate for the differenced series is typically below the hard-coded 5% rate, especially for mortality series of length 40–50 years. On the basis of this limited assessment, the unmodified outliers.hdts() function looks appropriate for application to the multivariate random walks of CBD and TLT stochastic mortality models, but only when passed the differenced series. In practice, one can edit the source code for outliers.hdts() and related functions to vary the p-value from the hard-coded 5%. However, Table C.1 shows that the false-positive rates of outliers.hdts() vary strongly, with undifferenced bivariate random walks particularly at risk of producing more false positives than the programmed p-value implies.
Despite its restricted functional interface, the outliers.hdts() function is useful for robustifying multivariate mortality indices. The function returns a cleaned version of the indices, which can be used to calculate robust estimates of the forecasting parameters while also providing robustified starting points for the forecast. Estimates of the multivariate outliers can be obtained by simply subtracting the cleaned series from the original series.
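A minimal sketch of this workflow follows. The name of the returned component holding the cleaned series is an assumption and should be checked against the SLBDD documentation.

```r
# Clean a differenced kappa matrix with outliers.hdts() and recover the
# outlier estimates by subtraction. The component name 'x.clean' is an
# assumption - check ?outliers.hdts in the SLBDD package.
library(SLBDD)
dk  <- diff(kappa_hat)                # differenced series (see Table C.1)
res <- outliers.hdts(dk)
outlier_effects <- dk - res$x.clean   # multivariate outlier estimates
```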