1. Introduction
The use of machine learning in insurance pricing has flourished in recent years and the related literature is increasing accordingly. For example, Guelman (Reference Guelman2012) adopts gradient boosting for auto insurance cost modelling; Spedicato et al. (Reference Spedicato, Dutang and Petrini2018) performs insurance pricing optimisation using several machine learning models; Henckaerts et al. (Reference Henckaerts, Côté, Antonio and Verbelen2021) use tree-based techniques such as GBM in order to produce car insurance tariffs; Schelldorfer & Wüthrich (Reference Schelldorfer and Wüthrich2019) employ neural networks to enhance GLMs performances in non-life insurance and Wüthrich (Reference Wüthrich2020) proposes two different techniques to overcome the unbiasedness of neural network models for insurance portfolios. Machine learning techniques have also found extensive application in the context of insurance claim reserving: Gabrielli et al. (Reference Gabrielli, Richman and Wüthrich2020) improve the performances of over-dispersed Poisson model for general insurance claims reserving by means of neural networks embedding, and Wüthrich (Reference Wüthrich2018) propose several machine learning algorithms for individual claim reserving.
However, such techniques only give an estimation of the expected value of the chosen variable (claim frequency or claim severity), since they are designed to return the pure premium of a specific policy, where the expected value of the total claim amount is given by the product between the expected values of claim frequency and claim severity. Hence, these models, even if they offer insight into the average loss of a policy, are unable to provide the modeler with some valuable information about its potential riskiness, e.g. the quantile of the total claim amount. To overcome this problem a quantile regression approach, originally introduced by Koenker & Bassett (Reference Koenker and Bassett1978), may be considered since it provides information on the whole distribution of a given phenomenon. The quantile regression technique represents a robust distribution-free methodology that has been widely used in the financial literature to compute risk measures like the value-at-risk (see for example Taylor, Reference Taylor2007; White et al., Reference White, Kim and Manganelli2015; Laporta et al., Reference Laporta, Merlo and Petrella2018; Adrian & Brunnermeier, Reference Adrian and Brunnermeier2016; Petrella & Raponi, Reference Petrella and Raponi2019; Taylor, Reference Taylor2020, and Merlo et al., Reference Merlo, Petrella and Raponi2021b).
The quantile regression approach appears particularly suitable in the insurance context, when assessing the Solvency II capital requirements and calculating the premium safety loadings. Furthermore, it enables the insurer to soundly gauge the portfolio riskiness (i.e. computing the Value-at-Risk of a given portfolio). Modeling the quantile claim amount through the quantile regression (QR) has already been discussed by a handful of authors: Kudryavtsev (Reference Kudryavtsev2009) was the first introducing the use of the two-stage QR model to estimate the quantile of the total claim amount; Heras et al. (Reference Heras, Moreno and Vilar-Zanón2018) propose a refinement of the previous model since they take into account heterogeneous claim probabilities, whereas Kudryavtsev (Reference Kudryavtsev2009) only considers a single probability of having claims for each type of policyholder; Baione & Biancalana (Reference Baione and Biancalana2019) propose an alternative two-stage approach, where the risk margin considered in the ratemaking is calibrated on the claim’s severity for each risk class in the portfolio, avoiding some of the drawbacks that characterise the technique proposed by Heras et al. (Reference Heras, Moreno and Vilar-Zanón2018).
The common QR method requires the specification of a predetermined dependence structure between the dependent variable and the covariates or to elaborate its complex functional form to account for non-linearity or interactions among regressors. Unfortunately, the structural form of the dependence is often unknown to the modeler. So, a different approach should be pursued in this context. Neural networks appear to be an interesting modeling technique overcoming these limitations since they can fit a complex data structure without any a priori assumption on the relation among variables.
In this paper, we propose two innovative methods to estimate the conditional quantile of the total claim amount for a group of health policies. The first one uses the quantile regression neural network (QRNN), a particular specification of a feed-forward neural network originally introduced by Taylor (Reference Taylor2000), able to estimate conditional quantiles. This model, up to our knowledge, has never been used in the context of insurance ratemaking.
The second model we propose considers a new extension of the combined actuarial neural network (CANN) proposed by Schelldorfer & Wüthrich (Reference Schelldorfer and Wüthrich2019) in a quantile regression framework. The original CANN formulation is devoted to claim frequency estimation and combines a Poisson GLM with a neural network. In our model, since we are interested in the conditional quantile of the total claim amount, we nest the QR model into the structure of a neural network (Quantile-CANN henceforth). This approach is able to represent additional information incorporated in the data and not captured by the simple QR model.
The structure of the approach here considered is based on a two-part model. A simple, but common and useful version of such model involves a model for a binary indicator variable and a model for the response variable given that the binary indicator takes the value one. Following this approach, we fit a logistic regression for the binary variable to estimate the claim probability while we use a QRNN or a Quantile-CANN to model the quantile of the positive outcome. Using the estimated quantiles of the claim amount, we finally calculate a loaded premium following the quantile premium principle considered in Heras et al. (Reference Heras, Moreno and Vilar-Zanón2018).
To verify the predictive ability of the proposed models, we conduct an empirical analysis using an Italian health insurance dataset, where we compare the performances of the QR, QRNN, and Quantile-CANN. The results highlight that QRNN and Quantile-CANN exhibit better performance in terms of the quantile loss function compared to the classical QR. In particular, we notice that the Quantile-CANN architecture is always able to enhance the estimates given by the QR model. We then exploit the estimates provided by the different models in the calculation of a quantile based insurance premium using the quantile premium principle (QPP) aforementioned. The analysis shows that premiums produced by the Quantile-CANN and QRNN provide a better risk diversification for the portfolio.
The remainder of the paper is structured as follows: Section 2 explores the two-part model considered in this work. Section 3 introduces the use of QRNN and of Quantile-CANN for the estimation of the quantile of the total claim amount. Section 4 presents the empirical application carried out on an Italian health insurance claim dataset. Section 5 concludes.
2. The Two-Part Quantile Model
In this Section, we discuss the two-part quantile regression framework, first considered by Heras et al. (Reference Heras, Moreno and Vilar-Zanón2018), devoted to conditional quantile estimation of the aggregate claim amount $S_i$ at level $\tau$ and for a specific policyholder i. Commonly, two-part models involve a mixture distribution consisting in mixing a discrete point mass, with all mass at zero, and a continuous random variable. In particular, they are described by two equations: a binary choice model is fitted for the probability of observing a positive-versus-zero outcome. Then, conditional on a positive outcome, an appropriate regression model is fitted for the continuous outcome. The common structure of such models assumes that the effect of the covariates influence the mean of the conditional distribution of the response. However, in many real-world applications like the actuarial ones, the effect of the covariates can be different on different parts of the distribution. For instance, the gender of the policyholder may be irrelevant when modeling the average claim severity, while it may be strongly significant when studying the upper quantiles of the claim severity. This idea is supported by the results reported in Heras et al. (Reference Heras, Moreno and Vilar-Zanón2018), where the authors show that the significance of a specific categorical variable may vary across quantiles. For this reason, a quantile regression approach in the two-part model may be appropriate.
In this paper, we focus our attention on the effect of covariates on the quantile of the aggregate claim amount $S_i$ . In order to build the two-part quantile model (see Duan et al., Reference Duan, Manning, Morris and Newhouse1983; Frees, Reference Frees2010, and Merlo et al., Reference Merlo, Maruotti and Petrella2021a), we consider the indicator random variable $\mathbb{I}_{N_{i}}$ measuring whether the policyholder has zero claims or positive claims (where $N_{i}$ is the number of claims submitted by the policyholder). If $N_i>0$ then a positive aggregate claim severity $\tilde{S}_{i}$ is observed.
Consistently with this approach, given a set covariates $\boldsymbol{x}_{i} $ , we model the $\tau$ -th conditional quantile of the total claim amount $Q_{S_{i}}(\tau|\boldsymbol{x}_{i})$ in two stages:
-
the first stage allows to estimate the no claim probability $p_{i}=Pr(\mathbb{I}_{N_{i}}=0)=Pr(N_{i}=0)$ as function of covariates. To achieve this goal, we use the logistic regression:
(1) \begin{equation}\log\left(\frac{1-p_{i}}{p_{i}}\right)=\boldsymbol{x}_i^\prime\boldsymbol{\beta},\end{equation}where the no claim probability is obtained as $p_{i}=\frac{1}{1-exp(\boldsymbol{x}_i^\prime\boldsymbol{\beta})}$ ; -
The second stage uses $p_{i}$ to obtain the $\tau^{\star}_{i}$ conditional quantile level of $\tilde{S}_{i}$ , $Q_{\tilde{S}_{i}}(\tau^{\star}_{i}|\boldsymbol{x}_{i})$ , corresponding to the $\tau$ quantile level defined on the total claim amount $S_{i}$ , $Q_{S_{i}}(\tau|\boldsymbol{x}_{i})$ . Following Heras et al. (Reference Heras, Moreno and Vilar-Zanón2018), the $\tau^{\star}_{i}$ level can be calculated as
(2) \begin{equation}\tau^{\star}_{i}=\frac{\tau-p_{i }}{1-p_{i}}\end{equation}for which(3) \begin{equation}Q_{\tilde{S}_{i}}(\tau^{\star}_{i}|\boldsymbol{x}_{i})=Q_{S_{i}}(\tau|\boldsymbol{x}_{i}).\end{equation}
In the literature, $Q_{\tilde{S}_{i}}(\tau^{\star}_{i}|\boldsymbol{x}_{i})$ is generally calculated using the well known quantile regression approach of Koenker & Bassett (Reference Koenker and Bassett1978). In this paper, we will generalised this approach by introducing two alternative methods, the QRNN, a particular specification of Regression Neural Network introduced by Taylor (Reference Taylor2000), and the Quantile Combined Actuarial Neural Network (Quantile-CANN) which is a new method to calculate quantiles embedding the CANN approach of Schelldorfer & Wüthrich (Reference Schelldorfer and Wüthrich2019) in a quantile regression framework. These approaches allow performing quantile estimation without imposing any predetermined structure for the relations between the claim severity and the related covariates.
3. Quantile Claim Severity Models
The standard approach to tackle the problem of quantile claim severity estimation refers to traditional QR models, see for example Kudryavtsev (Reference Kudryavtsev2009), Heras et al. (Reference Heras, Moreno and Vilar-Zanón2018) and Baione & Biancalana (Reference Baione and Biancalana2019). In this paper, we generalise this approach by proposing the use of neural network models to estimate the conditional quantile of the aggregate claim severity. This approach allows performing these calculations without imposing any predetermined structure for the relations between the aggregate claim severity and the related covariates. In this way, we can capture nonlinear and complex patterns in the data and possible interactions between predictors.
The first model we introduce is QRNN, which is basically a feed-forward neural network minimising the same quantile loss function minimised by a QR model. The second one is a new model that generalises the CANN approach introduced by Schelldorfer & Wüthrich (Reference Schelldorfer and Wüthrich2019) by combining the QR model with a QRNN to improve the model’s performance, we call this model Quantile-CANN.
Since all the aforementioned models are somehow based on quantile regression, we give a light insight on QR models in the next subsection.
3.1. Quantile regression
Quantile regression, originally introduced by Koenker & Bassett (Reference Koenker and Bassett1978), is a distribution free method providing a way to model the conditional quantiles of a response variable with respect to a set of covariates in order to have a more robust and complete picture of the entire conditional distribution with respect to the classical mean regression. Quantile regression approach is quite suitable method used in all the situations where specific features, like skewness, fat-tails, outliers, truncation, censoring, and heteroscedasticity arise. In this section, we show how to calculate $Q_{\tilde{S}_{i}}(\tau^{\star}_{i}|\boldsymbol{x}_{i})$ using QR standard tools, in particular, we analyse how to calculate the quantile of the $\textrm{log}(\tilde{S}_{i})$ . The use of the logarithmic function derives from the need to transform the dependent variable from $\mathbb{R}^{+}$ to $\mathbb{R}$ to apply the QR approach. Moreover, the use of the logarithmic transformation is coherent within the insurance pricing context, since it allows to consider multiplicative tariffs. In addition, by considering the equivariance under monotone transformation property of the quantile, it is possible to retrive the quantity of interest, i.e. $Q_{\textrm{log}(\tilde{S}_{i})}(\tau^{\star}_{i}|\boldsymbol{x}_i )=\textrm{log}(Q_{\tilde{S}_{i}}(\tau^{\star}_{i}|\boldsymbol{x}_i ))$ . For notational simplicity, hereafter, we will use $\tau^{\star}$ instead of $\tau^{\star}_{i}$ in the formulas below.
The quantile regression model can be stated as follows:
where $\boldsymbol{\beta}(\tau^{\star})=(\beta_1(\tau^{\star}),\beta_2(\tau^{\star}),\dots,\beta_{q_{0}}(\tau^{\star}))\in\mathbb{R}^{q_{0}}$ is the vector of unknown regression parameters and $\varepsilon_i$ having the $\tau^{\star}$ -th conditional equal to zero for all $i=1,2,\dots,I$ . The estimation of the of the regression parameters $\boldsymbol{\beta}(\tau^{\star})$ can be obtained by solving the following minimisation problem:
where $\rho_{\tau^{\star}}$ is the quantile loss function defined as
with $\mathbb{I}_{(u)}$ being the indicator function. The conditional quantile of $\tilde{S}_{i}$ is then estimated as $\hat{Q}_{\tilde{S}_{i}}(\tau^{\star}|\boldsymbol{x}_i)=\textrm{exp}(\boldsymbol{x}_i^\prime\hat{\boldsymbol{\beta}}(\tau^{\star}))$ .
In what follows we explore the model proposed based on the neural network approach.
3.2. Quantile Regression Neural Networks
QRNN is a modeling technique introduced by Taylor (Reference Taylor2000) based on the neural networks that enables to estimate the conditional probability distribution of multiperiod financial return within a quantile regression framework. With this approach, it is possible to estimate potential non-linear quantile relations without imposing any distributional assumption or functional relations between dependent and independent variables. Several applications of the methodology have already been implemented in different fields, see for instance Xu et al. (Reference Xu, Deng, Jiang, Sun and Huang2017) and Cannon (Reference Cannon2011). Up to our knowledge, the QRNN methodology has never been considered in an insurance pricing context.
At a deeper insight QRNN model, for a fixed depth $K\in\mathbb{N}$ , and fixed $\tau^{\star}$ , follows a typical feed-forward structure:
where the output of the network is given by the exponential activation function applied to the scalar product between the readout parameter vector $\boldsymbol{\theta}^{(K+1)}$ returning one neuron in the output layer and the composition of the different K hidden layers $\boldsymbol{z}^{(1)}(\boldsymbol{\theta}^{(1)}),\cdots, \boldsymbol{z}^{(K)}(\boldsymbol{\theta}^{(K)})$ , where $\boldsymbol{\theta}^{(1)},\cdots, \boldsymbol{\theta}^{(K)}$ are the parameters belonging to each layer.
The generic s-th hidden layer $\boldsymbol{s}^{(s)}(\boldsymbol{\theta}^{(s)})$ of dimension $q_{s}\in\mathbb{N}$ is defined as
where the j-th neuron in the s-th hidden layer is given by
where $\phi$ is the activation function and $\boldsymbol{\theta}_{j}^{(s)}=({\theta}_{j,0}^{(s)},{\theta}_{j,1}^{(s)},\dots,{\theta}_{j,q_{s-1}}^{(s)})'$ is the vector of parameters belonging to j-th neuron in the s-th hidden layer. Considering the vector of parameters $\boldsymbol{\theta}_{1}^{(s)},\dots,\boldsymbol{\theta}_{q_{s}}^{(s)}$ for each neuron in (8), we can define the matrix of parameters for the s-th hidden layer as $\boldsymbol{\theta}^{(s)}=(\boldsymbol{\theta}_{1}^{(s)},\dots,\boldsymbol{\theta}_{j}^{(s)},\dots,\boldsymbol{\theta}_{q_{s}}^{(s)})'$ of dimension $q_{s}\times (1+q_{s-1})$ Footnote 1 .
Since the network in (7) has K hidden layers, it is possible to denote with $\boldsymbol{\theta}$ the full set of parameters for the network gathering the matrix of parameters of each layer:
with dimension r, where $r=\sum_{s=1}^{K+1}q_{s}(1+q_{s-1})$ .
To obtain the optimal set of parameters $\hat{\boldsymbol{\theta}}$ for (10), we train the network (7) minimising the quantile loss function:
where we fix the starting value of the network parameter $\boldsymbol{\theta}_{0}$ at the beginning of the training.
From (7) and (11), we estimate $\hat{Q}_{\textrm{log}(\tilde{S_{i}})}(\tau^{\star})$ , then given the equivariance to monotone transformation of the quantile function we retrive $\hat{Q}_{\tilde{S}_{i}}(\tau^{\star})=\textrm{exp}(\hat{Q}_{\textrm{log}(\tilde{S}_{i})}(\tau^{\star}))$ .
It is worth noting that if we replace the exp and the $\phi$ activation functions in (7) and in (9) with the linear activation function and we consider no hidden layers, then the QRNN boils down to the classical QR model in (5). For a visualisation of the QRNN architecture, see Figure 1.
3.3. Quantile-CANN
The innovative model we propose in this section to estimate the quantiles for the distribution of interest is an extension of the combined actuarial neural network (CANN) approach proposed by Schelldorfer & Wüthrich (Reference Schelldorfer and Wüthrich2019) and launched in the editorial of Wüthrich & Merz (Reference Wüsthrich and Merz2019). In particular, the CANN framework nests a generic regression model into the neural network architecture to enhance the estimates given by the generic regression model. Following the CANN approach, but in a quantile framework, we boost the QR model with the neural network features proposing the so-called Quantile-CANN model. Our approach allows for exploiting the quantile neural networks to improve the conditional quantile estimates given by a QR model. In such a way, the neural network directly improves the classical QR estimates, preserving the information contained therein.
The main advantage of the Quantile-CANN approach compared to the QRNN is that it combines the flexibility of the networks with the interpretability of a QR approach, so providing higher explainability. According to Wüthrich & Merz (Reference Wüsthrich and Merz2019), when the reference regression model is already close to optimal, its maximum likelihood estimator can be used as initialisation of the neural network fitting algorithm, we use the QR estimator to initialise the network parameter of the Quantile-CANN, then obtaining lower computational time for the network parameters calibration than the QRNN model.
Formally, we define the Quantile-CANN as
where the first term of the right hand side of (12) refers to the QR model in (4) with vector of parameters $\boldsymbol{\beta}(\tau^{\star})$ , while the second term is the QRNN model displayed in (7) (except for the missing exponential activation in the output layer). Therefore, the Quantile-CANN model combines the models discussed in the two previous subsections (3.1 and 3.2) by embedding the QR into the network architecture using a skip connection that links the input given by the QR estimates to the output layer (see Figure 2 for a graphical representation), where the models are merged by summing the two parts as in (12). The network parameter of the Quantile-CANN model is denoted by $\boldsymbol{\vartheta}$ and consists of
The optimal set for the network parameter $\hat{\boldsymbol{\vartheta}}$ of (13) is obtained training the Quantile-CANN (12) minimising the quantile loss function:
The optimisation process to estimate $\boldsymbol{\vartheta}$ in (13) works as follows: we first obtain $\hat{\boldsymbol{\beta}}(\tau^{\star})$ parameters minimising the quantile loss function for the quantile in (5). Then we use such parameters to initialise the network parameter of the Quantile-CANN by considering $\boldsymbol{\vartheta}_{0}=\left \{\hat{\boldsymbol{\beta}}(\tau^{\star}),\boldsymbol{\theta}_{0}\right \}$ , where $\boldsymbol{\theta}_{0}$ is the starting value of the network parameter belonging to the QRNN part of model (12). Therefore, starting from $\boldsymbol{\vartheta}_{0}$ , we optimise $\boldsymbol{\vartheta}$ minimising (14) by means of the gradient descent algorithmFootnote 2 . During the optimisation process, both $\boldsymbol{\theta}_{0}$ and $\hat{\boldsymbol{\beta}}({\tau^{\star}})$ parameters are trained.
Given the optimal set of parameters $\hat{\boldsymbol{\vartheta}}$ , from (12), we estimate
then, the quantile of the claim severity is obtained as $\hat{Q}_{\tilde{S}_{i}}^{CANN}(\tau^{\star})=\textrm{exp}(\hat{Q}_{\textrm{log}(\tilde{S}_{i})}^{CANN}(\tau^{\star}))$ .
Note that if the Quantile-CANN model does not return any improvement with respect to the QR model it means that the latter is already able to capture all the relevant information incorporated in the data.
4. Case Study
4.1. Data description
To assess the ability of the models proposed to capture additional information incorporated in the data, we carry out an empirical case study based on an Italian health insurance claim dataset. The data stem from an Italian insurer and report the claims collected in a general health insurance plan during 2018. The plan provides risk coverage for managers and retired managers belonging to companies of a specific industry in Italy. More specifically, the dataset consists of 132, 499 policyholders (employees or former employees). Since each policyholder can also enroll his relatives (spouse and children below 25 years of age) in the insurance coverage, we totally have 301,405 insured. For each one of them, the following information is available: a binary variable signaling whether the insured submitted at least a claim during the year, total claim severity per year, age, gender, region, firm dimension, and time of coverage permanence (in years). Table 1 provides a summary for the information available in the dataset.
Among the insured, we count a total of 217,006 claimants, the monetary volume of the submitted claims is about euro 243 m.
In Figure 3 and Table 2, we report the histograms and the frequency tables for the variables in the dataset. The age (AG) variable is strongly concentrated at older ages, we notice a “dip” in the distribution between 25 and 40 years, this is due to the specific subscription policies of the Italian insurer: since the policyholders are managers, it is unfrequent they have less than 40 yearsFootnote 3 moreover, they are not allowed to enroll in the insurance coverage their children above 25 years of age. That’s why we observe only a small number of insured between 25 and 40 years. The gender (GE) is rather balanced between the two classes. The permanence (PE) is an integer variable that reports the years of permanence in the insurance plan, minimum is 0 (for new comers) and maximum is 41 (for early adopters). Observing the plot in Figure 3, we notice a decreasing trend; however, there is strong peak around 38 years, probably due to the subscription of the policy by a consistent number of companies in the industry. For the region (RE), we observe a strong concentration in 2 of 21 Italian regions: “Lombardia” and “Lazio”, where most firms have their head office. Dimension (DM) is the number of managers of the company to which the policyholder belongs, the value of this variable is the same across all the insured belonging to the same family and ranges from 1 to 1,500, with a strong concentration below 100, which represents the small to medium sized firms (that are specific to the Italian economy).
In the right bottom panel in Figure 3, we also plot the histogram of the aggregate claim severity for the 217,006 claimants. We recall that the aggregate claim severity is defined as the sum of the cost of all claims submitted by a given claimant. As it is common in this field, the distribution is right-skewed, from the plot we observe the existence of some large claims; however, the tail doesn’t seem to be particularly fat.
4.2. Evaluate model performance
To assess the general performance of the QR, QRNN, and Quantile-CANN models, we use the aforementioned dataset at different $\tau^{\star}$ levels. Specifically, the performance of each model is measured in terms of quantile loss function (see Equation (6)), where the lower the loss the better the model.
The results reported in Table 3 are obtained by means of five-fold cross validation, where the dataset is divided in five folds and at each iteration, three of the five folds are used as training set, one as validation and one as testing set. The network models are trained over 2, 000 epochs using early stopping on the validation set to avoid overfitting. For the early stopping, we employ a patience parameter of 200 epochs; hence, the training stops when the validation loss does not decrease for at least 200 epochs. When training stops, the network weights obtained at the 200th last epoch are restored and saved as the optimal parameters for the model. Both models adopt a three hidden layer structure of dimension (20, 15, 10), we consider the hyperbolic tangentFootnote 4 activation function for QRNN, while we use the ReLuFootnote 5 activation function for Quantile-CANN. As for the variables presented in Table 1: AG, PE, DM are Min-Max scaled, GE is dummy encoded and RE is treated using a $d=1$ embedding layer. As for the QR model, we consider a splines function to model the AG and the PE effects.
For each model, we estimate the conditional quantile of the total claim amount at levels $\tau^{\star}=(0.7, 0.75, 0.80, 0.85,0.9)$ and compute the respective in sample and out of sample quantile loss function to evaluate their performance.
For our dataset, QRNN and Quantile-CANN exhibit an overall better performance in terms of the quantile loss function compared to the classical QR for each quantile level $\tau^{\star}$ (see Table 3). It is interesting to note that Quantile-CANN always yields a lower quantile loss function compared to the QRNN, suggesting that building the network around the QR has not only improved the performance given by the QR model but also beats the QRNN. The results above illustrated seem to suggest that using a neural network approach can be appealing. However, such models often face one major drawback: the lack of explainability. Indeed, neural networks have a huge number of parameters and a complex inner structure made up of several hidden layers, that make make very difficult for the modeler to understand the results. To overcome such limitations, in recent years, a wide literature covering the topic of model agnostic tools has flourished, see Friedman & Popescu (Reference Friedman and Popescu2008a) and for an actuarial case study Lorentzen & Mayer (Reference Lorentzen and Mayer2020) or Henckaerts et al. (Reference Henckaerts, Côté, Antonio and Verbelen2021), aiming at providing interpretative tools for machine learning models.
The corresponding literature has a plethora of well established model agnostic techniques. In this work, we exploit the permutation variable importance of Breiman et al. (Reference Breiman, Friedman, Stone and Olshen1984) to gauge the relevance of each covariate in the model; the ICE curves and the PD plots to showcase the marginal effect of a covariate over the prediction produced by the model; the H-squared statistic Friedman & Popescu (Reference Friedman and Popescu2008a) in order to spot potential interaction effects between the covariates.
4.3 Variable importance
Permutation variable importance of Breiman et al. (Reference Breiman, Friedman, Stone and Olshen1984) measures the increase in the deviance of a model after permuting the values of a given covariate. The basic idea is quite simple: the importance of a covariate is measured by calculating the increase in the model’s deviance after permuting the covariate. In other terms, the variable importance is the increase in model deviance when the information provided by an explanatory variable is destroyed. Such variable is deemed important if randomly shuffling its values increases the model deviance, because in this case the model relied on the feature for the prediction. While, a covariate is not important if shuffling its values produces little to no increase in the model’s deviance
In what follows the general framework for the algorithm.
Let f be the generic prediction function given by a model, where $\boldsymbol{x}$ is the feature matrix, y is the vector of observations and D(y,f) is the model’s deviance. For instance in QR we have $f=\textrm{exp}(\boldsymbol{x}^\prime\hat{\boldsymbol{\beta}}(\tau^{\star}))$ and $D(y,f)=\rho_{\tau^{\star}} (y-f)$ .
-
1. We estimate the original model Deviance $D_{0}=D(y,f(\boldsymbol{x}))$ ;
-
2. For each covariate $j=1,...,p$ :
-
take the covariate matrix $\boldsymbol{x}_{i}$ , that can also be represented as $\boldsymbol{x}_{i}=\left \{\boldsymbol{x}_{.,j},\boldsymbol{x}_{.,-j} \right \}$ , where $\boldsymbol{x}_{.,j}$ is the column vector belonging to covariate j and $\boldsymbol{x}_{.,-j}$ is the matrix for the other covariates. Get the set of covariates $\boldsymbol{x}^{perm}$ by permuting the values in $\boldsymbol{x}_{.,j}$ ;
-
estimate model deviance $D_{perm}=D(y,f(\boldsymbol{x}^{perm}))$ ;
-
compute the Permutation Variable Importance for covariate j as $I_{j}=D_{perm}-D_{0}$
-
3. Sort covariates by descending order $I_{j}$ for $j=1,...,p$ .
Permutation variable importance can be either performed on the training set or the test set. Performing the analysis on the test set informs the modeler on how much the model counts on each covariate for the predictions, while using the test set would give a hint on the relevance of the covariate for the performance of the model on new and unseen data.
In Figure 4, we report the variable importance metric to find the most relevant variables in our dataset for the quantile models. The variables are ranked from top to bottom, starting with the most important one as measured by the variable importance. For all models, the most important variable is the Age (AG) followed by the spatial variable (RE), the other variables are far less relevant; however, they seem to have a somewhat higher importance in network models with respect to the QR.
4.4 Main effects
ICE profiles are a useful tool to study the marginal effect of a covariate over the response provided by the model. Such a profile, for a given covariate j, shows how the prediction provided by the model for an observation $obs_{i}=\boldsymbol{x}_{i}$ reacts when the covariate $\boldsymbol{x}_{i,j}$ slides over its range of possible values.
In particular, producing ICE profiles for a set of observations gives a hint on how the response evolves with respect to the different values of the variable. An ICE plot represents the relationship between the prediction and a specific covariate for each single observation separately, producing one profile (or line) per observation. The values for a line of the data matrix is obtained by fixing the values of all other covariates and creating variants of this observation by replacing the covariate’s value with values coming from a grid and producing predictions with the model for these new observations. This procedure, for a specific observation, results in a set of points given by the feature value from the grid and the respective predictions. From an algorithmic standpoint, we have:
-
1. take an observation $obs_{i}=\boldsymbol{x}_{i}$ and the corresponding prediction $f(\boldsymbol{x}_{i})$ given by the model.
-
2. $\boldsymbol{x}_{i}$ is a p dimensional vector that can also represented as $\boldsymbol{x}_{i}=\left \{\boldsymbol{x}_{i,j},\boldsymbol{x}_{i,-j} \right \}$ , where j is the covariate we want to study.
-
3. Consider the grid of V possible values $(\textrm{v}_{1},\textrm{v}_{2},\dots,\textrm{v}_{V})$ for the selected covariate j.
Then for each $\textrm{v}_{k}$ with $k=1,...,V$ we repeat:
-
$\boldsymbol{x}_{i,j}=\textrm{v}_{k}$ .
-
$obs_{i}=\boldsymbol{x}_{i}=\left \{\textrm{v}_{k},\boldsymbol{x}_{i,-j} \right \}$ .
-
$ICE_{i,\textrm{v}_{k}}^{j}=f(obs_{i})$ .
Once the process ends, we obtain a curve $\left \{ICE_{i,\textrm{v}_{k}}^{j}\right \}_{k=1}^{V}$ corresponding to $\left \{\textrm{v}_{k},\boldsymbol{x}_{i,-j} \right \}_{k=1}^{V}$ .
The process is repeated potentially for each observation in the dataset and each variable.
The ICE profiles are useful to highlight the presence of interactions. Infact, the stronger the interaction effects associated with the variable j, the greater the differences in shape observed across ICE profiles. However, this model agnostic tool does not reveal with which other variable the interaction arises. Note that, by construction, ICE profiles of a given covariate j are parallel as long as the underlying model does not incorporate interactions (like in a GLM or in a QR on the log scale transformation).
The partial dependence (PD) profiles of Friedman & Popescu (Reference Friedman and Popescu2008b) are obtained averaging different ICE profiles of a given variable j. PD profiles can be viewed as the main effect of covariate j merged over all observations. In other terms, they represent the average effect of variable j and are able to display whether the relationship between the response and a covariate is linear, monotonic, or more complex.
PD marginalises the model’s prediction over the distribution of the covariates in $\boldsymbol{x}_{.,-j}$ , so that the profile displays the relationship between variable j and the output produced by the model.
If we consider the ICE profiles obtained for variable j over a set of n observations, $ICE_{i,\textrm{v}_{k}}^{j}$ where $i=1,...,n$ and $k=1,...,V$ , the PD profile is obtained as
where, again, V is the grid of possible values for the selected covariate. The $PD_{\textrm{v}_{k}}^{j}$ function, for a given value $\textrm{v}_{k}$ of variable j, reveals the average marginal effect on the prediction returned by the model. Note that PD profiles are not restricted to a single variable, it is also possible to consider multiple variables at the same time in order to study their joint effect on the response.
In Figures 5 and 6, we consider PD plots and individual conditional expectations (ICEs) to gain an understanding of the main effects of the variables over the conditional quantile of the total claim severity, for the different models. The different ICE profiles in Figure 6 were coloured to report the amount of the observed claim (the darker the colour the higher the amount) to detect possible peculiar profiles. The top-left plot in Figure 5 compares the PD plots for the AG variable produced by the different models. At first glance, the curves look quite similar, however taking a closer look we notice an important difference in the leftmost part of the plot, for the values below 25 years of age. More specifically, for both QRNN and Quantile-CANN, we observe an upward trend in the riskiness of the insured, starting from 500 euros and peaking at 1, 500 euros at around 15 years of age, then the curve falls off down to 700 euros at 25 years. While the QR shows a flat trend kicking off at around 1, 000 euro and slowly increasing after 25 years of age. The behaviour displayed by network models seems to be more reasonable, since it captures the cost for dental treatment in younger ages, as it is usually low for children below 10 years, while it raises significantly for teenagers often associated with the use of dental braces. The ICE curves associated with the PD plot displayed in the top-left plot of Figure 6 do not seem to reveal specific interactions within the neural network models, since all the profiles look almost parallel in the plot, which is the case when the variable has little to no interactions with other variables. Of course, the ICE profiles for the QR are always parallel (on a log scale), since the variables are always modelled additively.
The permanence (PE) PD plots display a similar evolution but at different levels (top-right pane in Figure 5). Furthermore, for this variable, we observe some wild ICE profiles (top-right pane in Figure 6), that may signal the presence of interactions with other variables. The Gender (GE) variable does not seem to have a particular behaviour, in fact, the range for the y-axis is pretty narrow. As for the regional variable (RE), the PD plots show almost identical profiles, with a particularly riskiness associated with insured living in Lazio and Valle d’Aosta.
4.5 Interaction effects
After having a look at main effects and at some clues on possible interactions in Section 4.4, here we closely study possible interaction effects between covariates captured by the network models. When a given model incorporates an interaction effect, its predictions cannot be expressed as the sum of its variables’ main effects, because the effect of one variable depends on the value of another variable. To assess the presence of interaction effects, we adopt the H-statistic introduced by Friedman & Popescu (Reference Friedman and Popescu2008b), which estimates the interaction strength between two covariates by measuring how much of the prediction variance originates from their interaction. To measure pairwise interaction strength between covariates j and k, H-statistic is defined as follows:
where the sums run over a subset of n randomly selected observations, $\bar{PD}^{k}$ is the centered versionFootnote 6 of the PD profile for variable k, and $\bar{PD}^{jk}$ is the centered two-way PD for variable j and k. In other words, $H_{jk}$ measures the proportion of variability in the joint effect of $\boldsymbol{x}_{.,j}$ and $\boldsymbol{x}_{.,k}$ unexplained by their main effects. A value close to zero indicates almost no pairwise interaction, while a value close to one means that most effects come from the pairwise interaction. The H-statistic can also be larger than 1, this can happen when the variance of the joint interaction is larger than the variance of the 2-dimensional PD plot. Hence, the interaction strength is measured as the share of variance explained by the interaction.
In Figure 7, we plot the values of the H-statistic for each model and each possible pairwise interaction. As discussed above, the QR model does not display any interaction between covariates, since it is not designed to do so. While, both QRNN and Quantile-CANN display some specific interactions. The strongest common interaction between the two models is the one between the firm’s dimension (DM) and the gender of the insured (GE), followed by the one between the Permanence (PE) and, again, the Gender (GE). To gain insight on the behaviour of the interaction effects, in Figures 8 and 9, we report PD plots for Dimension (DM) and Permanence (PE), grouped with respect to Gender (GE). The interaction appears to be relevant if the curves display a different behaviour when conditioned to the different values of the gender variable.
For the first interaction, in Figure 8, we notice that both network models recognise higher riskiness to male insured, indeed in the two panels, the blue line (in Figure 8) lies always above the red curve. However, there are some relevant differences between QRNN and Quantile-CANN that are worthy of discussion. In particular, in the QRNN plot, the two curves display a downward parabolic trend which is rather steep for male insured and far flatter for females. Both curves reach their minimum at around 25 years of permanence and then they start increasing again, reaching 1, 900 euro for males and 1, 700 euros for females, the difference between the two curves is rather stable across the plot (though not constant). For Quantile-CANN, we observe a somewhat different behaviour, in fact, the two curves display a large distance for the insured with less than 10 years of permanence, while in the right part of the plot the distance in terms of potential riskiness between males and females is almost negligible.
Also for the interaction between dimension (DM) and gender (GE), the grouped PD plots (Figure 9) report an higher riskiness associated with male insured. Both panels show an upward parabolic trend with a maximum at around 500 for both plots for the PD plots associated with males insured. In the QRNN panel, we notice a quasi-sinusoidal trend for the females grouped PD plot, with a very low variability for the values in the y-axis. In other terms, for the QRNN, the dimension variable only has some sort of significant effect for males, while for females the effect of this variable seems negligible. As for the Quantile-CANN female plot, we have first had a substantially stable trend, then followed by a downward parabolic trend, with a good variability for the insured potential riskiness.
Thus, even though the gender variable did not appear to be significant in Figures 4 and 7, it has some relevance when interacting with other covariates. As shown above, network models have proven to be good at detecting such interactions.
4.6 Ratemaking
In Section 4.2, we have investigated models’ behaviours evaluated at different quantile levels. We are now interested in evaluating models’ performances when the focus is estimating the quantile premium principle introduced by Heras et al. (Reference Heras, Moreno and Vilar-Zanón2018), where the premium paid by the insured is loaded according to its potential riskiness. In particular, we consider the convex combination of the quantile claim severity $Q_{\tilde{S}_{i}}(\tau^{\star}_{i}|\boldsymbol{x}_{i})$ and the conditional expected value of the total claim amount $S_{i}$ :
where $0<\gamma<1$ is the loading factor and $E(S_{i}|\boldsymbol{x}_{i})$ the conditional expected value of the total claim amount that can be decomposed as $E(S_{i}|\boldsymbol{x}_{i})=Pr(N_{i}>0|\boldsymbol{x}_{i})\cdot E(\tilde{S}_{i}|\boldsymbol{x}_{i})$ . Note that $Pr(N_{i}>0|\boldsymbol{x}_{i})=1-p_i$ can be obtained from Equation (1), while $E(\tilde{S}_{i}|\boldsymbol{x}_{i})$ is estimated using a Gamma regression model as in Frees (Reference Frees2010). In order to compute (18), we need to obtain $\hat{Q}_{\tilde{S_{i}}}(\tau^{\star}_{i}|\boldsymbol{x}_{i})$ using the two-part model discussed in Section 2.
The two-part model is estimated in taking the first fold in the 5-fold cross-validation, where we consider a 60-20-20 split between learning, validation, and testing set. The first step of the two-part model consists in estimating the no claim probability $p_{i}$ for each insured using logistic regression. Then as discussed in Section 2, the estimated $p_{i}$ is employed to compute the $\tau^{\star}_{i}$ level as in Equation (2) for each insured setting $\tau=0.95$ . As a result of this first step, we obtain 47,120 unique values for $\tau^{\star}_{i}$ ranging between $0.799$ and $0.896$ . Following the two-part approach designed by Heras et al. (Reference Heras, Moreno and Vilar-Zanón2018) would involve fitting a regression model for each different quantile level $\tau^{\star}_{i}$ . Doing this would be rather time-consuming. Thus, to avoid that, we approximate the $\tau^{\star}_{i}$ values up to the second digit, where the second digit is rounded to the closest even digitFootnote 7 we perform the second step of the two-part model by computing the conditional quantile of the claim severity $\hat{Q}_{\tilde{S_{i}}}(\tau^{\star}_{i}|\boldsymbol{x}_{i})$ using QR, QRNN, and Quantile-CANN.
In order to test the ability of the different models to accurately estimate the desired quantile level of $\tilde{S_{i}}$ , we use the backtest criteria approach. More specifically, the models are backtested using the unconditional coverage (UC) test proposed by Kupiec (Reference Kupiec1995), which is a widespread testing technique generally employed to validate VaR models in the financial literature. This technique consists in a binomial test checking if the proportion of insured with a claim severity above the conditional quantile is consistent with the predefined quantile level $\tau^{\star}$ . The UC test performs a likelihood ratio test, where the null hypothesis of the test states that the probability of a violationFootnote 8 is equal to ${\tau}_{c}=(1-\tau^{\star})$ .
More formally, given a generic insured i, we can define the violation function as
Hence, given a portfolio composed of I insured, we define the total amount of the violations occurred in the portfolio as $I(\tau^{\star})=\sum_{i=1}^{I}I_{i}(\tau^{\star})$ , and the ratio of occurred violations as
It is now possible to define the UC test statistic as
The null and the alternative hypothesis for the UC test are defined as
In other terms, considering a portfolio of I insured, if the number of occurred violations $\hat{\tau}_{c}\cdot I$ is close enough to $\tau_{c}\cdot I$ , the test statistic ${LR}_{uc}$ is low, and the null hypothesis is not rejected. There is no evidence of any inadequacy for the tested quantile estimate. While, if the number of violations strongly differs from $\tau_{c}\cdot I$ , the test statistic increases indicating growing evidence that the proposed quantile either systematically understates or overstates the portfolio’s potential riskiness, and thus the null hypothesis is rejected. In Table 4, we report the backtesting results for QR, QRNN, and Quantile-CANN. In the left part of the table, we report the unconditional coverage test statistic $LR_{uc}$ , while on the right side of the table, we display the corresponding p-values, keeping in mind that the null hypothesis is not rejected when the p-value is larger than 0.05. From Table 4, we observe that QRNN and Quantile-CANN pass the backtest at almost all quantile levels since the null hypothesis of unconditional coverage is not rejected. More specifically, the QRNN and Quantile-CANN pass the test at all quantile levels, with the sole exception of the $\tau^{\star}=0.8$ level, while QR fails the backtest at two quantile levels: $0.86$ and $0.82$ . Therefore, QRNN and Quantile-CANN seem more able to accurately estimate the quantile of the total claim severity $\hat{Q}_{\tilde{S_{i}}}(\tau^{\star}_{i}|\boldsymbol{x}_{i})$ .
Once the models are estimated, we are able to calculate tariffs $P_{i}$ as in Equation (18) and evaluate them using the ordered Lorenz curve introduced by Frees et al. (Reference Frees, Meyers and Cummings2014). This tool is a twist on the classical Lorenz curve usually employed in welfare economics to represent social inequality via the Gini index, see Farris (Reference Farris2010). In insurance literature, the ordered Lorenz curve is employed to compare different tariff structures issued by a set of competing models. Given a base tariff structure $P_{i}^{base}$ and a competing tariff $P_{i}^{comp}$ the Lorenz curve proposed by Frees et al. (Reference Frees, Meyers and Cummings2014) is ordered with respect to the relativity $r_{i}$ :
A relativity $r_{i}$ consistently below 1 reveals a largely profitable policy for the company, that is likely to be lost to a competing insurance company proposing a cheaper premium. Instead a relativity $r_{i}$ greater than 1 signals an underpriced policy. Of course these statement hold true only if we assume $P_{i}^{comp}$ to give a sharper representation of the real risk compared to $P_{i}^{base}$ .
Given $r_{i}$ , the ordered Lorenz curve can be defined as follows:
for s $\in$ [0, 1] where $F_{n}(r_{i})$ is the empirical cumulative distribution function of the relativities $r_{i}$ . The idea behind the ordered Lorenz curve is that a model producing tariffs with a greater Gini index produces a stronger separation among premiums paid by the insured, signaling that such model is more capable to distinguish good risks from bad risks. Hence, a tariff structure $P_{i}^{comp}$ that yields a larger Gini index is likely to result in a more profitable portfolio because of a better risk differentiation.
Table 5 displays the two-way comparison of Gini indices for the network models and the QR. The rows report the model generating the base tariff structure $P_{i}^{base}$ whereas the column stores the model from which the competing tariff structure $P_{i}^{comp}$ is generated. The approach we use for selecting a tariff based on the Gini index is the “mini-max” strategy designed by Frees et al. (Reference Frees, Meyers and Cummings2014), which consists of selecting the model that provides the minimum Gini index among the maximal Gini indices taken over the competing models. The strategy is rather intuitive: if we have to choose a base premium we chose the one that has the minimal maximum improvement when compared to other models, meaning that the selected base premium is the least vulnerable to alternatives. In practice, we look for the base model with the lowest value in bold in Table 5. The Quantile-CANN appears as a clear winner since the other models are not able to achieve an high Gini index when considered as an alternative (see the last row), signaling that this model leads to a tariff structure that is the least likely to incur in adverse selection. The QRNN tariff structure achieves second place, followed by QR.
5. Conclusions
In the domain of insurance ratemaking, neural networks may be considered a tool to perform high-dimensional nonlinear regressions. Following this line, in this paper, we extend such techniques in order to estimate the conditional quantile of the total claim amount in the context of a two-part model devoted to the definition of a loaded premium. To fulfill the quantile estimation task, we propose two models. First, we use QRNN of Taylor (Reference Taylor2000), which is a particular neural network specification that has never been applied before in actuarial sciences. Next, we generalise the CANN approach of Schelldorfer & Wüthrich (Reference Schelldorfer and Wüthrich2019) introducing the Quantile-CANN, a flexible tool that merges the classical QR with a QRNN allowing to improve the results given by the QR model and the QRNN.
In the first part of the empirical application, Section 4.2, we test the performance of the proposed network models against the traditional QR model over a health insurance claim dataset. The results show that our models outperform QR in terms of quantile loss function. Furthermore, for QRNN and Quantile-CANN, we apply the set of model agnostic tools discussed in Lorentzen & Mayer (Reference Lorentzen and Mayer2020) to gain additional insight in the dataset.
In the second part of the empirical application, Section 4.6, we compute the Quantile Premium Principle introduced by Heras et al. (Reference Heras, Moreno and Vilar-Zanón2018) using the different models discussed in the paper (QRNN, Quantile-CANN, and QR). To compare the tariff structures issued by the models, we adopt the ordered Lorenz designed by Frees et al. (Reference Frees, Meyers and Cummings2014). According to this technique, the proposed network models exhibit a better tariff structure w.r.t QR since they provide a better risk differentiation for the portfolio. This feature is paramount for the insurer since a better differentiation between good and bad risks is likely to increase profits.
This work focuses on the comparison between the neural network-based models and the quantile regression, further research could explore different network configurations, i.e., increasing the number of dimensions for the embedding layer used for the regional variable or even different machine learning models, such as tree-based models or gradient boosting machines. Another possible development of this work could consider a multivariate QRNN or Quantile-CANN approach in order to jointly model the conditional quantile of the total claim severity for different and possibly correlated claim types.