A comparison of three statistical methods for analysing extinction threat status

HEATHER R. TAFT; DEREK A. ROFF; ATTE KOMONEN; JANNE S. KOTIAHO

doi:10.1017/S0376892913000246

Published online by Cambridge University Press: 05 August 2013

HEATHER R. TAFT*: Affiliation:
Department of Biology, University of California, Riverside, California 92521, USA
DEREK A. ROFF: Affiliation:
Department of Biology, University of California, Riverside, California 92521, USA
ATTE KOMONEN: Affiliation:
Department of Biological and Environmental Science, University of Jyväskylä, PO Box 35, FI-40014, Finland
JANNE S. KOTIAHO: Affiliation:
Department of Biological and Environmental Science, University of Jyväskylä, PO Box 35, FI-40014, Finland
*: *Correspondence: Dr Heather Taft e-mail: [email protected]

Summary

The International Union for Conservation of Nature (IUCN) Red List provides a globally recognized evaluation of the conservation status of species, with the aim of catalysing appropriate conservation action. However, in some parts of the world, species data may be lacking or insufficient to predict risk status. If species with shared ecological or life history characteristics also tend to share their risk of extinction, then ecological or life history characteristics may be used to predict which species may be at risk, although perhaps not yet classified as such by the IUCN. Statistical models may be a means to determine whether there are non-threatened or unclassified species that share the characteristics of threatened species; however, there are no data on which model might be most appropriate or whether multiple models should be used. In this paper, three types of statistical models, namely regression trees, logistic regression and discriminant function analysis, are compared using data on the ecological characteristics of Finnish lepidopterans (butterflies and moths). Overall, logistic regression performed slightly better than discriminant function analysis in predicting species status, and both outperformed regression trees. Uncertainty in species classification suggests that multiple analyses should be performed and particular attention devoted to those species for which the methods disagree. Such standard statistical methods may be a valuable additional tool in assessing the likely threat status of a species where there is a paucity of abundance data.

Keywords

discriminant function analysis; Geometridae; Lepidoptera; logistic regression; Noctuidae; regression tree analysis; threat status

Type: Papers
Information: Environmental Conservation, Volume 41, Issue 1, March 2014, pp. 37-44

DOI: https://doi.org/10.1017/S0376892913000246
Copyright: Copyright © Foundation for Environmental Conservation 2013

INTRODUCTION

Assessing the risk of extinction is important to determine which species are most prone to extinction and may be in need of human intervention. The International Union for Conservation of Nature (IUCN) has defined categories assessing species threat status on the basis of their risk of extinction. Classifications largely rely on quantitative information, but in practice expert opinion plays a strong role. Many different methods have been used to assess extinction risk, including population models (McCarthy et al. 2004), species-area correlations (Grelle 2005) and genetic analyses (Dunham et al. 1999). Obtaining the type of data needed to assess extinction risk can be problematic. For example, the assessment of population size and changes in population size can be difficult, particularly when assessing small populations whose individuals are not easily located. Genetic data, which may provide information about the degree of inbreeding and gene flow among populations, are costly, especially if multiple species are to be assessed, and may also be hard to obtain from small populations. However, data on simple ecological or life history characteristics can be obtained using less extensive population monitoring, or from existing knowledge of the natural history of species. In this paper, we assess the potential of such data to provide an alternative and reliable measure of extinction risk.

Among the studies that use ecological characteristics to assess extinction risk, several different types of analyses have been used, including multiple regression (Purvis et al. 2000; Krüger & Radford 2008), regression tree analysis (Boyer 2008; Davidson et al. 2009; Boyer 2010), logistic regression (Mattila et al. 2006, 2008; Franzen & Johannesson 2007) and risk ranking (Kotiaho et al. 2005). Several of these studies used multiple tests to assess extinction risk, often initially analysing single ecological or life history characters as predictor variables, followed by one of the statistical tests mentioned above to analyse the complete data set. Bielby et al. (2010) compared decision trees and phylogenetic comparative methods. Here we compare three statistical approaches: regression tree analysis, logistic regression and discriminant function analysis. We selected these three statistical methods primarily because they are commonly found in pre-packaged statistical software programs and thus would be easier for conservation managers to access and use than more complicated programs that require writing and editing code. Our aim was to determine if one or a combination of these statistical methods could be used to predict the threat category, threatened or non-threatened, of as yet unclassified species. Such an analysis may determine whether there are unclassified species that merit immediate attention.
Reclassification of a formerly non-threatened species to a threatened status is also a possible outcome, and may indicate that a species is in more immediate need of conservation management measures. The specific situation we envision is one in which most species within a particular taxon have been classified as threatened or non-threatened. If it can be shown that this classification is well predicted by one or more of the three statistical approaches using general ecological and life history characters (hereafter, for simplicity, LH characters) then unclassified species of the same taxon that share characteristics with threatened species can be identified to help prioritize further assessments of threat status.

We assess this approach using data on Finnish lepidopterans previously analysed by Komonen et al. (2004), Kotiaho et al. (2005) and Mattila et al. (2006, 2008). Komonen et al. (2004) used analysis of variance and linear regressions on subsets of LH characters to assess butterfly mobility, but did not relate this to IUCN threat status. Mattila et al. (2006, 2008) used similar analyses, running logistic regressions of IUCN threat status on single LH characteristics and then using multiple LH characteristics in a multinomial logistic regression to determine the ability to classify species into their correct IUCN threat status. Kotiaho et al. (2005) primarily used t-tests and logistic regressions with single LH characteristics to compare differences between threatened and non-threatened species. The four LH characteristics (dispersal ability, larval specificity, adult habitat breadth and length of flight period) that they found to be significantly related to IUCN threat status were used to create an ecological risk ranking of all the species by ranking the species according to each LH characteristic and then summing the four rankings. This summed ranking was used in a logistic regression and compared to the actual IUCN threat status of the species.

In the present analysis, we address the question of whether the LH characteristics of a species may be used directly en masse to predict the probability of unclassified species being threatened or non-threatened, and identify non-threatened species that may need reclassification of their threat status.

METHODS

Regression and classification trees

Roff and Roff (2003) initially suggested regression tree analysis to determine factors contributing to the risk of extinction. Several studies have since used a regression tree analysis to assess extinction probability (Jones et al. 2006; Boyer 2008, 2010; Davidson et al. 2009). In our case study the predicted category consisted of two states, ‘threatened’ (= 1) and ‘non-threatened’ (= 0), and thus regression and classification trees were identical. In principle, classification trees could be used to identify each different IUCN category. Owing to insufficient data, we restricted the analysis to the two states and used the regression tree approach.
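To illustrate what a regression tree does with a response coded 0 or 1, the following minimal Python sketch searches for the single split on one predictor that minimises the within-node sum of squares. It is an illustration only and not the software used in this study; the function names and the six data points are hypothetical, and pruning, cross-validation and significance testing are omitted. With a 0/1 response, the mean of each node is simply the proportion of threatened species in that node.

```python
def sse(values):
    """Sum of squared deviations from the node mean (0.0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Find the single threshold on predictor x that minimises the
    within-node sum of squares of the 0/1 response y.
    Returns (threshold, total_sse); threshold is None if no split helps."""
    pairs = sorted(zip(x, y))
    best = (None, sse([status for _, status in pairs]))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate tied predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [status for value, status in pairs if value <= threshold]
        right = [status for value, status in pairs if value > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical data: a distribution score and threat status for six species.
distribution = [1, 2, 3, 10, 11, 12]
threatened = [1, 1, 1, 0, 0, 0]
threshold, total_sse = best_split(distribution, threatened)
```

In this toy example the two groups are perfectly separated, so the best split leaves no residual variation in either node. When no split reduces the sum of squares, the function returns no threshold, which loosely corresponds to a tree with a single terminal node.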

Logistic regression

Our aim was to determine the accuracy of models created using logistic regression at predicting the correct assignment of a species as threatened or non-threatened based on LH characteristics. As with the regression trees, species in the non-threatened category were coded as 0 and those in the threatened category as 1. To use the model as a mechanism for placement of a species into the threatened or non-threatened category, we required a threshold for the fitted value, for example 0.5, above which species were placed into the threatened category and below which they were placed into the non-threatened category. Alternatively, we could have selected two thresholds, such as 0.25 and 0.75, and placed species below the lower threshold in the non-threatened category and species above the upper threshold in the threatened category, classifying species lying between the two thresholds as ‘uncertain.’ We explored the consequences of both approaches.
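The two thresholding schemes can be sketched as follows in Python. This is a hedged illustration and not the code used in the study: the function name `classify` and the fitted values are hypothetical, and assigning a fitted value that lies exactly on a threshold to the ‘uncertain’ class is an assumption, since the text does not specify how ties were handled.

```python
def classify(fitted, lower=0.5, upper=0.5):
    """Assign a threat category to each fitted probability.

    Values above `upper` are 'threatened', values below `lower` are
    'non-threatened', and anything else (including exact ties, by
    assumption) is 'uncertain'. With lower == upper this reduces to the
    single-threshold rule.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    labels = []
    for p in fitted:
        if p > upper:
            labels.append("threatened")
        elif p < lower:
            labels.append("non-threatened")
        else:
            labels.append("uncertain")
    return labels


# Hypothetical fitted values from a logistic regression for four species.
fitted_values = [0.10, 0.30, 0.60, 0.90]
single = classify(fitted_values)                # one threshold at 0.5
double = classify(fitted_values, 0.25, 0.75)    # two thresholds
```

Under the single threshold every species receives a definite category, whereas the two-threshold rule sets aside the two intermediate species as uncertain.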

An important point to note in this method of analysis is that, whereas the stopping point for the stepwise regression is defined by a metric such as the Akaike information criterion (AIC), the adequacy of the model in the present context was measured by the assignment to the two categories: because of this, it was possible for the ‘best’ model to contain more or fewer variables than that specified by the ‘best’ stepwise logistic model.

Discriminant function analysis

There are several covariance structures that can be identified when performing a discriminant function analysis: homoscedastic, spherical, proportional, group spherical, equal correlation and heteroscedastic. Principal components can also be specified for analysis in a discriminant function analysis. In the analyses here, to compare statistical tests, we used the covariance structure that correctly assigned the most threatened species based on LH characteristics.

Determining the probability of correct assignment by chance alone

An important consideration is the probability of assigning a species to the correct category, threatened or non-threatened, by chance alone. To determine this we used a simulation model (coding in R given in Appendix 1, see supplementary material at Journals.cambridge.org/ENC). First, we generated a vector, V, of length N, where N was the total number of species in the sample, with ones in the first n₁ rows and zeros in the remaining N – n₁ rows, n₁ being the number of threatened species and N – n₁ the number of non-threatened species. These zeros and ones were rearranged at random in the vector. The number of correct assignments in the threatened category, N₁, was given by

N_1 = \sum_{i=1}^{n_1} V_i ,

that is, the number of ones in the first n₁ rows, and the number of correct assignments in the non-threatened category, N₀, was

N_0 = N - n_1 - \sum_{i=n_1+1}^{N} V_i ,

namely the number of zeros in the remaining N – n₁ rows. The total number of correct assignments, N_T, was thus N_T = N₁ + N₀. These three numbers were stored and the process repeated to generate three new samples. The whole process was replicated 10 000 times, generating a matrix with three columns (total correctly assigned, correctly assigned to threatened, correctly assigned to non-threatened) and 10 000 rows. From this matrix we calculated the distribution of correct assignments. For each column we determined the probability of correctly assigning n_j species (j = 0 to N, j = 0 to n₁, j = 0 to N – n₁) as S_j/10 000, where S_j was the number of times n_j appeared in the relevant column.
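The permutation procedure described above can be sketched in Python as follows. This is not the R code of Appendix 1, which is not reproduced here; the function names, the fixed random seed and the choice of Python are assumptions made for illustration. The example call uses the butterfly sample sizes given under Data sets (80 species, of which 18 are threatened).

```python
import random
from collections import Counter


def chance_assignment(n_total, n_threatened, n_reps=10_000, seed=0):
    """Simulate correct assignments expected by chance alone.

    Returns one (total, threatened, non-threatened) tuple of correct
    assignments per replicate, matching the three-column matrix in the text.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    v = [1] * n_threatened + [0] * (n_total - n_threatened)
    rows = []
    for _ in range(n_reps):
        rng.shuffle(v)  # rearrange the zeros and ones at random
        # Ones falling in the first n_threatened rows are correct 'threatened'.
        n1 = sum(v[:n_threatened])
        # Zeros in the remaining rows are correct 'non-threatened'.
        n0 = (n_total - n_threatened) - sum(v[n_threatened:])
        rows.append((n1 + n0, n1, n0))
    return rows


def empirical_distribution(rows, column):
    """Probability of each observed count in one column (S_j / replicates)."""
    counts = Counter(row[column] for row in rows)
    return {value: count / len(rows) for value, count in sorted(counts.items())}


# Butterfly sample sizes from the text: 80 species, 18 of them threatened.
rows = chance_assignment(n_total=80, n_threatened=18)
threatened_dist = empirical_distribution(rows, column=1)
```

Because the vector always contains exactly n₁ ones, the numbers of correct assignments in the two categories are not independent: in every replicate N₀ exceeds N₁ by N – 2n₁.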

Determining the preferred method

As indicated below, regression tree analysis was not satisfactory for either of the two data sets (butterflies or moths) and, therefore, our further analysis focused upon logistic regression versus discriminant function analysis. We compared the ability of these two methods to correctly classify species into the threatened or non-threatened category with a χ² analysis. Of particular interest were those species which were incorrectly classified according to one or both methods: we plotted the predicted values from the logistic regression against the predicted values from the discriminant function analysis to see whether the species that were classified differently fell near the 0.5 cutoff. It is safer to classify non-threatened species as threatened than it is to classify threatened species as non-threatened, because in the former case a species will receive attention, but in the latter case a threatened species that needs attention will be overlooked. Therefore, we used the number of correctly classified threatened species in a final comparison of methods to determine which method was to be preferred, at least for the data sets assessed here.
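One way such a χ² comparison could be set up is shown in the Python sketch below. The exact form of the test used in the study is not stated, so the layout of the table (methods as rows, correct and incorrect classifications as columns), the absence of a continuity correction and the counts themselves are all assumptions made for illustration.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2 x 2 table
    given as [[a, b], [c, d]]; returns (statistic, p_value) with 1 df.
    Assumes every row and column total is greater than zero."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (table[i][j] - expected) ** 2 / expected
    # For 1 df the upper-tail probability is erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts for 18 threatened species classified by two methods:
# rows are methods, columns are (correctly classified, misclassified).
counts = [[15, 3], [12, 6]]
statistic, p_value = chi_square_2x2(counts)
```

With these invented counts the statistic is 4/3 on one degree of freedom, which would not indicate a difference between the methods at conventional significance levels.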

Data sets

The Kotiaho et al. (2005) butterfly data set consisted of 94 species and 13 predictor variables: family, genus, species, abundance, distribution, distribution change, resource distribution, extent of range, larval specificity, female size, length of flight period, mobility and habitat breadth (see Komonen et al. 2004 and Kotiaho et al. 2005 for variable definitions). Because the primary criterion for listing these species according to IUCN threat status is a function of three ‘distribution’ variables (distribution, distribution change, and extent of range), we included these variables as predictor variables to assess whether any of the other variables were better at predicting IUCN threat status. After initial analysis, these distribution variables were excluded from subsequent analyses to determine whether any more easily accessible variables (such as those obtainable from published natural history descriptions) could be used to predict threat status. One species, Glaucopsyche alexis, was listed using only abundance as the criterion, and so this species was not used in the analyses. Thirteen other species were excluded due to missing data. The analysed data consisted of 18 threatened and 62 non-threatened species. The rest of the variables, excluding resource distribution due to lack of data, were used to predict IUCN threat status as given in the 2000 Finnish Red List (Rassi et al. 2001). One of the principal aims of the analysis was to investigate the ability of variables that are readily available from published data on the natural history of a species to determine threat status. Therefore, we ran the analyses with and without the variable ‘abundance,’ which might typically be difficult to estimate and, in many cases, unavailable.
However, due to the similarity of the results, only the analyses excluding abundance are reported here (Appendix 2 provides a table of the analyses including abundance, see supplementary material at Journals.cambridge.org/ENC).

Two data sets on moths were used, one on noctuids (Mattila et al. 2006) and one on geometrids (Mattila et al. 2008). These two data sets consisted of 284 and 306 species, respectively, and each had the same eight predictor variables: genus, species, male size, length of flight period, larval specificity, resource distribution, overwintering stage, and distribution change. After we ran the analyses with each data set, we combined them into a single data set with the added variable ‘taxonomic family’ to increase the power of the analyses and avoid issues of non-independence by using closely related species. In total, we excluded 40 species from the data sets due to missing data. The analysed data consisted of 68 threatened and 482 non-threatened species. Distribution change was the only distribution variable for these data sets and, after initial analysis, was, as before, excluded to determine which other variables may be important for predicting IUCN threat status. As previously noted, the response variable for all data sets was binomial, threatened or non-threatened. Species listed as near threatened or higher according to the IUCN threat status were coded as threatened and the rest as non-threatened.

RESULTS

Among the three distribution variables used in the butterfly data set (distribution, range position, and distribution change), only distribution and range position were highly correlated (r = 0.64; Table 1). Correlations between any of the distribution variables and the other variables were highest for distribution and mobility (r = 0.74) and distribution and length of flight (r = 0.64), although these values were not high enough to cause problems with collinearity since they were < 0.90 (Tabachnick & Fidell 2007, pp. 89–90). None of the correlations among variables for the geometrid or noctuid data sets exceeded an absolute value of 0.26 and thus did not pose problems with collinearity.

Table 1 Correlations among the continuous variables used in the analyses of the butterfly data (sample size = 18 threatened species and 62 non-threatened species). *p < 0.05, **p < 0.01, ***p < 0.001; probabilities not corrected for multiple tests.

We first examined the ability of the distribution variables by themselves to classify species status. After this we considered the ability of the other LH characteristics to classify species status.

Regression trees

Distribution variables only

A significant regression tree with two nodes was obtained (p = 0.0002) using only the distribution variables from the butterfly data set. This split in the regression tree was based on distribution and correctly classified 94% of threatened species and 95% of non-threatened species. When we used the variable distribution change in a regression tree analysis on the geometrid data, the pruned regression tree was significant (p = 0.0006) and had two nodes, but all threatened species were misclassified as non-threatened. The threatened species did not have any defining range of distribution change that could be used to divide the data. When we used only distribution change in a regression tree analysis on the noctuid data set, the pruned regression tree was significant (p = 0.0002) and had five nodes with a misclassification rate of 11%. Using distribution change in a regression tree analysis on the combined moth data set, we found that the pruned regression tree was significant (p = 0.0002) and had four nodes, but all threatened species were misclassified, again indicating no defined range for the threatened species.

Other LH variables

Using variables from the butterfly data set not explicitly used for determining IUCN threat status (family, mobility, larval specificity, habitat breadth, female size, and flight length), a pruned tree could not be created because only one terminal node was produced during the cross-validation. Thus, in this case, regression tree analysis could not discriminate threatened from non-threatened species.

When all the variables (male size, length of flight period, larval specificity, and overwintering stage) except distribution change were used from the geometrid data set, a non-significant pruned tree with two nodes resulted (p = 0.2826), but no threatened species were correctly classified because both nodes were classified as non-threatened in the regression tree. Thus, breaking the data down into these two nodes based on length of flight period did not allow for enough subdivision of the data to correctly assign IUCN threat status. When all the variables from the noctuid data set or the combined data set (male size, length of flight period, larval specificity, and overwintering stage) except distribution change were used, a pruned tree could not be created because only one terminal node was produced during the cross-validation.

We conclude that for these data sets regression tree analysis did not result in a satisfactory prediction of threat status.