Introduction
Snow stability data are, like avalanche occurrence data, most closely related to avalanche release probability, which is the key parameter to be forecasted by any avalanche-forecasting service or backcountry operation. In the context of avalanche forecasting, observations that provide information on snow stability have been termed low-entropy data (LaChapelle, 1980) or Class I data (McClung and Schaerer, 2006).
Snow avalanches are, unlike most other mass movements, to a certain degree predictable (Schweizer, 2008). Even more important, in the case of snow avalanches direct observations can be made that provide information on the probability of a mass movement within the next few days. No simple in situ tests exist to assess, for example, the landslide risk. In that respect, snow stability tests are unique. In this paper, we call any test that is made in situ and provides information on snowpack instability simply a ‘snow stability test’.
These tests are primarily indicators of whether triggering by localized dynamic loading (e.g. people or explosives) is likely. This is appropriate because most fatalities are caused by human triggering (e.g. Schweizer and Lütschg, 2001).
Besides simple field observations on avalanche occurrence, shooting cracks and whumpfs (i.e. sound of collapsing weak layers), which all indicate instability (e.g. Jamieson and others, 2009), a variety of in situ tests has been developed over the last five decades. All these tests aim to determine whether the sloping snowpack is stable. None of them provides the definitive answer. The reasons for the insufficiency of the tests are mainly the inherent limitations of the test (and its loading method) in replicating the avalanche release process, the spatially variable nature of the mountain snowpack (e.g. Schweizer and others, 2008a) and the complexity of the avalanche release process. Repeatedly, the question arose as to whether the tests are useful at all.
In this paper, we describe test requirements, review the most common existing tests and assess their performance. Test performance is assessed primarily based on a number of studies that compared various snow stability tests (e.g. Gauthier and Jamieson, 2008b; Moner and others, 2008; Simenhois and Birkeland, 2009; Winkler and Schweizer, 2009).
Requirements
It is generally perceived that snow stability tests should determine whether a particular slope is stable or unstable. This seems a rather unrealistic aim. First of all, when a test is made on an avalanche slope, that slope is assumed stable, otherwise the field team doing the test might have been caught in an avalanche and should not have entered the slope. Consequently, tests are often made on small slopes that do not avalanche or on slopes with a steepness <30°. By doing so, one has to assume that the conditions found on this relatively safe slope are representative of larger and/or steeper slopes of similar aspect in the surroundings. However, due to the spatially variable nature of the snowpack, extrapolation is not straightforward and requires experience. Therefore we suggest that a snow stability test cannot provide the ideal information, that is whether a slope is stable.
For the release of a dry-snow slab avalanche a number of requirements have to be fulfilled. These are: (1) the slope has to have a minimal slope angle (∼30°); (2) the snow layering has to be such that a cohesive slab layer overlies a weak layer; (3) this slab/weak-layer stratigraphy has to exist over a minimal extent of several tens of m²; (4) the snowpack has to be in a metastable condition, i.e. the strength of the weak layer is, at least at some locations, of similar magnitude to the applied stress; (5) there is an external (or internal) trigger present; (6) an initial failure tends to propagate; and (7) the slab breaks up and slides downslope, i.e. friction is overcome. The requirements that can be checked are (1), (2), (4) and (6). A trigger (5) inherently exists when applying a stability test. Information on requirement (3) is usually not available, but occasionally can be estimated if the slab/weak-layer combination and its formation are known, and (7) is given in most cases on large and steep slopes.
In the context of fractures within weak snowpack layers, we distinguish between fracture initiation and fracture propagation as in Schweizer and others (2003). A fracture or crack initiates if a localized dynamic perturbation (e.g. a trigger such as a skier or tapping on the top of a snow column) causes a crack in a weak layer. A crack or fracture that does not advance (propagate) beyond the influence of the localized perturbation is subcritical. A crack or fracture that advances beyond the influence of the localized perturbation has started to propagate. Fractures that were initiated but did not propagate across the slope were documented, for example, by Van Herwijnen and Jamieson (2005).
In summary, the snow stratigraphy (slab over weak layer), failure initiation and fracture propagation are the most important requirements (apart from steepness) for a dry-snow slab avalanche (McCammon and Sharaf, 2005). A snow stability test needs to provide information on whether (or better, to what extent) these requirements are fulfilled. In addition, the test should not be difficult to perform, not require special equipment, be completed in less than ∼30 min and provide robust, repeatable results.
Not much is known about the application of snow stability tests in wet snow, and all results described below refer to dry snow conditions.
Snow Stability Tests
We consider two types of observation that provide information on snowpack instability: (1) observations that do not require digging (Jamieson and others, 2009); and (2) observations that require digging a snow pit.
The first category includes whumpfs (sudden failure of the weak layer due to rapid localized loading manifesting itself by collapse), shooting cracks and recent avalanching. These three observations are all unambiguous signs of instability. Further simple observations are, for example, the ski-pole test (Tremper, 2008) and cracking at skis.
The second category includes the snow profile and all tests that in one way or another apply an additional load to the snowpack to induce a failure. The latter include the shear frame test (Roch, 1966; Jamieson and Johnston, 2001), the shovel shear test (McClung and Schaerer, 2006), the hand shear test (Tremper, 2008), the rutschblock test (RB; Föhn, 1987a), the compression test (CT; Jamieson, 1999), the extended column test (ECT; Simenhois and Birkeland, 2006) and the propagation saw test (PST; Gauthier and Jamieson, 2006). The detailed procedures for all these tests are described in Greene and others (2009).
The investigation of snow stratigraphy can be combined/quantified with a structural instability index such as the threshold sum (Schweizer and Jamieson, 2007). In addition, we include the snow micro-penetrometer (SMP; Schneebeli and Johnson, 1998) in our comparison. It records a penetration resistance-depth profile and potentially can be used to assess snowpack instability (Bellaire and others, 2009).
The shovel (or hand) shear test attempts to shear off the slab above the weak layer and hence provides an index of shear strength. It is primarily used to find weak layers rather than to assess weak-layer strength. However, to load the column properly, the weak-layer depth has to be known. The application of the load is delicate and the rating of the load at failure is highly subjective. The shovel shear test has not been validated, whereas the hand shear test has been correlated with local avalanche danger (Jamieson and others, 2009) but not with slope-scale instability.
The shear frame test is the only in situ measurement method that measures shear strength. The weak layer to be tested has to be identified by other means such as a snow profile. The slab is removed apart from a few centimetres just above the weak layer. The shear frame test and the stability indices derived from it (Föhn, 1987b) were shown to be related to snowpack instability (Jamieson and Johnston, 1998; Jamieson and others, 2007). Shear frame measurements from study plots are particularly useful to monitor the temporal evolution of snow stability.
The RB, which can be considered the grandfather of all snowpack stability tests, involves isolating a snow column of 2.0 m (cross-slope) × 1.5 m (upslope). The block is then loaded in stages by a skier until slab failure. The loading step or score (1–7) is noted as well as the release type, i.e. the proportion of the block that released (whole block, most of block, edge only). It has been shown that the RB score is related to the probability of skier-triggered avalanches (Föhn, 1987a; Jamieson, 1995) on the adjacent slope. The RB release type is assumed to be related to fracture propagation propensity, in particular since the RB area (3 m²) is, except for deep weak layers, larger than the area for which the skier’s load is significant (∼1 m²) (Schweizer and Camponovo, 2001). The fact that Schweizer and others (2008b) found a substantially higher sensitivity for the RB release type (81%) than for the RB score (61%) likely supports this assumption.
With the CT a much smaller area (30 cm × 30 cm) is loaded by tapping onto the isolated column. The CT score was related to the probability of skier-triggered avalanches on adjacent slopes (Jamieson, 1999) and the CT score can be related to the RB score. By introducing the fracture character (Van Herwijnen and Jamieson, 2007), the interpretation of the CT was improved. Weak-layer/slab properties associated with sudden fractures (equivalent to Q1 shear quality as introduced by Johnson and Birkeland, 1998) suggest that the fracture character is related to fracture propagation propensity (Van Herwijnen and Jamieson, 2007b).
A number of other small column tests have been developed that all aim at replacing the subjective tapping onto the isolated column by a more quantitative loading procedure. Among those are the rammrutsch (or drop hammer) test (Schweizer and others, 1995; Stewart and Jamieson, 2002), the stuffblock test (Birkeland and Johnson, 1999) and the quantified loaded column test (Landry and others, 2001). Since these are variations of the compression test and most have limited validation data, we do not include them in our analysis.
The recently developed ECT was introduced as a test that should provide information on fracture initiation and propagation (Simenhois and Birkeland, 2006). The ECT involves isolating a column of 30 cm × 90 cm (with the longer side cross-slope) and loading it in one corner, as with the CT. It is noted whether a fracture crosses the entire column. Simenhois and Birkeland (2009) showed that the ECT is highly indicative of snowpack instability on nearby slopes.
The PST was inspired by traditional beam-type tests. It is a fracture mechanical test in which the resistance of a material to fracture in the presence of a crack is tested. In a fracture mechanical test, either the sample is loaded until failure for a given crack length, or the crack length is continuously increased (under constant load) until failure occurs. Gauthier and Jamieson (2006) and Sigrist and Schweizer (2007) were the first to report on a suitable design for a field test. A snow column (30 cm × ∼100 cm) is isolated with the longer side upslope. The length should be at least 100 cm or the slab thickness, whichever is longer. After the weak layer is identified by a separate test (e.g. the CT or profile), a cut is made with a snow saw along the weak layer until the crack length becomes critical and self-propagation of the crack starts. The critical crack length is noted, as is whether the crack propagates to the end of the column. As the free surface influences fracture propagation, D. McClung (personal communication, 2009) suggested not isolating the column at its upslope end. Gauthier and Jamieson (2008a) validated the PST and showed that at sites where weak-layer fracture initiation was confirmed on adjacent slopes, PST results were clearly related to observations of fracture propagation.
Test Indication
Below we rate the above-described observations of snowpack instability with regard to the three principal requirements outlined above: (1) slab/weak-layer stratigraphy; (2) failure initiation; and (3) fracture propagation.
Table 1 compiles the ratings for the simple observations that do not involve digging. Whereas whumpfs, shooting cracks and recent avalanching all indicate that the three requirements are fulfilled, the ski-pole test might at best show the existence of a (thick) weak layer (e.g. when a cohesionless layer (often consisting of depth hoar) is found below a slab). Cracking at skis only indicates that the surface layer is cohesive, but does not provide information on a possible weak layer.
A snow profile provides snow stratigraphy and shows whether a weak layer below a slab exists (Table 2). Whereas the shear frame test only indicates the strength of the weak layer, the shovel shear test is in addition partly suited for identifying weak layers. For the three tests that involve loading an isolated snow column, the slab/weak-layer stratigraphy is shown implicitly when the column fails. In addition, the RB (score and release type) and the ECT both provide information on initiation and propagation. For the CT, the test result is less well related to fracture propagation than for the RB and ECT. Finally, the PST is clearly an index of fracture propagation propensity, but gives no indication of stratigraphy and only limited indication of failure initiation. However, digging the pit and sawing might provide some indirect information on stratigraphy.
Test Limitations
The use of the tests is influenced by their reliability (see below) and their practicality. Table 3 summarizes some key practical limitations of the tests including the time requirement, required slope angle, effective depth and required technical skill level.
Other than the snow profile, which accompanies most tests, the RB is the slowest of the tests and the only one to require a sufficiently steep slope. The hand shear test is fast but is limited to weak layers within about 45 cm of the snow surface. The RB, CT, ECT, PST and SMP are all indicative in the 30–70 cm range, which is important for skier-triggered dry-snow slab avalanches. Ross and Jamieson (2008) report the ECT to be reliable up to about 70 cm in the typically soft snow of the Columbia Mountains of western Canada, while Simenhois and Birkeland (2009) report indicative results up to about 100 cm in snow climates with wind-stiffened slab layers. The hand shear test, RB, CT and ECT have the advantage of requiring the least skill, whereas the snow profile, shear frame test and SMP require the most skill. The SMP is the only test mentioned in this study that requires expensive electromechanical equipment.
Test Accuracy
Early results relating the RB score to the probability of skier-triggered avalanches on nearby slopes have clearly shown that even for high RB scores of 6 or 7 occasionally skier-triggered avalanches were observed (e.g. Jamieson, 1995). These false-stable predictions have shown that snow stability tests are not foolproof and hence that decisions on where and when to travel in avalanche terrain should never rely solely on stability test results. The false predictions have been attributed to spatial variations of snowpack stability but may also be related to differences between the slab-release process and the loading or support in the stability test.
When analysing the performance of snow stability tests, observed stability is compared with predicted stability. In most cases, only two categories of instability were considered: stable and unstable. For example, RBs were performed on skier-triggered (unstable) as well as skier-tested (stable) slopes and RB scores <4 were considered as unstable and ≥4 as stable. This type of analysis simplifies the comparison, but oversimplifies the problem. Nevertheless, we follow this approach and report the test performance by providing the probability of detection (POD, also called sensitivity), the probability of null events (PON, also called specificity) (Doswell and others, 1990) and their mean, i.e. the unweighted average accuracy (RPC):
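sensitivity: $$\mathrm{POD} = \frac{N_{\mathrm{hit}}}{N_{\mathrm{hit}} + N_{\mathrm{miss}}}$$
specificity: $$\mathrm{PON} = \frac{N_{\mathrm{null}}}{N_{\mathrm{null}} + N_{\mathrm{fa}}}$$
unweighted average accuracy: $$\mathrm{RPC} = \frac{\mathrm{POD} + \mathrm{PON}}{2}$$
Here $N_{\mathrm{hit}}$ is the number of unstable cases predicted as unstable, $N_{\mathrm{miss}}$ the number of unstable cases predicted as stable, $N_{\mathrm{null}}$ the number of stable cases predicted as stable and $N_{\mathrm{fa}}$ the number of stable cases predicted as unstable (false alarms).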
A test should have both a high sensitivity and a high specificity. A high sensitivity means that most unstable situations are detected. If the specificity is high as well, then there are only few false alarms. A high sensitivity combined with a low specificity means that the test is oversensitive and produces many false alarms. Whereas false alarms have less severe consequences than misses, a low specificity is not desired since this will promote overcautious decisions which for regional forecasting will lead to a credibility problem in the long run (Williams, 1980). On the other hand, as snow stability tests are commonly used to seek instability rather than stability (McClung, 2002), an unbalanced performance with sensitivity larger than specificity is better than with specificity larger than sensitivity.
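The three performance measures named in the text (POD, PON and their mean, RPC) can be computed from a 2 × 2 table of predicted versus observed stability. The following is a minimal sketch; the class name, field names and the example counts are illustrative assumptions and are not taken from any of the datasets discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContingencyTable:
    """2x2 counts of predicted vs observed stability (field names are illustrative)."""

    hits: int           # predicted unstable, observed unstable
    misses: int         # predicted stable, observed unstable
    false_alarms: int   # predicted unstable, observed stable
    correct_nulls: int  # predicted stable, observed stable

    def sensitivity(self) -> float:
        """Probability of detection (POD): share of unstable cases detected."""
        return self.hits / (self.hits + self.misses)

    def specificity(self) -> float:
        """Probability of null events (PON): share of stable cases rated stable."""
        return self.correct_nulls / (self.correct_nulls + self.false_alarms)

    def unweighted_average_accuracy(self) -> float:
        """Mean of POD and PON (RPC)."""
        return 0.5 * (self.sensitivity() + self.specificity())


# Made-up counts for illustration only (not from Table 4 or Table 6).
table = ContingencyTable(hits=40, misses=10, false_alarms=20, correct_nulls=80)
print(table.sensitivity(), table.specificity(), table.unweighted_average_accuracy())
```

Because RPC averages the two rates rather than pooling all cases, it is not dominated by whichever class (stable or unstable) happens to be more frequent in a dataset.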
To assess the performance of snowpack tests, we consider various datasets (Table 4), primarily recent comparative studies. All datasets include a stability test score and an observed stability. The definition of observed stability may vary. For example, in Schweizer and others (2008b) unstable refers to slopes that were skier-triggered, so the tests were made near the perimeter of a skier-triggered avalanche. In most other datasets slopes were rated as unstable if recent avalanches were observed on adjacent slopes, a whumpf was triggered on the test slope or any other sign of instability was observed. Occasionally, as in the study by Winkler and Schweizer (2009), the slope was also rated as unstable based on an additional criterion, that is whether the profile was rated as poor (Schweizer and Wiesinger, 2001). This rating may favour the RB over the other test methods. For most datasets, at least two different tests were performed on the same test slope, thus allowing comparison of the relative performance of the tests (e.g. Gauthier and Jamieson, 2008b; Moner and others, 2008; Simenhois and Birkeland, 2009; Winkler and Schweizer, 2009). For the dataset presented by Schweizer and others (2008b), only RB results are available. Dataset A contains test results from slopes where several CTs were performed adjacent to an RB by University of Calgary avalanche researchers. These so far unpublished data were collected in the Columbia Mountains of western Canada between December 1996 and March 2008. For all datasets, the predicted stability for a given test method is based on a threshold value. Depending on the test score, the test result is rated either as stable or unstable (Table 5).
Table 6 compiles the performance measures for the datasets presented in Table 4. As the datasets have distinct properties (e.g. in terms of design and circumstances) and not all test methods were included in each dataset, it is not meaningful to calculate an average performance for a given test method. Instead, one way of assessing the different test methods is to compare their unweighted average accuracy with that of the RB within the same dataset (where the sample size is large or similar), which provides the relative performance (Table 7). To check for differences in the performance of the various tests, the two-proportion Z-test (Spiegel and Stephens, 1999) was used.
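For readers unfamiliar with the two-proportion Z-test, a minimal sketch of the textbook pooled form is given below. How exactly the test was applied to the unweighted average accuracy is not specified in the text, so the function simply compares two plain proportions; the function name and the example counts are hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    successes_1: int, n_1: int, successes_2: int, n_2: int
) -> Tuple[float, float]:
    """Return (z, two-sided p) for H0: the two proportions are equal.

    Uses the pooled proportion for the standard error (textbook form).
    """
    p1 = successes_1 / n_1
    p2 = successes_2 / n_2
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_1 + 1.0 / n_2))
    if se == 0.0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical example: 45 of 50 correct for one test, 35 of 50 for another.
z, p = two_proportion_z_test(45, 50, 35, 50)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With small samples the normal approximation behind this test becomes unreliable, which is one reason why differences in the smaller datasets are hard to establish as significant.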
In dataset A, the unweighted average accuracies of the RB or release type, the CT, as well as the CT and fracture character are within 0.01 of the value for the RB; for the RB and release type the accuracy is 0.09 lower; however, the difference is not statistically significant (p = 0.36). In dataset B, the unweighted average accuracy for the threshold sum is comparable to that for the RB; however, the value for the RB and release type is significantly higher (p = 0.01), while the value for the RB or release type is 0.04 (not significantly) higher (p = 0.34). In dataset C, the accuracy for the RB and release type is 0.04 lower, for the RB or release type and the ECT it is comparable, while the value for the CT is 0.16 lower and for the threshold sum is even lower; only the latter two differences are statistically significant. In dataset D, the unweighted average accuracy for the RB and release type is 0.06 lower, the value for the RB or release type is comparable to that of the RB, the value for the CT is 0.19 lower and for the threshold sum is 0.05 lower than for the RB (while the PST is based on a different sample size). None of the observed differences in unweighted average accuracy in dataset D are statistically significant (due to the partly low sample size). Within dataset E1, the accuracy for the ECT is 0.17 (significantly) higher than for the PST (p = 0.002). Finally, in dataset F, the unweighted average accuracy for the ECT is 0.17 (significantly) higher than for the RB (p = 0.01).
In summary, only dataset C has a sufficiently large sample size and includes several tests, allowing a broader comparison. Based on dataset C, the RB and ECT have similar accuracy. On the other hand, dataset E1 suggests that the ECT performs better than the PST, and based on dataset F it seems that the ECT performs better than the RB. Finally, dataset D suggests that the RB and the PST have similar accuracy. Though datasets D, E1 and F are fairly small, the problem of the obviously conflicting conclusions from the various studies cannot be resolved, in particular since none of the studies included all tests and the datasets have partly distinct properties. Based on the available studies and recognizing the differences between datasets, we therefore conclude that the RB, the ECT and the PST have similar accuracy and that the CT and threshold sum are less accurate. Below we report on some of the specific properties of the tests.
If the RB score alone is considered, the RB shows a low false alarm ratio (high specificity) but misses quite a number of unstable situations. The prediction can be improved considerably by also considering the release type (as has been shown by Winkler and Schweizer, 2009). With an RB score ≥4 and only a partial release of the block, the conditions are very likely (99%) rather stable, whereas unstable conditions can be expected (94%) if either the RB score is low (<4) or the whole block is released. These findings have been confirmed by Moner and others (2008). In the case of the CT, considering the fracture character only moderately improves the performance of the CT, which shows a high false alarm ratio, that is the CT is oversensitive.
The ECT shows a very balanced performance and, according to Simenhois and Birkeland (2009), has the best unweighted average accuracy of all tests. However, the results from the study by Hendrikx and others (2009) indicate that under some conditions (which cannot be specified yet) the ECT can also be less accurate (about 40%).
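The rutschblock decision rules discussed here can be written down as simple predicates. The sketch below encodes the score threshold (<4 unstable) and the 'score or release type' rule as stated in the text. The 'score and release type' rule is an assumed reading (unstable only if both criteria indicate instability), since the text does not spell it out, and all names are hypothetical.

```python
from enum import Enum


class ReleaseType(Enum):
    WHOLE_BLOCK = "whole block"
    MOST_OF_BLOCK = "most of block"
    EDGE_ONLY = "edge only"


RB_UNSTABLE_BELOW = 4  # RB scores < 4 are rated unstable, as in the text


def _check_score(score: int) -> None:
    if not 1 <= score <= 7:
        raise ValueError("RB score must be between 1 and 7")


def unstable_by_score(score: int) -> bool:
    """RB score alone."""
    _check_score(score)
    return score < RB_UNSTABLE_BELOW


def unstable_by_score_or_release(score: int, release: ReleaseType) -> bool:
    """Unstable if the score is low OR the whole block released (rule from the text)."""
    return unstable_by_score(score) or release is ReleaseType.WHOLE_BLOCK


def unstable_by_score_and_release(score: int, release: ReleaseType) -> bool:
    """Assumed reading: unstable only if the score is low AND the whole block released."""
    return unstable_by_score(score) and release is ReleaseType.WHOLE_BLOCK


# A high score with a whole-block release is flagged only by the 'or' rule.
print(unstable_by_score(5))                                         # False
print(unstable_by_score_or_release(5, ReleaseType.WHOLE_BLOCK))     # True
print(unstable_by_score_and_release(5, ReleaseType.WHOLE_BLOCK))    # False
```

The 'or' rule trades specificity for sensitivity, while the 'and' rule does the opposite, which matches the general pattern of combining two imperfect indicators.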
The PST does indicate that propagation is unlikely for quite a number of unstable conditions, but shows an unweighted average accuracy of about 80%, comparable with most other tests.
With an unweighted average accuracy of almost 80%, the performance of the SMP is not much lower than the accuracy of the traditional snow stability tests. However, in contrast to most validation studies of the RB, CT, ECT and PST, Bellaire and others (2009) used the same data to establish the instability criterion and to assess the accuracy.
Given that most studies involved many observers with varying experience under a variety of different conditions and revealed test accuracies of 70–90%, it is apparent that, even with a very experienced observer, in at least about 5–10% of the cases snow stability will not be predicted correctly by a single stability test.
Sources of Error
The relatively high number of false predictions undermines the usefulness of snow stability tests. What causes the false predictions? We propose that there are at least two sources of error. The first is related to the test method, the second to the variable nature of the snowpack. Obviously, all test methods are relatively crude methods that involve many subjective elements such as the way of loading. The support or tested area of some tests (e.g. the CT or shear frame) is too small to capture fracture propagation. Any test result will be specific for the test location since the slab as well as weak-layer properties may vary within the slope and be different on adjacent slopes. Hence an individual test result may well under- or overestimate stability. At present, the contribution of the two sources of error to the overall rate of false predictions is unclear. It has been suspected, for example by Schweizer and others (2008a), that, due to test errors, spatial variations of snow stability cannot be detected easily; this would mean that the two errors have similar magnitude.
Spatial variability studies that used snow stability tests in conjunction with the SMP may shed some light on the source of errors. Kronholm (2004) has provided the quartile coefficient of variation (QCV) for the stability test results as well as the weak-layer penetration resistance. The QCV of the stability test scores was in most cases about 30%, whereas it was only about 20% for the weak-layer penetration resistance. As the SMP is considered a high-precision instrument that has produced repeatable results, it can be assumed that at least about one-third of the observed variation in stability test scores was related to test errors and about two-thirds may reflect the real spatial variability of the snowpack. However, it has to be pointed out that the two methods have very different support: 0.09 m² (CT) vs 2 × 10⁻⁵ m² (SMP). In fact, one would expect the variation to increase with decreasing support. On the other hand, the variation in stability includes variations of both weak-layer and slab properties so that it is expected to be higher than the variation of weak-layer penetration resistance. The use of stability tests for spatial variability studies seems questionable given the obviously significant test error, but so far no alternative exists.
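The text does not give the formula for the QCV. The sketch below assumes one common definition of a quartile-based coefficient of variation, (Q3 − Q1)/(Q3 + Q1), which may differ in detail from the one used in the cited study; the function name and the sample values are hypothetical.

```python
import statistics
from typing import Sequence


def quartile_coefficient_of_variation(values: Sequence[float]) -> float:
    """Assumed definition: QCV = (Q3 - Q1) / (Q3 + Q1).

    Quartiles come from statistics.quantiles with its default method;
    other quartile conventions give slightly different values.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    if q1 + q3 == 0:
        raise ValueError("QCV is undefined when Q1 + Q3 equals zero")
    return (q3 - q1) / (q3 + q1)


# Made-up stability scores from a single slope, for illustration only.
scores = [1, 2, 3, 4, 5, 6, 7]
print(quartile_coefficient_of_variation(scores))  # 0.5 for this sample
```

Under this reading, the estimate in the text follows from the ratio of the two reported values: about 20%/30%, i.e. two-thirds of the variation in test scores is matched by variation in penetration resistance, leaving about one-third attributable to test error.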
The uncertainty due to test errors can be decreased substantially if two tests adjacent to each other are conducted. In the case of the RB, Jamieson (1995) has shown that in 97% of cases the test result is within a ±1 score of the slope median, at least on rather uniform slopes. This implies that the probability of the median of two independent tests being within one-half or one step of the slope median score is 0.91 or 0.99, respectively.
Winkler and Schweizer (2009) found that with two adjacent ECTs which provide the same test result, the unweighted average accuracy increased from about 80% to about 90%. Birkeland and Chabot (2006) proposed that the false-stable error rate could be reduced from about 10% to about 1% by making a second test at a representative site beyond the correlation length from the first test and choosing the less stable of the two test results. As the correlation length is unknown, at least about 10 m has been proposed as the distance between two tests (Jamieson and Johnston, 1993; Schweizer and others, 2008a).
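The reduction from about 10% to about 1% is consistent with a simple independence argument, shown here as an illustrative assumption rather than as the cited authors' derivation. If the two tests are far enough apart that their errors are independent, choosing the less stable result gives a false-stable prediction only when both tests are false-stable:

```latex
P(\text{false stable after two tests})
  = P(\text{false stable})^{2}
  \approx 0.10 \times 0.10
  = 0.01
```

If the errors of the two tests are positively correlated, as for tests closer together than the correlation length, the combined false-stable rate lies between 1% and 10%, which is why the spacing of the two tests matters.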
Conclusions
Snow stability tests represent highly prized Class I data. They can be considered as indices of instability. They represent the only way to obtain information on: (1) layering; (2) failure initiation; and (3) fracture propagation (in the absence of obvious signs of instability). Despite obvious deficiencies, they are useful for assessing avalanche risk in backcountry operations as well as for operational forecasting of the regional avalanche danger, in particular in areas with persistent weak snowpack layers.
A good test method should predict stable and unstable conditions similarly well (sensitivity vs specificity). Combinations of test results (from the same or different methods) are useful, as exemplified by RB scores and release type. Comparisons across datasets require cautious interpretation. Nevertheless, with this approach, the ECT has generally higher unweighted average accuracy than other tests. On the other hand, comparisons within datasets suggest that the ECT, RB, PST (and the SMP) generally have a comparable accuracy but higher than the CT. This is likely because the areas of the weak layer tested by the ECT, RB and PST are large enough to represent fracture propagation, whereas the CT tests for fracture initiation in about 0.09 m² of the weak layer, and hence has low specificity. The threshold sum provides no direct information about fracture initiation or propagation and has a lower unweighted average accuracy than any of the tests that fracture weak layers. This is consistent with LaChapelle (1980) who stated that observations of snowpack mechanics were more directly related to avalanching than observations of stratigraphy.
Even with very experienced observers an error rate of at least about 5–10% has to be expected. Site selection and interpretation require experience. Stability tests are not foolproof, and decisions about travelling in avalanche terrain should not be based solely on stability test results. Obviously, test reliability increases when two adjacent tests are carried out. However, a second test on a different slope (or at a second site on the same slope which is more than the autocorrelation length from the first site) should be more useful than the same test repeated in the same snow pit.
While accuracy is relevant when selecting a test for various scales of forecasting or backcountry decisions, considerations such as the effective depth, required time and technical skill are also important.
Acknowledgements
We thank I. Moner and D. Gauthier for providing additional information on their datasets, and A. van Herwijnen for valuable suggestions on the manuscript. For financial and logistical support, B.J. is grateful to the Natural Sciences and Engineering Council of Canada, HeliCat Canada, the Canadian Avalanche Association, the Canada West Ski Areas Association, Mike Wiegele Helicopter Skiing, Backcountry Lodges of British Columbia, the Association of Canadian Mountain Guides, Parks Canada, the Canadian Ski Guide Association, and Teck Mining Company. J.S. acknowledges support by the European Commission under contract NEST-506 2005-PATH-COM-043386 (Triggering of instabilities in materials and geosystems (TRIGS)). Thorough reviewer comments helped to improve this paper.