1. Introduction
Since the advent of satellite remote sensing platforms in the 1970s, observational data have grown exponentially. In the span of decades, the polar community went from having, on average, one image per year of a polar study site to having (potentially) multiple images per day. Accompanying this increase in observations is the need for efficient feature analysis.
Segmentation techniques are needed to quantify the frequency, size, and location of many different features in glaciology – for example, icebergs, terminus position, crevasses, and surface water – particularly in rapidly changing polar regions. Traditional remote sensing indices (e.g. the Normalized Difference Water Index, NDWI) and object detection algorithms have been used successfully to delineate features such as surface water (Chudley and others, Reference Chudley2021), terminus position (Liu and Jezek, Reference Liu and Jezek2004; Seale and others, Reference Seale, Christoffersen, Mugford and O'Leary2011) and icebergs (Sulak and others, Reference Sulak, Sutherland, Enderlin, Stearns and Hamilton2017; Moyer and others, Reference Moyer, Sutherland, Nienow and Sole2019). However, these techniques rely heavily on image pre-processing, sensor stability, and homogeneous environments (e.g. seasonally variable snow, melt, or sea ice in the background will impact classification results). To take advantage of the range of satellite sensors imaging polar regions, a segmentation algorithm that is agnostic to sensor type and to seasonal shifts in the environment is needed.
The resurgence of artificial intelligence in 2006 (Hinton and others, Reference Hinton, Osindero and Teh2006), followed by the success of AlexNet in 2012 (Krizhevsky and others, Reference Krizhevsky, Sutskever and Hinton2012), helped to jump-start machine learning and deep learning algorithms. Convolutional Neural Networks (CNNs) that focus on object detection, semantic segmentation, and instance segmentation provide a methodology inspired by the visual cortex to understand various scenes and identify specific objects. As a result, CNNs have recently been used to segment surface lakes (Yuan and others, Reference Yuan2020), crevasses (Lai and others, Reference Lai2020; Zhao and others, Reference Zhao, Liang, Li, Duan and Liang2022), icebergs (Bentes and others, Reference Bentes, Frost, Velotto and Tings2016; Rezvanbehbahani and others, Reference Rezvanbehbahani, Stearns, Keramati, Shankar and van der Veen2020), glacier termini (Krieger and Floricioiu, Reference Krieger and Floricioiu2017; Baumhoer and others, Reference Baumhoer, Dietz, Kneisel and Kuenzer2019; Mohajerani and others, Reference Mohajerani, Wood, Velicogna and Rignot2019; Zhang and others, Reference Zhang, Liu and Huang2019), and other features. However, a major roadblock in using CNNs is that they require large training datasets; a robust custom-trained segmentation model may require 10 000 training labels. The absence of good training data greatly impacts the performance, and thus utility, of deep learning models in the Earth sciences.
The recently released Segment Anything Model (SAM) by Meta AI Research is a foundational model in the field of artificial intelligence (Fig. 1). Foundational models are deep learning models built on large amounts of unlabeled training data through self-supervised learning (Schneider, Reference Schneider2022). As a result, foundational models perform efficiently for instance and semantic segmentation, and for object classification and detection. Since their inception in 2018, several versions of these large-scale models have been released, such as Dall-E 2 (Ramesh and others, Reference Ramesh, Dhariwal, Nichol, Chu and Chen2022) and GPT-3 (Brown and others, Reference Brown2020). The key advantage of foundational models is that they generalize through self-supervised learning while requiring far fewer training labels than CNN models.
2. Methods
While SAM does not require training data, model performance can be improved by adding “prompts” (see Supplementary Figs. S1 & S2). Prompts allow the user to identify features of interest (and features that are not of interest) and can be either points or boxes. We quantify the performance of SAM with the no-prompt and with-prompt approaches by calculating the F1 score for each image. Our dataset, like most real-world datasets, is imbalanced (the features being detected occupy far fewer pixels than the background). As a result, the F1 metric most accurately represents model performance. The F1 score ranges between 0 and 1; segmentation results with an F1 score close to 1 are good. To prepare ground truth data for validating the model, we created manual annotations using the V7 Labs Darwin application (V7Labs, 2023) on an iPad Pro. The V7 Labs Darwin annotation tool, combined with the iPad Pro stylus, significantly improves the speed, accuracy, and control of labeling (V7Labs, 2023).
Semantic segmentation is an important form of data extraction heavily used within cryosphere research. However, the complexity of segmenting glaciological features makes it difficult to create an automated segmentation approach. With SAM we do not incorporate any additional training data or pre-process any imagery. There are currently three SAM encoders – ViT-B (Vision Transformer Base), ViT-L (Vision Transformer Large), and ViT-H (Vision Transformer Huge) – which differ in parameter count. We found that the ViT-L encoder performs most consistently for our datasets, so all results are generated with the ViT-L encoder (see Supplementary Fig. S3).
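As an illustrative sketch (not part of the published workflow), the ViT-L setup can be initialized with Meta's open-source segment-anything package. The checkpoint filenames are those distributed by Meta; the helper function and local paths are assumptions for illustration.

```python
# Checkpoint files published by Meta alongside the three SAM encoders.
CHECKPOINTS = {
    "vit_b": "sam_vit_b_01ec64.pth",
    "vit_l": "sam_vit_l_0b3195.pth",
    "vit_h": "sam_vit_h_4b8939.pth",
}

def build_mask_generator(encoder="vit_l"):
    """Load a SAM model and wrap it in an automatic mask generator."""
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
    sam = sam_model_registry[encoder](checkpoint=CHECKPOINTS[encoder])
    return SamAutomaticMaskGenerator(sam)

# Usage (requires the downloaded checkpoint in the working directory):
# generator = build_mask_generator("vit_l")
# masks = generator.generate(rgb_image)  # rgb_image: (H, W, 3) uint8 array
```

Each entry in the returned mask list is a dictionary containing, among other fields, a boolean "segmentation" array and an integer "area" pixel count.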
2.1 Data
We acquire Sentinel-1 and Sentinel-2 imagery from Google Earth Engine (Gorelick and others, Reference Gorelick2017). The Sentinel-1 SAR image used in this study is obtained in Interferometric Wide Swath (IW) mode at a spatial resolution of 20 × 22 m (pixel spacing of 10 m) and with HH, HV, and HHxHV polarization bands. The Planet imagery is obtained from the PlanetScope sensors accessed via the Planet data portal; PlanetScope images have a 3 m spatial resolution. The timelapse imagery is obtained from a Stardot Technologies CAM-SEC5-B camera with a standard 4.5–10 mm varifocal lens (LEN-MV4510CS). Landsat-4, Landsat-5, and Landsat-8 images are downloaded through the USGS Earth Explorer. We chose these remote sensing platforms because they are commonly used datasets in glaciology research. All the optical images have an RGB band combination. To create a diverse dataset for this analysis, we use images from different regions of the Greenland Ice Sheet, as shown in Figure 2.
2.2 Mask generation
For the no-prompt approach, individual instance segmentation results are generated by SAM. SAM can detect multiple feature instances within a single image, such as glacier termini, icebergs, fjord walls, snow, and water. However, SAM is not an object detection model: it does not recognize that these features are icebergs, sea ice, terminus, land, water, or something else. The user needs to provide the context of what is present within the image, similar to how most deep learning models operate. A potential future enhancement of SAM, or of a derivative of SAM, could be to build an object detection model (like the object detection model “You Only Look Once”, or YOLO) that identifies whether something is an iceberg or a glacier terminus. To extract mask instances, we add all detected instances to a new 2D array of the same shape as the original image, writing foreground values (1 s) at the pixel locations of every detected foreground instance. During testing of the model, there are certain instances where the entire scene is detected as an object. To overcome this, we add a condition to exclude such instances: we remove any instance larger than 25% of the original image, as that suggests a background detection too large to be a feature of interest (e.g. an iceberg or supraglacial lake). For other glaciological features, such as a glacier terminus, all instances are saved as an image and the potential feature of interest (the glacier terminus) is extracted from the stack of instances identified by SAM. The number of instances in such scenes is small, so this selection process is quick. For the terminus scene, the no-prompt segmentation classifies the glacier and the land features at the edge of the image under a single class, while the with-prompt segmentation classifies the mélange and land together.
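The instance-combination and size-filtering steps described above can be sketched as follows; the mask dictionary format is that returned by segment-anything's automatic mask generator, and the function name is illustrative.

```python
import numpy as np

def combine_instances(masks, image_shape, max_fraction=0.25):
    """Merge SAM instance masks into a single binary foreground image.

    masks: list of dicts as returned by SamAutomaticMaskGenerator.generate(),
           each with a boolean "segmentation" array and an integer "area".
    """
    h, w = image_shape[:2]
    combined = np.zeros((h, w), dtype=np.uint8)
    for m in masks:
        # Skip over-large instances: likely a background detection,
        # not a feature of interest such as an iceberg or lake.
        if m["area"] > max_fraction * h * w:
            continue
        combined[m["segmentation"]] = 1  # foreground value at instance pixels
    return combined
```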
For the with-prompt approach, we create two different shapefiles: one consists of prompt points representing the foreground, or object of interest (value of 1), and the other consists of prompt points representing the background (value of 0). The location coordinates (the row and column values of each prompt) are added to the point shapefiles. Each shapefile is then read, and the coordinates are extracted as an array with corresponding labels of 1 s and 0 s (ones and zeros).
Prompts are selected based on the foreground, or objects that will be segmented (1 s), and background (0 s). The necessity and number of prompts are dictated by the density of the objects of interest, the radiometric diversity of the objects in the dataset, and the complexity of the scenes. For example, icebergs and supraglacial lakes are small features (compared to glacier termini) and are found in varied and radiometrically complex backgrounds. This makes the selection process of foreground (1 s) and background (0 s) essential. Other features, such as crevasses, are narrow and sometimes hard to differentiate from the background. In these cases, we place prompts on the background (0 s) adjacent to the objects (1 s) to help the model recognize the importance of these characteristic gradients. For larger features, such as glacier termini, prompts are placed across the terminus to get a range of radiometrically different pixels. The final binary image is created based on the features that the model detects (1 s) and the background (0 s) (Fig. 3).
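A minimal sketch of assembling point prompts into the arrays SAM's promptable predictor expects: an (N, 2) coordinate array and an (N,) label array (1 = foreground, 0 = background). The helper function is hypothetical; note that segment-anything expects coordinates in (x, y), i.e. (column, row), order.

```python
import numpy as np

def build_prompts(foreground_pts, background_pts):
    """Stack (x, y) prompt coordinates with 1/0 labels for SAM."""
    coords = np.array(list(foreground_pts) + list(background_pts), dtype=float)
    labels = np.array([1] * len(foreground_pts) + [0] * len(background_pts))
    return coords, labels

# Usage with segment-anything's promptable predictor:
# from segment_anything import SamPredictor
# predictor = SamPredictor(sam)
# predictor.set_image(rgb_image)
# coords, labels = build_prompts(fg_points, bg_points)
# masks, scores, _ = predictor.predict(point_coords=coords, point_labels=labels)
```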
2.3 Validation
To determine the validity of the SAM model, we generate the F1 score for each scene, which is a useful metric for imbalanced datasets (Jeni and others, Reference Jeni, Cohn and De La Torre2013). The F1 score is quantified for each scene by comparing SAM results with our manual labels, using the following equations, which account for true positives (TP), false positives (FP), and false negatives (FN):

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 = 2 × (Precision × Recall) / (Precision + Recall)
The F1 score is the harmonic mean of precision and recall. Precision focuses on minimizing false positives (i.e. it measures the true positives among all positives predicted by the model). Recall focuses on minimizing false negatives; in other words, recall measures the correctly identified objects (positives) among the manually identified objects.
Precision, recall, and the F1 score are instrumental in determining the performance of a semantic segmentation model. The F1 score lies between 0 and 1; the closer it is to 1, the better the performance of the model, as it is an overall representation of strong true positive and true negative rates. However, it is difficult to set a threshold for an “acceptable” F1 score, as this depends on the study. Ideally, a model has both very high precision and very high recall, yielding a strong overall F1 score. It is also important to assess the precision and recall values of the foreground and the background separately, as this indicates whether model performance is being affected by higher false positives, higher false negatives, or both. Another way to do this is to create a confusion matrix for each scene, which provides a visual representation of the numbers of true positives, true negatives, false positives, and false negatives.
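The per-scene validation described above can be sketched as a pixel-wise comparison between a SAM binary mask and a manual label; the function name is illustrative.

```python
import numpy as np

def f1_score(prediction, label):
    """Compute precision, recall, and F1 for two binary (0/1) arrays."""
    tp = np.sum((prediction == 1) & (label == 1))  # true positives
    fp = np.sum((prediction == 1) & (label == 0))  # false positives
    fn = np.sum((prediction == 0) & (label == 1))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```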
3. Results
3.1 SAM performance across different sensors
Many polar remote sensing applications need to balance the acute trade-offs between consistent year-round imagery from Synthetic Aperture Radar (SAR) and long-term, high-resolution imagery from optical remote sensing sensors. Segmentation techniques that work across platforms are therefore critical in building robust datasets (e.g. Zhao and others, Reference Zhao, Liang, Li, Duan and Liang2022). Here, we assess SAM performance across imagery commonly used in glaciology (Fig. 4). Sentinel-1 is a polar-orbiting C-Band SAR with a spatial resolution of 20 × 22 m. Sentinel-2 is a polar-orbiting optical satellite with a ground resolution of 10 m. We also test SAM segmentation on a suite of Landsat satellites, namely Landsat-4 (60 m), Landsat-5 (30 m), and Landsat-8 (15 m). We include one optical CubeSat sensor, PlanetScope, which has a spatial resolution of 3 m. Finally, we explore the performance of SAM on in situ timelapse photographs. Result metrics are shown in Table 1.
In Figure 4a, automatic segmentation provides good results on the high-resolution Planet image but is impacted by false positives. These false positives are eliminated by adding 20 points as prompts (10 identifying the icebergs and 10 identifying the background), resulting in an F1 score of 0.91. We find similar improvements with the coarser Sentinel-2 imagery: adding prompts improves the F1 score from 0.52 to 0.64 – in particular, the prompts help SAM detect smaller icebergs. Larger icebergs were detected successfully with both approaches. There are several small icebergs (hardly visible in Fig. 4b) that are missed even with the prompts, which suppresses the overall F1 score.
The Sentinel-1 image is an RGB composite of the HH, HV, and HHxHV polarization bands. As with most SAR images, it is noisy, especially compared to the optical remote sensing images. Despite the noise, the no-prompt approach successfully segments all prominent icebergs. Adding points improves model performance, particularly because the background itself is so noisy.
Performance on the timelapse photograph is strong; the no-prompt approach has an F1 score of 0.83 and the with-prompt approach 0.80. Depending on the range of pixel intensity throughout the image, and the gradients between features of interest and the background, detection can become complex. When prompts are provided on specific features, SAM includes such pixels, and potential features like them, as part of the segmentation. However, in this study we use only 10 prompts for the foreground and 10 for the background; depending on the complexity of the scene, these may not be sufficient. In Figure 4d, an increased number of prompts, placed across the range of gradients, might improve the performance of the with-prompt model.
The Landsat satellite system is the longest-running satellite constellation and is widely used in glaciology research, so we also assess the performance of SAM across the multiple spatial resolutions that Landsat provides. Here we present the segmentation capabilities of SAM on Landsat-4 (60 m), Landsat-5 (30 m), and Landsat-8 (15 m) images from West Greenland (Fig. 5). We find that the SAM no-prompt approach performs better than the with-prompt approach on lower spatial resolution imagery. This result is evident in the F1 score as well as the precision and recall scores of the two approaches. As we transition to the higher spatial resolution (15 m) imagery of Landsat-8, the with-prompt approach performs better and yields a higher F1 score. It is likely that the higher-resolution imagery provides stronger gradients between the background and the foreground.
3.2 SAM performance across different zoom levels
Segmentation results from SAM also depend on the size of the object relative to the size of the image: very small objects, surrounded by a lot of background, are hard to segment. Creating smaller sub-images from the larger image increases the relative size of the objects, allowing SAM to segment small objects that were previously discarded in the larger image (Fig. 6a). This approach gives the user control over the feature size of interest. The F1 score improves when zooming in because the model detects more of the smaller icebergs (which were included in the manual labels used to calculate the F1 score; see Supplementary Fig. S1).
Creating subsets of the original images, and mosaicking them together after SAM implementation, does add pre- and post-processing steps when working on larger regions of interest, such as a fjord or basin.
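The subset-and-mosaic workflow can be sketched as follows: a large scene is split into a grid of tiles, a per-tile segmentation function is applied, and the binary results are stitched back to the original extent. The function name and default tile size are illustrative choices, not values from the study.

```python
import numpy as np

def tile_and_mosaic(image, segment_fn, tile=512):
    """Apply segment_fn to each tile and reassemble a full-scene binary mask."""
    h, w = image.shape[:2]
    mosaic = np.zeros((h, w), dtype=np.uint8)
    for r in range(0, h, tile):
        for c in range(0, w, tile):
            # Edge tiles are smaller than `tile`; slicing handles this.
            sub = image[r:r + tile, c:c + tile]
            mosaic[r:r + tile, c:c + tile] = segment_fn(sub)
    return mosaic
```

Here `segment_fn` could wrap the no-prompt SAM pipeline, returning a 0/1 mask for each sub-image.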
3.3 SAM performance across different glaciology features
We assess the broader utility of SAM for cryosphere research by testing it on five different cryosphere features: crevasses, icebergs in sea ice, icebergs in mélange, supraglacial lakes, and a glacier terminus (Fig. 7). We use Sentinel-2 imagery for all the features except crevasses, which are generally too narrow to segment in 10 m imagery; for crevasse segmentation, we use Planet imagery.
Our SAM results show that the with-prompt approach provides highly accurate results, particularly for supraglacial lakes and terminus positions. Crevasses were essentially undetectable without prompts, and the F1 score improved to 0.44 with prompts. There are several narrow and short crevasses that SAM did not detect even with 20 prompts; additional prompts, or a zoomed-in image, would likely improve this performance. The F1 score for icebergs in sea ice was consistently strong both with and without prompts (0.88). Icebergs in mélange were better detected without prompts (0.78) than with prompts (0.71); in this scenario, the similarity between background and features caused the prompted model to overestimate the features. In the supraglacial lakes example, the presence of two large false positives in the predicted image makes the F1 score of SAM low, at 0.48. The precision and recall scores show that the precision of the model is low for foreground detection (supraglacial lakes) due to these false positives, impacting the overall F1 score. Adding prompts improves the model substantially.
For iceberg segmentation, no-prompt SAM segmentation provides fast and fairly consistent results across all sensors, spatial resolutions, and environmental conditions, such as open water, sea ice, and mélange. For the prompt-based approach, prompts were placed in radiometrically different locations to ensure that the model samples a range of pixel values. For all features, prompts generally produce better SAM segmentation results.
4. Discussion and conclusion
SAM, as a foundational model, has been trained on unlabeled data through self-supervised learning, which allows the model to generalize. Additionally, the training dataset for SAM comprises 11 million images and over 1.1 billion masks, making the training data very diverse. Convolutional Neural Networks (CNNs) are conventionally trained on large amounts of labeled data, which allows the model to successfully segment objects within an image. However, the computational cost, the effort of labeling large and diverse training datasets, and the need for sufficiently deep architectures make CNN models challenging to implement.
Our implementation of the SAM model shows that it is a robust segmentation model, adaptable across different satellite sensors in both no-prompt and with-prompt workflows. In noisy images, such as Sentinel-1 SAR, we find that the no-prompt approach robustly identifies all major icebergs; using prompts helped the model also detect smaller icebergs alongside the prominent ones. This is a major advantage for the SAR community in climate and Earth science, as the adaptability of the SAM model to produce semantic and instance segmentation datasets promotes data fusion workflows, thereby resulting in an overall improvement in temporal resolution. Another important example of the adaptability of the SAM model is in identifying icebergs in timelapse photographs, as shown in Figure 4d. Timelapse is an extremely popular form of imagery in polar research and is used in classification, kinematics, and feature tracking (Messerli and Grinsted, Reference Messerli and Grinsted2015; Giordan and others, Reference Giordan, Dematteis, Allasia and Motta2020). Both the no-prompt and with-prompt segmentation results were strong, with F1 scores of 0.83 and 0.80, respectively.
In Rezvanbehbahani and others (Reference Rezvanbehbahani, Stearns, Keramati, Shankar and van der Veen2020), iceberg segmentation is done using the CNN model UNet applied to PlanetScope imagery. In that study, the F1 score of the iceberg semantic segmentation is 0.89 after extensive training on more than 10 000 manually-digitized iceberg labels. In comparison, semantic segmentation of icebergs on PlanetScope imagery using SAM scores 0.87 with the no-prompt approach and 0.91 with the with-prompt approach. Our results are comparable to those of the highly trained CNN UNet model, without any added training data. With minimal to no training, successful semantic segmentation of complex glaciological features can be achieved on images on which the model has not been trained.
An interesting aspect of our assessment of SAM was the comparison of the no-prompt and with-prompt approaches. We assessed the performance of both approaches across different conditions, spatial resolutions, and glaciological features. Our results show that, in general, the no-prompt approach works well for images across a wide range of spatial resolutions. A uniform distribution of objects within the image helps SAM provide consistent segmentation results. However, when this distribution changes, or if the objects do not have a strong gradient, results from the no-prompt approach are more susceptible to false positives. The with-prompt approach better supports a non-uniform distribution of objects within the image, and provides a high degree of control when delineating objects in close proximity or objects that are difficult to detect, such as crevasses. In low-resolution images, such as Landsat-4 (60 m), and for objects with subtle gradients, such as icebergs in mélange, the with-prompt approach was unable to delineate objects. This shortcoming may also be due to the limited number of prompts for the foreground and background; for such images, increasing the number of prompts and improving their placement will likely improve segmentation results.
We find that SAM is highly adaptable across the different image types compared in this study. The results of SAM for PlanetScope, Sentinel-1, Sentinel-2, Landsat-4, Landsat-5, Landsat-8, and timelapse photographs, for multiple glaciological features and with minimal to no user input, are encouraging. This study shows that developing a segmented dataset across multiple remote sensing platforms is feasible, even in the absence of labeled datasets. Additionally, new remote sensing datasets can be included without sacrificing pre-existing workflows.
4.1 Conclusion
SAM's ability to identify features in simple and complex images means that high temporal resolution datasets can be created by combining segmentation results from optical and SAR remote sensing imagery. For larger regions, where the coverage area is several square kilometers, we find that creating gridded subsets of the image aids the model in focusing on smaller features, such as lakes and icebergs. This additional step allows glaciological features to be studied in more detail, and enables scaling SAM to process regions as large as an ice sheet.
Overall, SAM provides a comprehensive approach to implementing deep learning in glaciology, generating robust segmentation results for different use cases with fast setup, high accuracy, and minimal to no user input. The no-prompt approach provides consistent results across different features, image types, and spatial resolutions. However, in images where foreground gradients are subtle or features are less pronounced (e.g. crevasses), the no-prompt approach is unable to segment successfully. The with-prompt approach provides greater control over object segmentation in such images. Low-resolution images, however, do limit segmentation of smaller features such as icebergs. A potential workaround for such limitations of the with-prompt approach is to increase the number of prompts, which improves data diversity and may aid the model in identifying features. The dearth of good training data holds true for many state-of-the-art deep learning studies in glaciology, limiting the generalizability of such models. For studies where a state-of-the-art model is preferred, SAM can act as an efficient tool for generating large amounts of training data, enabling the creation of a more generalized model.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/jog.2023.95
Data
Sentinel-1 and Sentinel-2 imagery used in this study is available via Google Earth Engine (https://code.earthengine.google.com/) and accessed from the Google Earth Engine data catalog (https://developers.google.com/earth-engine/datasets/). Planet imagery is accessed via Planet's data portal (https://www.planet.com/). The image IDs that are used in this study are available in the Supplementary Table S1.
The code developed for extraction of features of interest has been made available at https://github.com/leigh-stearns/segment-anything
Acknowledgements
We would like to acknowledge the Scientific Editor of the Journal of Glaciology, Prof. Hester Jiskoot and both reviewers for their suggestions and constructive criticism. Their support greatly improved the quality of this manuscript.
Author's contributions
S.S. conceptualized and developed the study. S.S. and L.A.S. designed and implemented the workflow. S.S., L.A.S., and C.J.v.d.V. contributed to analyzing the results. S.S. led the writing of the manuscript with contributions from all authors.
Financial support
S.S. and L.A.S. were supported by NASA grant NNX16AJ90G.