SPSVO: a self-supervised surgical perception stereo visual odometer for endoscopy

Junjie Zhao; Yang Luo; Qimin Li; Natalie Baddour; Md Sulayman Hossen

doi:10.1017/S026357472300125X

Published online by Cambridge University Press: 29 September 2023

Junjie Zhao: Affiliation:
College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing, China
Yang Luo*: Affiliation:
College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing, China
Qimin Li: Affiliation:
College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing, China
Natalie Baddour: Affiliation:
Department of Mechanical Engineering, University of Ottawa, Ottawa, ON, Canada
Md Sulayman Hossen: Affiliation:
College of Civil Engineering, Chongqing University, Chongqing, China
*: Corresponding author: Yang Luo; Email: [email protected].

Abstract

Accurate tracking and reconstruction of surgical scenes is a critical enabling technology toward autonomous robotic surgery. In endoscopic examinations, computer vision has provided assistance in many aspects, such as aiding in diagnosis or scene reconstruction. Estimation of camera motion and scene reconstruction from intra-abdominal images are challenging due to irregular illumination and weak texture of endoscopic images. Current surgical 3D perception algorithms for camera and object pose estimation rely on geometric information (e.g., points, lines, and surfaces) obtained from optical images. Unfortunately, standard hand-crafted local features for pose estimation usually do not perform well in laparoscopic environments. In this paper, a novel self-supervised Surgical Perception Stereo Visual Odometer (SPSVO) framework is proposed to accurately estimate endoscopic pose and better assist surgeons in locating and diagnosing lesions. The proposed SPSVO system combines a self-learning feature extraction method and a self-supervised matching procedure to overcome the adverse effects of irregular illumination in endoscopic images. The framework of the proposed SPSVO includes image pre-processing, feature extraction, stereo matching, feature tracking, keyframe selection, and pose graph optimization. The SPSVO can simultaneously associate the appearance of extracted feature points and textural information for fast and accurate feature tracking. A nonlinear pose graph optimization method is adopted to facilitate the backend process. The effectiveness of the proposed SPSVO framework is demonstrated on a public endoscopic dataset, with the obtained root mean square error of trajectory tracking reaching 0.278 to 0.690 mm. The computation speed of the proposed SPSVO system can reach 71ms per frame.

Keywords

Visual odometer (VO); virtual endoscopy; self-supervision; pose estimation

Type: Research Article
Information: Robotica, Volume 41, Issue 12, December 2023, pp. 3724-3745

DOI: https://doi.org/10.1017/S026357472300125X
Copyright: © Chongqing University, 2023. Published by Cambridge University Press

1. Introduction

Gastrointestinal cancer is the second leading cause of cancer death in the world and accounts for about 35% of all cancer-related deaths [Reference Bernhardt, Nicolau, Soler and Doignon1, Reference Shao, Pei, Chen, Zhu, Wu, Sun and Zhang2]. Some hospitals are now equipped with two-dimensional endoscopic instruments for doctors, such as the da Vinci® surgical system (Intuitive Surgical, Inc., Sunnyvale, CA), to assist in performing minimally invasive surgery (MIS) of the gastrointestinal tract, abdominal cavity, chest cavity, and throat. The most direct and effective screening for gastrointestinal cancers is two-dimensional endoscopy, such as capsule endoscopy, upper gastrointestinal endoscopy, and colonoscopy [Reference Feuerstein3–Reference Low, Morris, Matsumoto, Stokken, O’Brien and Choby6].

In traditional endovascular MIS processes, the position of diseased tissue is generally estimated by visually examining 2D endoscope images. However, the endoscope images usually lack sufficient texture. When combined with irregular illumination, extensive, similar areas, and low contrast, it becomes difficult for surgeons to quickly and accurately locate lesions. Other problems due to hand-eye coordination and visual misdirection may also occur during operation [Reference Afifi, Takada, Yoshimura and Nakaguchi7]. Recently, computer vision-based algorithms have attracted much attention for success in stereoscopic endoscope position tracking and providing intraoperative reconstruction of surgical scenes. Tatar et al. [Reference Tatar, Mollinger, Den Dulk, van Duyl, Goosen and Bossche8] attempted to use a depth camera combined with a time-of-flight method to locate positions of surgical instruments. Lamata et al. [Reference Lamata, Morvan and Reimers9] investigated the features (mutual reflection, diffuse reflection, highlight parts, and colors) of human liver photos based on the Lambert-body method and tried to reconstruct a 3D model of the liver by adjusting the albedo and light intensity of the endoscopic images. Wu et al. [Reference Wu, Sun and Chang10] aimed to track geometric constraints of surgical instruments and reconstruct 3D structures from 2D endoscopic images with a constrained decomposition method. Seshamani et al. [Reference Seshamani, Lau and Hager11] combined a video mosaic method and an online processing technique to expand the field of view to better assist surgeons in performing surgeries and lesion diagnosis. Due to the complex features of an enterocele, endoscopic images often have strong illumination variation and feature sparsity, resulting in difficulties for the aforementioned methods to realize precise organ 3D reconstruction and lesion localization.

Recently, the Structure from Motion (SfM) approach was proposed to construct high-quality 3D models of human organs based on endoscopic images. The SfM approach mainly consists of feature extraction, keypoint matching, pose estimation, and bundle adjustment. Based on the SfM technique, Thormaehlen et al. [Reference Thormahlen, Broszio and Meier12] generated a 3D model of the human colon with surface texture features. Koppel et al. [Reference Koppel, Chen and Wang13] developed an automated SfM approach to reconstruct a 3D model of the colon from endoscopic images to assist surgeons in surgical planning. Mirota et al. [Reference Mirota, Wang and Taylor14] proposed a direct SfM approach to track endoscope position using video data to improve the accuracy of Endonasal Skull Base Surgery navigation. Kaufman et al. [Reference Kaufman and Wang15] applied a direct Shape from Shading (SfS) algorithm to better extract detailed information of surface textures from endoscopic images and combined it with the SfM method to reconstruct a refined 3D model of human organs. Assisted by manual drawing of the outline of the major colonic folds, Hong et al. [Reference Hong, Tavanapong and Wong16] reconstructed a virtual colon segment based on an individual colonoscopy image to aid surgeons in detecting colorectal cancer lesions. However, accurate reconstruction of human organs based on SfM methods requires stable camera motion, since feature points must be matched between multiple images to calculate the camera pose. Furthermore, data obtained from sensors such as monocular cameras, Inertial Measurement Units, ultrasonic lidar, etc., are usually large, thus requiring substantial computing resources to perform batch data processing. Hence, SfM techniques are usually applied offline.
For actual surgical operation, real-time feedback plays an important role in providing surgeons with timely and accurate information to allow them to make optimal decisions and adapt their approach as necessary during the procedure. A real-time online computer vision-based algorithm is hence highly desirable to improve accuracy and precision of surgical interventions and reduce the risk of complications or adverse outcomes.

The Visual Simultaneous Localization and Mapping (VSLAM) method is a real-time online data processing technique which requires fewer computing resources than the SfM approach. VSLAM utilizes endoscopic video or image sequences to estimate the pose and location of the endoscope and to reconstruct the abdominal cavity and other scenes of the MIS [Reference Jang, Han, Yoon, Jai and Choi17–Reference Xie, Yao, Wang and Liu19]. The goal of VSLAM is to improve the visual perception of surgeons, and it plays an important role in developing online surgical navigation systems and medical augmented reality technology. Much research in recent years has focused on improving the accuracy and efficiency of VSLAM methods for medical applications, particularly in the context of MIS systems. Mountney et al. [Reference Mountney, Stoyanov and Davison20] first explored the application of VSLAM in MIS by extending the Extended Kalman Filter SLAM (EKF-SLAM) framework to handle complex light reflection and low-texture environments. However, the obtained point clouds were too sparse to represent the 3D shapes and detailed surface textures of human organs. Mountney and Yang [Reference Mountney and Yang21] proposed a novel VSLAM method to estimate tissue deformation and motion of the laparoscopic camera online by establishing a periodic tissue deformation parameter model and generating a joint registered 3D map with preoperative data. However, the slow speed of the system’s map-building algorithm can lead to poor real-time tracking and loss of feature points. In [Reference Klein and Murray22], Klein and Murray proposed the Parallel Tracking and Mapping (PTAM) algorithm, a monocular VSLAM approach based on keyframes. PTAM can run in real time on a single CPU and handle large-scale environments and a variety of lighting conditions. However, it requires high-quality feature detection and feature matching for camera localization and scene mapping.

The aforementioned methods are generally based on monocular endoscopes, where it is difficult to process endoscopic images with small viewing angles and rapid frame transitions. Lin et al. [Reference Lin, Johnson and Qian23] extended the application scope of PTAM to stereo endoscopy, which allows for simultaneous stereoscopic tracking, 3D reconstruction, and detection of deformation points in the MIS setting, and can generate denser 3D maps compared to EKF-SLAM methods. However, this stereo system suffers from time-consuming feature point matching. Later, Lin et al. [Reference Lin, Sun, Sanchez and Qian24] improved texture feature selection, green channel selection, and reflective area processing of the endoscopic images and proposed a revised VSLAM method to restore the surface structure of a 3D scene of abdominal surgery based on SLAM. However, the proposed method relies heavily on tissue surface vascular texture. In cases where the tissue being imaged has little or no vascularity, this method may not be effective in detecting unique features. Recently, Mur-Artal et al. [Reference Mur-Artal, Montiel J.M. and Tardos25] provided the ORB-SLAM system, constructed via a robust camera tracking and mapping estimator with remarkable camera relocation capabilities. Mahmoud et al. [Reference Mahmoud, Cirauqui and Hostettler26] applied the ORB-SLAM algorithm to track the position of the endoscope without additional tracking elements and provide 3D reconstruction in real time. This extended ORB-SLAM to reconstruct semi-dense maps of soft organs. However, although these two ORB-SLAM methods reduce computational complexity by relying on feature points, the reduction in the amount of information compared to the original image also means that some useful information is lost, which can lead to inaccurate camera localization and visceral surface texture mapping.

Feature point detection is a fundamental and important processing step in Visual Odometry (VO) or VSLAM. Local features for camera pose estimation, such as the Scale Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), and Oriented FAST and Rotated BRIEF (ORB), are hand-crafted and are commonly obtained by calling OpenCV implementations from a third-party function library. However, the feature points extracted by these algorithms are often unevenly distributed, with large amounts of useful data lost, resulting in inaccurate camera positioning and scene mapping [Reference Mahmoud, Collins, Hostettler, Soler, Doignon and Montiel27–Reference Rublee, Rabaud and Konolige29]. Moreover, the surface of the human viscera often has poor texture. Endoscope images often have a small field of view and are commonly taken with lighting changes and specular reflection, as shown in Fig. 1. Weak textures and specular reflections pose challenges to VSLAM [Reference Mahmoud, Collins, Hostettler, Soler, Doignon and Montiel27], making many SfM or SLAM frameworks such as ORB-SLAM3 [Reference Campos, Elvira and Rodriguez30] ineffective in these situations. In this paper, a self-supervised feature extraction method “SuperPoint” [Reference Detone, Malisiewicz and Rabinovich31] and a feature-matching technique “SuperGlue” are applied to address challenges such as illumination changes, weak textures, and specular reflections in the human viscera. Moreover, this approach accelerates Convolutional Neural Network (CNN) computations to enable real-time endoscopic pose estimation and viscera surface map construction.

Figure 1. Frames from “colon_reconstruction_dataset.” (a) Small field of view, (b) specular reflections, (c) lighting changes.

Feature matching is another critical step in feature-based VO or SLAM techniques. This involves finding the same features in two images and establishing correspondences between them to achieve camera pose estimation and map updates. The performance of the feature-matching process directly affects the accuracy and stability of the VO or SLAM system. Chang et al. [Reference Chang, Stoyanov and Davison32] used feature matching to perform heart surface sparse reconstruction through structural propagation. The algorithm obtained parallax data between point pairs to estimate stereo parallax of each frame and motion information between consecutive frames. However, the method obtained a sparse parallax field, and further complex interpolation calculations were required to obtain a denser reconstructed scene of the heart surface. Lin et al. [Reference Lin, Sun and Sanchez33] utilized a vessel-based line-matching approach based on block-matching geometry to avoid pixel-wise matching. However, the application of local characteristics of image features of the viscera can lead to mismatched point pairs and thus incorrect camera location. Direct methods such as DSO [Reference Engel, Koltun and Cremers34] or DSM [Reference Zubizarreta, Aguinaga and Montiel35] and hybrid methods such as SVO [Reference Forster, Pizzoli and Scaramuzza36] assume that ambient illumination remains constant, which is difficult to ensure due to severe illumination variations of endoscopic images. The Self-Supervised Learning (SSL) approach can match images by using image content itself as supervision, without requiring explicit labels or annotations. SSL methods have shown promising performance in image-matching tasks such as stereo matching and optical flow estimation of real-life scenarios and have enhanced robustness to local illumination changes [Reference Sarlin, Detone and Malisiewicz37]. However, the performance of SSL in endoscopic image matching is unknown and remains to be studied. 
This paper proposes an improved SSL method with adaptive deep learning to address data association between endoscopic images.

This paper introduces SPSVO, a self-supervised surgical perception stereo visual odometer for endoscope pose (position and rotation) estimation and scene reconstruction. The proposed method overcomes adverse effects of endoscopic images on feature extraction and tracking, such as irregular illumination, poor surface texture, low contrast, and extensive similar areas. The main contributions of this paper are as follows:

• A VO system is proposed that integrates a CNN-based SuperPoint feature extraction method and a SuperGlue feature-matching network. The SPSVO system extracts richer feature points than common hand-crafted local feature detection methods, such as ORB, SIFT, and SURF.
• An image illumination pre-processing technique is proposed to address mirror reflection and illumination variations of endoscopic images.
• The SPSVO system includes image pre-processing, feature extraction, stereo matching, feature tracking, keyframe selection, and pose graph optimization.
• The performance of the proposed system is evaluated based on a public dataset: “colon_reconstruction_dataset” [Reference Zhang, Zhao and Huang38]. Results indicate that the proposed system outperforms ORB-SLAM2 [Reference Mur-Artal and Tardós39] and ORB_SLAM2_Endoscopy [40] methods in feature detection and tracking. ORB-SLAM2 cannot extract sufficient feature points to initialize the scene map of viscera and thus results in loss of the endoscope track.
• The proposed system is capable of accurate and rapid operation within the human viscera; the computation speed of the SPSVO system is as fast as 71 ms per frame, enabling real-time surgical navigation.

The rest of this paper is organized as follows: Section 2 presents related work on endoscopic VSLAM methods. Section 3 presents the proposed SPSVO system. Section 4 presents experimental results and analysis. Finally, conclusions are drawn in Section 5.

2. Related work

2.1. VSLAM and VO for endoscopy

VSLAM is a technique that uses camera vision for simultaneous robot self-locating and scene map construction [Reference Durrant-Whyte and Bailey41]. It enables autonomous robot exploration in unknown or partially unknown environments. The architecture of a classical VSLAM system typically includes a front-end visual odometer, backend optimization, loop closure detection, and finally mapping, as shown in Fig. 2.

Figure 2. Classic VSLAM framework.

VSLAM has the potential to estimate the relative pose of the endoscope camera and construct a viscera surface texture map, which is important for lesion localization and surgical navigation. However, complicated intraoperative scenes (e.g., deformable targets, surface texture, sparsity of visual features, viscera specular reflection, etc.) and strict accuracy requirements have posed challenges to the application of VSLAM to minimally invasive surgery. Recently, Lamarca et al. [Reference Bartoli, Montiel and Lamarca42] proposed a monocular non-rigid SLAM method that combines shape from template (SfT) and non-rigid structure from motion (NRSfM) methods for non-rigid environment scene map construction. However, this method is susceptible to variations in illumination and does not perform well under poor visual texture conditions, rendering it unsuitable for reconstruction of viscera with non-isometric deformations. Later, Gong et al. [Reference Gong, Chen and Li43] constructed an online tracking and relocation framework which employs a rotation invariant Haar-like descriptor and a simplified random forest discriminator to select and track the target region for gastrointestinal biopsy images. Song [Reference Song, Wang and Zhao44] constructed a real-time SLAM system to address scope-tracking problems through an improved localization technique. Much work has focused on adapting VSLAM to enable application to an endoscopic scene, addressing problems such as poor texture [Reference Wei, Feng and Li45, Reference Song, Zhu and Lin46], narrow field of view [Reference Seshamani, Lau and Hager11], and specular reflections [Reference Wei, Yang and Shi47]. Still, the variable illumination problem remains unaddressed. Intraoperative scenarios require accurate camera localization; complex viscera images can lead to mismatched point pairs and thus incorrect camera location. 
Data association also remains a challenging problem for VSLAM systems in MIS scenarios [Reference Yadav and Kala48]. This paper focuses on addressing the problems of variable illumination and data association for intraoperative scenes.

2.2. SLAM based on SuperPoint and SuperGlue

CNNs have made outstanding achievements in computer vision to aid lesion diagnosis or intraoperative scene reconstruction [Reference Yadav and Kala48–Reference Li, Shi and Long52]. Researchers have studied and improved many aspects of VSLAM with learning-based feature extraction techniques to address variable illumination and poor visceral surface texture in complex surgical scenarios [Reference Liu, Zheng and Killeen53, Reference Liu, Li and Ishii54]. Bruno et al. [Reference Bruno H.M. and Colombini49] presented a novel hybrid VSLAM algorithm based on a Learned Invariant Feature Transform network to perform feature extraction in a traditional backend based on an ORB-SLAM system. Li et al. [Reference Li, Shi and Long52] attempted to use an end-to-end deep CNN in VSLAM to extract local descriptors and global descriptors from endoscopic images for pose estimation. Schmidt et al. [Reference Schmidt and Salcudean50] proposed Real-Time Rotated descriptor (ReTRo), which was more effective than classical descriptors and allowed for the development of surgical tracking and mapping frameworks. However, the aforementioned methods are based on traditional Fast Library for Approximate Nearest Neighbors (FLANN) techniques to track keypoints and match extracted features. FLANN does not perform well at feature point matching of high-similarity images, resulting in mismatches between extracted new features and potential features. Its performance is even worse under variable illumination; therefore, FLANN is not always applicable for MIS [Reference Muja and Lowe55].

This paper proposes to apply a SuperPoint approach for keypoint detection and to utilize the SuperGlue technique to deal with complex data associations in intraoperative scenes. SuperPoint [Reference Detone, Malisiewicz and Rabinovich31] is a self-supervised framework for detecting features and describing points of interest, while SuperGlue [Reference Sarlin, Detone and Malisiewicz37] is a network that can simultaneously filter outliers and match features. Recently, researchers have studied the effectiveness of SuperPoint and SuperGlue in VSLAM systems for MIS [Reference Barbed, Chadebecq and Morlana56, Reference Sarlin, Cadena and Siegwart57]. Barbed et al. [Reference Barbed, Chadebecq and Morlana56] demonstrated that SuperPoint delivers better feature detection in VSLAM than hand-crafted local features. Oliva Maza et al. [Reference Oliva Maza, Steidle and Klodmann58] applied SuperPoint to a monocular VSLAM system to estimate the pose of the ureteroscope tip. Sarlin et al. [Reference Sarlin, Cadena and Siegwart57] proposed a Hierarchical Feature Network (HF-Net) algorithm based on SuperPoint and SuperGlue to predict local features and global descriptors for 6-DoF localization of the camera. However, existing algorithms require substantial computing power to run in real time, which presents a significant obstacle to building maps in real time. In this work, an SPSVO algorithm is proposed that accelerates the CNN to realize real-time endoscopic pose estimation and viscera surface map construction.

3. Proposed SPSVO approach

3.1. System overview

The proposed SPSVO approach consists of four main modules: feature extraction, stereo matching, keyframe selection, and pose graph optimization, as shown in Fig. 3. The SPSVO can perform feature matching and keypoint tracking between stereo images and between images in different frames, and can avoid incorrect data associations by using the matching results of relevant key points. For real-time performance, the SPSVO performs feature tracking on images from the left camera only to reduce computation time. The Nvidia TensorRT Toolkit is used to accelerate feature extraction and matching. On the backend, the SPSVO uses a traditional pose graph optimization framework for map construction. The above modules are designed to enable real-time application of the SPSVO within the human intestinal tract and to achieve accurate tracking by combining the efficiency of traditional optimization methods and the robustness of learning-based techniques.

Figure 3. Structure of the SPSVO system.

3.2. Image pre-processing

For image pre-processing, the SPSVO uses Contrast-Limited Adaptive Histogram Equalization (CLAHE) [Reference Zuiderveld59] to enhance contrast, brightness, details, and texture of the input image. Due to severe variability in illumination in optical colonoscopy, some parts of the image are overexposed in the L channel of the color space, resulting in specular reflections, while some images are underexposed, leading to dark areas. In this work, pixels with a luminance greater than 50 are marked as reflective regions, and pixel values in the reflective region are set to the average of surrounding pixels. Possible noise is eliminated by a morphological closing operation. Performance of the CLAHE is demonstrated in Fig. 4. The CLAHE step effectively improves the uniformity of illumination and the contrast of endoscopic images. Because the endoscope light source is close to the inner wall of the organ and the endoscope moves rapidly, specular reflections are frequent; this pre-processing step reduces the mismatches they cause.

Figure 4. Image pre-processing. (a) Original image and (b) image after using CLAHE.
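The reflective-region step described above can be illustrated with a minimal sketch. It assumes the luminance channel is given as a list of rows of numbers, and the function name `suppress_specular` and the `radius` parameter are hypothetical, not taken from the SPSVO implementation. The CLAHE and morphological closing steps are omitted here; in practice they could be delegated to a library such as OpenCV (`cv2.createCLAHE`, `cv2.morphologyEx`).

```python
def suppress_specular(lum, threshold, radius=1):
    """Mark pixels brighter than `threshold` as reflective and replace each
    one with the mean of the non-reflective pixels in its neighbourhood.

    lum: luminance image as a list of rows (hypothetical input format).
    Returns (filled_image, mask); the input image is not modified.
    """
    h, w = len(lum), len(lum[0])
    mask = [[lum[y][x] > threshold for x in range(w)] for y in range(h)]
    out = [row[:] for row in lum]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            # Collect non-reflective neighbours inside the window.
            neighbours = [
                lum[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
                if not mask[j][i]
            ]
            # If every neighbour is reflective, the pixel is left unchanged.
            if neighbours:
                out[y][x] = sum(neighbours) / len(neighbours)
    return out, mask


# Toy 3x3 luminance patch with one bright spot in the centre.
patch = [[10, 10, 10], [10, 90, 10], [10, 10, 10]]
filled, reflective = suppress_specular(patch, threshold=50)
print(filled[1][1])  # 10.0
```

The threshold is passed as a parameter because its meaning depends on how luminance is encoded. Large reflective regions would need a larger `radius` or several passes, which this sketch does not attempt.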

3.3. Proposed SuperPoint model

The SuperPoint network consists of four parts: encoding network, feature point detection network, descriptor detection network, and loss function. The encoder network converts the input image into a high-dimensional tensor representation for the decoder, making it easier to detect and describe key points. The feature point detection network is a decoding structure that calculates keypoint probability for each pixel and embeds a sub-pixel convolution algorithm to reduce computational effort. The descriptor detection network is also a decoding structure that extracts semi-dense descriptors first, performs a bicubic interpolation algorithm to obtain full descriptors, and uses L2-normalization to obtain unit-length descriptors. The loss function is a measure of the difference between the network output and ground truth label, guiding the network to optimize and improve its performance in detecting and describing key points of the input image. This provides better performance for related applications such as VSLAM, 3D reconstruction, and autonomous navigation. The SuperPoint network is trained in PyTorch. The input of the SuperPoint network is a single image I with $I\in R^{H\times W}$ , where $H$ is the height and $W$ is the width of the image, in pixels. The output of the network is positions of key points extracted in each image and their corresponding descriptors.
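As an illustration of the feature point decoder, the sketch below turns per-cell scores into a full-resolution keypoint probability map. It follows the layout described in the original SuperPoint paper as an assumption: each cell of `cell × cell` pixels has `cell*cell + 1` scores, the last being a "no keypoint" bin, and the remaining scores are laid out row by row within the cell. The function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def decode_heatmap(logits, cell=8):
    """Convert coarse detector output into a per-pixel probability map.

    logits[r][c] is a list of cell*cell + 1 scores for coarse cell (r, c).
    Assumed layout: the last score is the "no keypoint" bin and the others
    are ordered row by row inside the cell (depth-to-space).
    """
    rows, cols = len(logits), len(logits[0])
    heat = [[0.0] * (cols * cell) for _ in range(rows * cell)]
    for r in range(rows):
        for c in range(cols):
            probs = softmax(logits[r][c])[:-1]  # drop the "no keypoint" bin
            for k, p in enumerate(probs):
                dy, dx = divmod(k, cell)
                heat[r * cell + dy][c * cell + dx] = p
    return heat


# One coarse cell of 2x2 pixels with a strong response at its top-left pixel.
heat = decode_heatmap([[[5.0, 0.0, 0.0, 0.0, 0.0]]], cell=2)
print(len(heat), len(heat[0]))  # 2 2
```

Keypoints would then be taken as the pixels whose probability exceeds a detection threshold, typically after non-maximum suppression, which is not shown here.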

Based on Barbed [Reference Barbed, Chadebecq and Morlana56], the loss function can be expressed as

(1)

\begin{align} L_{SP}\left(X,X',D,D'\mathit{;}\,Y,Y',S\right)=L_{P}\left(X,Y\right)+L_{P}\left(X',Y'\right)+\lambda L_{d}\left(D,D',S\right) \end{align}

where $X$ and $X'$ are the outputs of the raw detection head for image $I$ and warped image $I'$, respectively, and $Y$ and $Y'$ are the associated detection pseudo-labels; $D$ and $D'$ are the outputs of the raw description head. $S\in R^{H/8\times W/8\times H/8\times W/8}$ is the correspondence tensor induced by the homography between $I$ and $I'$. $L_{P}$ is the feature point detection loss, which measures the difference between the detection outputs and the pseudo-labels. $L_{d}$ is the descriptor loss; $\lambda$ is a weight parameter used to balance $L_{P}$ and $L_{d}$.

As shown in Fig. 1(b), an endoscopic image generally contains multiple specular reflection areas (white spot areas). Most existing feature detection methods tend to detect many feature points around contour areas or specular reflection areas [Reference Barbed, Chadebecq and Morlana56]. For VSLAM, the more evenly the feature points are distributed in the image, the more accurately feature matching can estimate the spatial pose relation. To make the feature points extracted by SuperPoint evenly distributed in the region of interest, the specularity loss ($L_{S}$), which re-weights all extracted key points in specular regions, is proposed. The revised loss function is defined as

(2)

\begin{align} L_{ESP}\left(I,I',X,X',D,D';\,Y,Y',S\right)=L_{SP}\left(\ldots \right)+\lambda _{S}L_{S}\left(X,I\right)+\lambda _{S}L_{S}\left(X',I'\right), \end{align}

in which $\lambda _{S}$ is a scale weighting factor determined by the characteristics of the dataset and the contribution of each objective function to model performance. In this work, $\lambda _{S}=100$ . The specularity loss $L_{S}$ is defined as

(3)

\begin{align} L_{S}\left(X,I\right)=\frac{\sum _{h,w=1}^{H,W}\left[m\left(I\right)_{hw}\cdot d2s\left(\textit{softmax}\left(X\right)\right)_{hw}\right]}{\varepsilon +\sum _{h,w=1}^{H,W}m\left(I\right)_{hw}}, \end{align}

where softmax() is the channel-wise SoftMax function and d2s() is the depth-to-space operation. The $\varepsilon$ is a constant with $\varepsilon =10^{-10}$ [Reference Detone, Malisiewicz and Rabinovich31, Reference Barbed, Chadebecq and Morlana56]. The $m(I)_{hw}$ is a weighting mask, where $m(I)_{hw}\gt 0$ for pixels near a specularity and 0 otherwise. The value of $L_{S}$ is close to zero when no key points are detected near specularities.
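Equation (3) is a mask-weighted average of the keypoint probability map, which the following sketch makes concrete. It assumes the probability map has already been decoded to full resolution and that the mask has the same shape; the function name `specularity_loss` is hypothetical.

```python
def specularity_loss(heat, mask, eps=1e-10):
    """Mask-weighted mean of keypoint probabilities, as in Eq. (3).

    heat: per-pixel keypoint probabilities (list of rows).
    mask: weights m(I)_hw, positive near specularities and 0 elsewhere.
    eps:  small constant that avoids division by zero for an empty mask.
    """
    numerator = sum(
        m * p
        for mask_row, heat_row in zip(mask, heat)
        for m, p in zip(mask_row, heat_row)
    )
    denominator = eps + sum(m for mask_row in mask for m in mask_row)
    return numerator / denominator


heat = [[0.9, 0.0], [0.0, 0.1]]
on_specularity = [[1, 0], [0, 0]]   # strong detection sits on a specular pixel
off_specularity = [[0, 1], [0, 0]]  # specular pixel has no detection

print(specularity_loss(heat, on_specularity) > specularity_loss(heat, off_specularity))  # True
```

The loss is large only when high keypoint probabilities coincide with masked pixels, so minimizing it pushes detections away from specular regions without affecting the rest of the image.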

The default thresholds of the parameters of ORB-SLAM2 and SPSVO are determined based on [Reference Campos, Elvira and Rodriguez30, Reference Xu, Hao and Wang51], as shown in Table 1. The algorithms were first run with the default thresholds and then calibrated by increasing or decreasing the thresholds and comparing the results with the ground truth values. In this work, variations of ±40% were made with respect to the default thresholds. Figure 5 compares the number of keypoints matched per keyframe with a feature point threshold of 1600; the proposed SPSVO outperforms ORB-SLAM2 in terms of matched feature points (approximately 700 points versus 500 points).

Figure 5. Comparison of number of keypoints matched per keyframe with feature points threshold of 1600.

Table 1. Parameters used in ORB-SLAM2 and SPSVO. Default parameters are indicated in italics, and the optimal parameters are indicated in bold. Parameters A, B, C, and D all use default values in order to follow the principle of controlling variables. A: Scale factor between levels in ORB scale pyramid. B: Number of levels in ORB scale pyramid. C: Initial response threshold of FAST detector. D: Minimum response threshold of FAST detector.

Comparison of the distribution of feature points extracted by the SPSVO algorithm and ORB-SLAM2 on the “colon_reconstruction_dataset” [Reference Zhang, Zhao and Huang38] is shown in Fig. 6; the image resolution is $480\times 640$ . According to the results of Table 1, the upper threshold of feature point extraction is set to 1600 so that both algorithms can reach their best performance in most scenarios. It can be seen that the SPSVO extracts more effective features than ORB-SLAM2. A large number of evenly distributed feature points provides more scene information, thus improving the accuracy of camera localization. Furthermore, the feature points extracted by SPSVO are located in textured areas, which is beneficial for subsequent VSLAM tasks such as keypoint matching, camera localization, map construction, and path planning.

Figure 6. Comparison of feature extraction results of the ORB-SLAM2 and SPSVO methods. (a) ORB-SLAM2 and (b) SPSVO.

3.4. Feature matching

The SuperGlue algorithm is commonly applied to simultaneously address feature matching and outlier filtering for real-time pose estimation in indoor and outdoor environments [Reference Jang, Yoon and Kim60–Reference Su and Yu62]. To adapt to the intra-abdominal environment, SuperGlue needs to be trained on ground-truth trajectories recorded in the abdominal cavity. A bi-directional brute-force matching algorithm is used to establish correspondences between features in consecutive frames of an image sequence. Additionally, SPSVO uses the Random Sample Consensus (RANSAC) algorithm to remove false matches for robust geometric estimation; see Algorithm 1. Figure 7 shows stereo matching results of the proposed algorithm, with successfully matched feature pairs connected by lines. SPSVO accurately matches a large number of keypoints. Moreover, its matching is consistent between frames: a feature point can be matched across multiple frames. This consistency indicates that the proposed SPSVO can effectively estimate the camera position.
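The bi-directional brute-force check mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it covers only the mutual nearest-neighbour test on descriptor vectors, not the SuperGlue network or the RANSAC step, and the function name `mutual_nn_match` and the use of squared Euclidean distance are assumptions made for illustration.

```python
import numpy as np


def mutual_nn_match(desc_a: np.ndarray, desc_b: np.ndarray) -> np.ndarray:
    """Return (M, 2) index pairs (i, j) that are each other's nearest neighbour.

    desc_a: (Na, D) descriptors from frame A; desc_b: (Nb, D) from frame B.
    Hypothetical helper; the distance metric is an assumption.
    """
    # Pairwise squared Euclidean distances, shape (Na, Nb).
    diff = desc_a[:, None, :] - desc_b[None, :, :]
    dist = np.sum(diff * diff, axis=2)

    best_b_for_a = np.argmin(dist, axis=1)  # forward pass: A -> B
    best_a_for_b = np.argmin(dist, axis=0)  # backward pass: B -> A

    # Keep feature i only if its best match j also selects i as its best match.
    idx_a = np.arange(desc_a.shape[0])
    keep = best_a_for_b[best_b_for_a] == idx_a
    return np.stack([idx_a[keep], best_b_for_a[keep]], axis=1)
```

A feature that has no mutual partner is simply dropped, which is why the check acts as a first, cheap outlier filter before any geometric verification.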

Algorithm 1. SuperGlue Stereo Matching

Figure 7. Comparison of feature matching. (a) ORB-SLAM2 and (b) SPSVO.

3.5. Keyframe selection

Keyframe selection plays an important role in reducing computational cost, decreasing redundant information, and improving accuracy of VSLAM [Reference Klein and Murray22, Reference Mur-Artal, Montiel J.M. and Tardos25, Reference Strasdat, Montiel and Davison63]. The general criteria for keyframe selection are (1) distribution of the keyframes should not be too dense or too sparse; (2) the number of keyframes should generate sufficient local map points [Reference Liu, Li and Ishii54]. Unlike other SLAM or VO systems, SPSVO integrates a learning-based matching method that can effectively match frames with large differences in baseline length. Therefore, during feature-matching SPSVO only matches the current frame with keyframes, which can reduce tracking error. The keyframe selection criteria should take into account the movement between frames, information gain, tracking stability, and previous experience. Based on the key frame selection principle [Reference Campos, Elvira and Rodriguez30, Reference Xu, Hao and Wang51], the keyframe selection criteria corresponding to the matching process of SPSVO are defined as:

• The distance between the current frame and the nearest keyframe ( $L$ ) satisfies the condition of $L\gt D_{f}$ ;
• The angle between the current frame and the nearest keyframe ( $\theta$ ) satisfies the condition of $\theta \gt \theta _{f}$ ;
• The number of map points ( $N_{A}$ ) tracked by the current frame satisfies the condition $N_{1}^{u}\lt N_{A}\lt N_{2}^{l}$ ;
• The number of the map points ( $N_{B}$ ) tracked by the current frame satisfies the condition $N_{B}\lt N_{3}$ ;
• The number of frames since the last keyframe was inserted ( $N_{C}$ ) satisfies the condition of $N_{C}\gt N_{4}$ .

where $D_{f},\theta _{f},N_{1}^{u},N_{2}^{l},N_{3},N_{4}$ are preset thresholds. A frame is selected as a keyframe if it meets any of the above conditions; see Algorithm 2. The proposed keyframe selection criteria consider both image quality and keypoint quality. They help to filter out useless or incorrect information and to avoid adverse impacts on endoscope localization and scene mapping.
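The five criteria above combine with a logical OR, which can be sketched as follows. This is an illustrative reading of the list, not the content of Algorithm 2: the class `KeyframeThresholds`, the function `is_keyframe`, and all argument names are hypothetical, and no threshold values are taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class KeyframeThresholds:
    """Preset thresholds; field names are hypothetical stand-ins for the symbols."""
    d_f: float      # D_f: distance threshold to the nearest keyframe
    theta_f: float  # theta_f: angle threshold to the nearest keyframe
    n1_u: int       # N_1^u: bound used in the third criterion
    n2_l: int       # N_2^l: bound used in the third criterion
    n3: int         # N_3: bound on tracked map points (fourth criterion)
    n4: int         # N_4: frames allowed since the last keyframe


def is_keyframe(dist: float, angle: float, n_a: int, n_b: int, n_c: int,
                th: KeyframeThresholds) -> bool:
    """Return True if the current frame meets ANY of the five listed criteria."""
    return (
        dist > th.d_f                  # L > D_f
        or angle > th.theta_f          # theta > theta_f
        or th.n1_u < n_a < th.n2_l     # N_1^u < N_A < N_2^l
        or n_b < th.n3                 # N_B < N_3
        or n_c > th.n4                 # N_C > N_4
    )
```

Because any single condition is sufficient, a frame with stable tracking can still become a keyframe purely because the camera has moved far enough or enough frames have elapsed.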

Algorithm 2. The keyframe selection

3.6. Pose graph optimization

The Levenberg–Marquardt (LM) algorithm is used as the optimization solver in the backend of the proposed SPSVO to construct the Covisibility Graph. In each optimization loop, once the LM optimization converges, both the inputs and outputs of the optimization process are used as inputs to the loss function for training the decoding network. The optimization variables are the keyframes and map points, and the corresponding constraints are the monocular and stereo constraints.

3.6.1 The monocular constraint

If a 3D map point ${}^{w}P_{i}$ is observed by the left camera, the reprojection error $e_{k,i}$ of the i-th point in the k-th frame is defined as

(4)

\begin{align} e_{k,i}=\hat{p}_{i}-\pi _{i}\left({}^{c}_{w}R\,{}^{w}P_{i}+{}^{c}_{w}t\right), \end{align}

where ${}^{w}P_{i}$ is the i-th point observed by frame k, w denotes the world coordinate system, and c denotes the camera coordinate system. ${}^{c}_{w}R$ and ${}^{c}_{w}t$ are the rotation and translation of the camera, which map world coordinates into the camera frame. $\hat{p}_{i}=(\hat{u}_{i},\hat{v}_{i})$ is the observation of the map point in the frame, and $\pi _{i}(\cdot )$ is the camera projection model giving the coordinates of the 3D map point projected onto the left image, expressed as

(5)

\begin{align} \pi _{i}\left(\left[\begin{array}{l} x_{i}\\ y_{i}\\ z_{i} \end{array}\right]\right)=\left[\begin{array}{l} f_{x}\frac{x_{i}}{z_{i}}+c_{x}\\ f_{y}\frac{y_{i}}{z_{i}}+c_{y} \end{array}\right], \end{align}

where $[\begin{array}{lll} x_{i} & y_{i} & z_{i} \end{array}]^{\mathrm{T}}$ are the coordinates of point ${}^{w}P_{i}$ after transformation into the camera frame, that is, ${}^{c}_{w}R\,{}^{w}P_{i}+{}^{c}_{w}t$, and $f_{x},f_{y},c_{x},c_{y}$ are the intrinsic parameters of the camera.
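Equations (4) and (5) can be checked numerically with a short NumPy sketch. The function names and the intrinsic values in the usage note are hypothetical, and the sketch assumes that R and t map world coordinates into the camera frame, as the argument of the projection in Eq. (4) implies.

```python
import numpy as np


def project_mono(p_cam: np.ndarray, fx: float, fy: float,
                 cx: float, cy: float) -> np.ndarray:
    """Pinhole projection of a camera-frame point (Eq. 5)."""
    x, y, z = p_cam
    return np.array([fx * x / z + cx, fy * y / z + cy])


def reprojection_error_mono(p_obs, R_cw: np.ndarray, t_cw: np.ndarray,
                            P_w: np.ndarray, fx: float, fy: float,
                            cx: float, cy: float) -> np.ndarray:
    """Monocular reprojection error e = p_obs - pi(R * P_w + t) (Eq. 4)."""
    p_cam = R_cw @ P_w + t_cw  # world point expressed in the camera frame
    return np.asarray(p_obs, dtype=float) - project_mono(p_cam, fx, fy, cx, cy)
```

For example, with an identity pose, a point at (0.1, -0.2, 2.0), focal lengths of 500 pixels, and a principal point at (320, 240), the projection is (345, 190), so an observation at (346, 189) gives an error of (1, -1) pixels.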

3.6.2 The stereo constraint

If a 3D map point ${}^{w}P_{j}$ is observed by both the left and right cameras at the same time, the reprojection error is defined as

(6)

\begin{align} e_{k,j}=\hat{p}_{j}-\pi _{j}\left({}^{c}_{w}R\,{}^{w}P_{j}+{}^{c}_{w}t\right), \end{align}

where $\hat{p}_{j}=(\hat{u}_{j},\hat{v}_{j},\hat{r}_{j})$ is the stereo observation of the map point in the k-th frame, and $\hat{r}_{j}$ is its horizontal coordinate in the right image. $\pi _{j}(\cdot )$ is the camera projection model representing the projection of the 3D map point onto the stereo image pair and is defined as

(7)

\begin{align} \pi _{j}\left(\left[\begin{array}{l} x_{j}\\ y_{j}\\ z_{j} \end{array}\right]\right)=\left[\begin{array}{l} f_{x}\frac{x_{j}}{z_{j}}+c_{x}\\ f_{y}\frac{y_{j}}{z_{j}}+c_{y}\\ f_{x}\frac{x_{j}-b}{z_{j}}+c_{x} \end{array}\right], \end{align}

where b represents the baseline of the stereo camera, and $[\begin{array}{lll} x_{j} & y_{j} & z_{j} \end{array}]^{\mathrm{T}}$ are the coordinates of point ${}^{w}P_{j}$ after transformation into the camera frame.
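The stereo projection of Eq. (7) extends the monocular model with a third row for the right-image column. The sketch below is illustrative only; the function names are hypothetical, and the point is assumed to be already expressed in the left camera frame when passed to `project_stereo`.

```python
import numpy as np


def project_stereo(p_cam: np.ndarray, fx: float, fy: float,
                   cx: float, cy: float, b: float) -> np.ndarray:
    """Stereo projection (Eq. 7): left-image (u, v) plus right-image column r."""
    x, y, z = p_cam
    u = fx * x / z + cx
    v = fy * y / z + cy
    r = fx * (x - b) / z + cx  # same point seen from a camera shifted by baseline b
    return np.array([u, v, r])


def reprojection_error_stereo(p_obs, R_cw: np.ndarray, t_cw: np.ndarray,
                              P_w: np.ndarray, fx: float, fy: float,
                              cx: float, cy: float, b: float) -> np.ndarray:
    """Stereo reprojection error e = p_obs - pi(R * P_w + t) (Eq. 6)."""
    p_cam = R_cw @ P_w + t_cw
    return np.asarray(p_obs, dtype=float) - project_stereo(p_cam, fx, fy, cx, cy, b)
```

Subtracting the third row from the first gives the disparity u - r = f_x b / z, so the extra residual component constrains the depth of the point, which the monocular error alone cannot do.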

3.6.3 Graph optimization

Assuming that the keypoint observations follow a Gaussian distribution [Reference Szeliski64], the final cost function of the proposed SPSVO can be defined as

(8)

\begin{align} J=\sum _{k,i}\rho _{k,i}\left(\left(e_{k,i}\right)^{\mathrm{T}}\left(\Sigma _{k,i}\right)^{-1}\left(e_{k,i}\right)\right)+\sum _{k,j}\rho _{k,j}\left(\left(e_{k,j}\right)^{\mathrm{T}}\left(\Sigma _{k,j}\right)^{-1}\left(e_{k,j}\right)\right), \end{align}

where $\rho _{k,i}$ and $\rho _{k,j}$ are robust kernel functions that reduce the impact of possible outliers. $(e_{k,i})^{\mathrm{T}}$ and $(e_{k,j})^{\mathrm{T}}$ are the transposes of the error vectors $e_{k,i}$ and $e_{k,j}$, respectively. $\Sigma _{k,i}$ and $\Sigma _{k,j}$ are the covariance matrices of the observations, and $(\Sigma _{k,i})^{-1}$ and $(\Sigma _{k,j})^{-1}$ are their inverses.
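The structure of the cost in Eq. (8) can be sketched as a sum of robustified squared Mahalanobis distances. The text does not state which robust kernel is used, so the Huber kernel below is an assumption chosen only for illustration; the function names and the `delta` parameter are likewise hypothetical.

```python
import numpy as np


def huber(s: float, delta: float) -> float:
    """Huber kernel on a squared error s = e^T Sigma^-1 e (assumed kernel)."""
    if s <= delta * delta:
        return s                                   # quadratic region: inliers
    return 2.0 * delta * np.sqrt(s) - delta * delta  # linear growth: outliers


def total_cost(errors, covariances, delta: float) -> float:
    """Sum rho(e^T Sigma^-1 e) over all residuals, monocular (2D) or stereo (3D)."""
    cost = 0.0
    for e, sigma in zip(errors, covariances):
        e = np.asarray(e, dtype=float)
        # Solve Sigma * x = e instead of forming the inverse explicitly.
        s = float(e @ np.linalg.solve(sigma, e))
        cost += huber(s, delta)
    return cost
```

With identity covariances and delta = 2, a residual of (1, 0) contributes 1, while a residual of (3, 4, 0) contributes 16 instead of its squared norm of 25, which shows how the kernel limits the influence of large errors.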

4. Experimental validation of the proposed SPSVO method

In this section, the performance of the proposed SPSVO is evaluated based on the “colon_reconstruction_dataset” [Reference Zhang, Zhao and Huang38] and compared with ORB-SLAM2. SPSVO is a stereo VO system without loop closure detection module. Furthermore, the colon_reconstruction_dataset does not involve scene re-identification or map closure situations, so the impact of loop closure detection module on algorithm comparison is very limited. Therefore, to ensure fair and accurate comparison, loop closure detection is turned off in ORB-SLAM2. Frame threshold is defined as the number of times a map point is observed by a keyframe for monocular and stereo constraints in graph optimization.

4.1. Dataset

The “colon_reconstruction_dataset” contains 16 stereo colonoscope sequences (named Case 0 to Case 15, with 17,362 frames in total) with corresponding depth and ego-motion ground truth.

4.2. Implementation details

The proposed SPSVO algorithm runs in a C++ environment on a laptop with an i7-10750H CPU and an NVIDIA GTX1650Ti GPU. SPSVO uses the Nvidia TensorRT Toolkit to accelerate the feature extraction and matching networks and uses the LM algorithm of the g2o library for nonlinear least-squares optimization. OpenCV and the Ceres library are applied to implement computer vision functions and statistical estimation, respectively.

4.3. Results on the colon reconstruction dataset

The performances of the ORB-SLAM2 and SPSVO were tested with the “colon_reconstruction_dataset”; however, the ORB-SLAM2 could only successfully obtain the endoscope trajectories of “Case 0,” and results are shown in Figs. 8, 9 and Table 2. The data sequences of “Case 0” contain 4751 frames of images for each left and right camera and have slower camera motion speed and smaller translation and rotation amplitude compared to “Case 1” to “Case 10.” It can be observed from Fig. 8 that ORB-SLAM2 has larger drift error compared to the proposed SPSVO method.

Figure 8. Comparison between the endoscope trajectories estimated by ORB-SLAM2 and SPSVO, and the true trajectories. (a) ORB-SLAM2 and (b) SPSVO.

Figure 9. Variation of the absolute pose error (APE) between the estimated trajectories of ORB-SLAM2 and SPSVO and the true trajectories. (a) ORB-SLAM2 and (b) SPSVO.

Comparisons between estimated trajectories and true trajectories of the endoscope are shown in Fig. 10. Colored solid lines represent trajectories estimated by SPSVO. Gray dotted lines represent the real motion trajectories of the endoscope from the “colon_reconstruction_dataset” [Reference Zhang, Zhao and Huang38]. Statistics for SPSVO are shown in Table 3. The average measurement error of SPSVO for the 10 cases is between 0.058 and 0.740 mm, with the RMSE between 0.278 and 0.690 mm. This indicates that the proposed SPSVO method can accurately track the true trajectory of the endoscope. Figure 11 shows the variation of the absolute pose error between estimated and true trajectories with respect to time. It can be observed that the proposed SPSVO method has high accuracy and reliability for endoscope trajectory estimation. ORB-SLAM2 cannot extract enough feature points to initialize the viscera scene map, resulting in a loss of feature tracking and failure to construct endoscopic trajectories. Therefore, quantitative results for ORB-SLAM2 on Case 1 to Case 10 are not presented.

Table II. Statistical error analysis of estimated trajectories of ORB-SLAM2 and SPSVO on Case0 sequence (unit: mm).

Figure 10. Comparison between SPSVO-estimated and true trajectories of the endoscope. (a) Case 1, (b) Case 2, (c) Case 3, (d) Case 4, (e) Case 5, (f) Case 6, (g) Case 7, (h) Case 8, (i) Case 9, and (j) Case 10.

Figure 11. Variation of the absolute pose error (APE) between SPSVO-estimated and true trajectories. (a) Case 1, (b) Case 2, (c) Case 3, (d) Case 4, (e) Case 5, (f) Case 6, (g) Case 7, (h) Case 8, (i) Case 9, and (j) Case 10.

Table III. Statistical error analysis of SPSVO-estimated trajectories (unit: mm). RMSE is the Root Mean Square Error, STD stands for the Standard Deviation. SSE refers to the Sum of Squared Errors.
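The statistics named in the caption (RMSE, STD, SSE) can be computed from per-frame position errors as in the following sketch. It assumes that the estimated and true trajectories are already time-associated and expressed in the same coordinate frame; the function name `ape_statistics` and the use of the translational error norm are assumptions, not a description of the evaluation tool used in the paper.

```python
import numpy as np


def ape_statistics(est_xyz, gt_xyz) -> dict:
    """Summary statistics of per-frame absolute position error.

    est_xyz, gt_xyz: (N, 3) arrays of matched, aligned positions (assumed).
    """
    est = np.asarray(est_xyz, dtype=float)
    gt = np.asarray(gt_xyz, dtype=float)
    err = np.linalg.norm(est - gt, axis=1)  # Euclidean error per frame
    return {
        "mean": float(err.mean()),
        "rmse": float(np.sqrt(np.mean(err ** 2))),  # Root Mean Square Error
        "std": float(err.std()),                    # population Standard Deviation
        "sse": float(np.sum(err ** 2)),             # Sum of Squared Errors
        "max": float(err.max()),
    }
```

The three quantities are linked by RMSE² = mean² + STD² and SSE = N · RMSE², which gives a quick consistency check on any reported table of this kind.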

4.4. Computational cost

The per-frame computational time of SPSVO and ORB-SLAM2 on the Case 0 sequence of the “colon_reconstruction_dataset” [Reference Zhang, Zhao and Huang38] is shown in Table 4. For a fair comparison, 1000 points were extracted in this experiment, and the loop closure, relocalization, and visualization modules were disabled. Keypoint extraction takes 25 ms for one stereo image, and stereo matching and feature tracking between frames require 29 ms. Pose estimation is fast and costs only 8 ms per image. SPSVO can therefore operate at 14 fps, and this speed can be further boosted by parallel implementation. The proposed SPSVO method processes frames faster than ORB-SLAM2.
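The relation between per-frame time and frame rate can be checked with a few lines of arithmetic. The stage timings are those quoted above and the 71 ms total is the per-frame figure given in the abstract; the dictionary keys and the function name are hypothetical, and the text does not itemize the time that remains after the three listed stages.

```python
def frames_per_second(ms_per_frame: float) -> float:
    """Convert a per-frame processing time in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


# Stage timings quoted in the text (ms); key names are illustrative.
stage_ms = {
    "keypoint_detection": 25.0,
    "matching_and_tracking": 29.0,
    "pose_estimation": 8.0,
}
itemized_ms = sum(stage_ms.values())  # time covered by the three listed stages
total_ms = 71.0                       # per-frame time reported in the abstract

print(f"itemized stages: {itemized_ms:.0f} ms")
print(f"reported total:  {total_ms:.0f} ms -> {frames_per_second(total_ms):.1f} fps")
```

The 14 fps figure corresponds to the 71 ms total rather than to the sum of the three itemized stages.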

Table IV. Computational cost of ORB-SLAM2 and SPSVO on Case 0.

5. Conclusions

An important goal in VSLAM for medical applications is accurate estimation of endoscopic pose to better assist surgeons in locating and diagnosing lesions. Extreme illumination variations and weak texture of endoscopy images make accurate estimation of camera motion and scene reconstruction difficult. This paper proposed a novel self-supervised Surgical Perception Stereo Visual Odometer (SPSVO) framework for real-time endoscopic pose estimation and viscera surface map construction. The proposed SPSVO method reduced adverse effects of local illumination variability and specular reflections by using a self-supervised learning (SSL) approach for feature extraction and matching, as well as image illumination pre-processing. In the proposed SPSVO, keyframe selection strategies and the Nvidia TensorRT Toolkit were applied to accelerate computation for real-time lesion localization and surgical navigation. The estimated endoscope trajectories were compared with the ground truth trajectories from the colon_reconstruction_dataset. From the experimental tests, the following conclusions are drawn:

1. The proposed SPSVO system achieves superior performance in variable illumination environments and can track key points in the human intestine and intraperitoneal cavity. Simulation results show that SPSVO has an average tracking error of 0.058–0.704 mm with respect to true camera trajectories in the given dataset. Comparison with existing methods also indicates that the proposed method outperforms ORB-SLAM2.
2. The proposed SPSVO system combines advantages of traditional optimization and learning-based methods and demonstrates an operating speed of 14 frames per second on a consumer-grade laptop. This is adequate for real-time navigation in surgical procedures.
3. The proposed method can effectively eliminate effects of irregular illumination and specular reflections and can accurately estimate the position of the endoscope.

Acknowledgments

I would first like to thank Dr Qimin Li and Dr Yang Luo, whose expertise was invaluable in formulating the research questions and methodology. Your insightful feedback pushed me to sharpen my thinking and brought my work to a higher level.

Author contribution

Junjie Zhao: Conceptualization, Methodology, Software. Yang Luo: Data curation, Writing – Original draft preparation. Qimin Li: Supervision. Natalie Baddour: Writing – Reviewing and Editing. Md Sulayman Hossen: Writing – Reviewing and Editing.

Financial support

This research is funded by the Open Fund of Guangdong Provincial Key Laboratory of Precision Gear Digital Manufacturing Equipment Technology Enterprises (Grant No. 2021B1212050012-04), with contributions from Zhongshan MLTOR Numerical Control Technology Co., LTD and South China University of Technology, as well as the Innovation Group Science Fund of Chongqing Natural Science Foundation (No. cstc2019jcyj-cxttX0003).

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Ethical approval

Not applicable.

Data availability statement

The “colon_reconstruction_dataset” used in this study can be found at https://github.com/zsustc/colon_reconstruction_dataset.

References

[1] Bernhardt, S., Nicolau, S. A., Soler, L. and Doignon, C., “The status of augmented reality in laparoscopic surgery as of 2016,” Med. Image Anal. 37, 66–90 (2017).

[2] Shao, S., Pei, Z., Chen, W., Zhu, W., Wu, X., Sun, D. and Zhang, B., “Self-supervised monocular depth and ego-motion estimation in endoscopy: Appearance flow to the rescue,” Med. Image Anal. 77, 102338 (2022).

[3] Feuerstein, M., Augmented Reality in Laparoscopic Surgery (VDM Verlag Dr. Müller Aktiengesellschaft & Co. KG, 2007).

[4] Lim, P. K., Stephenson, G. S., Keown, T. W., Byrne, C., Lin, C. C., Marecek, G. S. and Scolaro, J. A., “Use of 3D printed models in resident education for the classification of acetabulum fractures,” J. Surg. Educ. 75(6), 1679–1684 (2018).

[5] Zhang, Z., Xie, Y., Xing, F., McGough, M. and Yang, L., “MDNet: A semantically and visually interpretable medical image diagnosis network,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017) pp. 3549–3557.

[6] Low, C. M., Morris, J. M., Matsumoto, J. S., Stokken, J. K., O’Brien, E. K. and Choby, G., “Use of 3D-printed and 2D-illustrated international frontal sinus anatomy classification anatomic models for resident education,” Otolaryngol. Head Neck Surg. 161(4), 705–713 (2019).

[7] Afifi, A., Takada, C., Yoshimura, Y. and Nakaguchi, T., “Real-time expanded field-of-view for minimally invasive surgery using multi-camera visual simultaneous localization and mapping,” Sensors 21(6), 2106 (2021).

[8] Tatar, F., Mollinger, J. R., Den Dulk, R. C., van Duyl, W. A., Goosen, J. F. L. and Bossche, A., “Ultrasonic Sensor System for Measuring Position and Orientation of Laproscopic Instruments in Minimal Invasive Surgery,” 2nd Annual International IEEE-EMBS Special Topic Conference on Microtechnologies in Medicine and Biology. Proceedings (Cat. No. 02EX578), (2002) pp. 301–304.

[9] Lamata, P., Morvan, T., Reimers, M., Samset, E. and Declerck, J., “Addressing Shading-based Laparoscopic Registration,” World Congress on Medical Physics and Biomedical Engineering, September 7-12, 2009, Munich, Germany: Vol. 25/6 Surgery, Minimal Invasive Interventions, Endoscopy and Image Guided Therapy, (2009) pp. 189–192.

[10] Wu, C.-H., Sun, Y.-N. and Chang, C.-C., “Three-dimensional modeling from endoscopic video using geometric constraints via feature positioning,” IEEE Trans. Biomed. Eng. 54(7), 1199–1211 (2007).

[11] Seshamani, S., Lau, W. and Hager, G., “Real-time endoscopic mosaicking,” Medical Image Computing and Computer-Assisted Intervention-MICCAI 2006: 9th International Conference, Copenhagen, Denmark, October 1-6, 2006. Proceedings, Part I 9, (2006) pp. 355–363.

[12] Thormahlen, T., Broszio, H. and Meier, P. N., “Three-dimensional Endoscopy,” Falk Symposium, (2002), 2002-01.

[13] Koppel, D., Chen, C.-I., Wang, Y.-F., Lee, H., Gu, J., Poirson, A. and Wolters, R., “Toward Automated Model Building from Video in Computer-assisted Diagnoses in Colonoscopy,” Medical Imaging 2007: Visualization and Image-Guided Procedures, (2007) pp. 567–575.

[14] Mirota, D., Wang, H., Taylor, R. H., Ishii, M. and Hager, G. D., “Toward Video-based Navigation for Endoscopic Endonasal Skull Base Surgery,” Medical Image Computing and Computer-Assisted Intervention-MICCAI 2009: 12th International Conference, London, UK, September 20-24, 2009, Proceedings, Part I 12, (2009) pp. 91–99.

[15] Kaufman, A. and Wang, J., “3D surface reconstruction from endoscopic videos,” Math. Visual., 61–74 (2008).

[16] Hong, D., Tavanapong, W., Wong, J., Oh, J. and De Groen, P.-C., “3D reconstruction of colon segments from colonoscopy images,” 2009 Ninth IEEE International Conference on Bioinformatics and BioEngineering, (2009) pp. 53–60.

[17] Jang, J. Y., Han, H.-S., Yoon, Y.-S., Jai, Y. and Choi, Y., “Retrospective comparison of outcomes of laparoscopic and open surgery for T2 gallbladder cancer - thirteen-year experience,” Surg. Oncol. 29, 29–147 (2019).

[18] Wu, H., Zhao, J., Xu, K., Zhang, Y., Xu, R., Wang, A. and Iwahori, Y., “Semantic SLAM based on deep learning in endocavity environment,” Symmetry-Basel 14(3), 614 (2022).

[19] Xie, C., Yao, T., Wang, J. and Liu, Q., “Endoscope localization and gastrointestinal feature map construction based on monocular SLAM technology,” J. Infect Public Health 13(9), 1314–1321 (2020).

[20] Mountney, P., Stoyanov, D., Davison, A. and Yang, G.-Z., “Simultaneous Stereoscope Localization and Soft-tissue Mapping for Minimal Invasive Surgery,” Medical Image Computing and Computer-Assisted Intervention-MICCAI 2006: 9th International Conference, Copenhagen, Denmark, October 1-6, 2006. Proceedings, Part I 9, (2006) pp. 347–354.

[21] Mountney, P. and Yang, G.-Z., “Motion Compensated SLAM for Image Guided Surgery,” Medical Image Computing and Computer-Assisted Intervention-MICCAI 2010: 13th International Conference, Beijing, China, September 20-24, 2010, Proceedings, Part II 13, (2010) pp. 496–504.

[22] Klein, G. and Murray, D., “Parallel Tracking and Mapping for Small AR Workspaces,” 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, (2007) pp. 225–234.

[23] Lin, B., Johnson, A., Qian, X., Sanchez, J. and Sun, Y., “Simultaneous Tracking, 3D Reconstruction and Deforming Point Detection for Stereoscope Guided Surgery,” Augmented Reality Environments for Medical Imaging and Computer-Assisted Interventions: 6th International Workshop, MIAR 2013 and 8th International Workshop, AE-CAI 2013, Held in Conjunction with MICCAI 2013, Nagoya, Japan, September 22, 2013. Proceedings, (2013) pp. 35–44.

[24] Lin, B., Sun, Y., Sanchez, J. E. and Qian, X., “Efficient vessel feature detection for endoscopic image analysis,” IEEE Trans. Biomed. Eng. 62(4), 1141–1150 (2014).

[25] Mur-Artal, R., Montiel, J. M. M. and Tardós, J. D., “ORB-SLAM: A versatile and accurate monocular SLAM system,” IEEE Trans. Robot. 31(5), 1147–1163 (2015).

[26] Mahmoud, N., Cirauqui, I., Hostettler, A., Doignon, C., Soler, L., Marescaux, J. and Montiel, J. M. M., “ORBSLAM-based Endoscope Tracking and 3D Reconstruction,” Computer-Assisted and Robotic Endoscopy: Third International Workshop, CARE 2016, Held in Conjunction with MICCAI 2016, Athens, Greece, October 17, 2016, Revised Selected Papers 3, (2017) pp. 72–83.

[27] Mahmoud, N., Collins, T., Hostettler, A., Soler, L., Doignon, C. and Montiel, J. M. M., “Live tracking and dense reconstruction for handheld monocular endoscopy,” IEEE Trans. Med. Imag. 38(1), 79–89 (2019).

[28] Recasens, D., Lamarca, J., Facil, J. M., Montiel, J. M. M. and Civera, J., “Endo-depth-and-motion: Localization and reconstruction in endoscopic videos using depth networks and photometric constraints,” IEEE Robot. Automat. Lett. 6(4), 7225–7232 (2021).

[29] Rublee, E., Rabaud, V., Konolige, K. and Bradski, G., “ORB: An Efficient Alternative to SIFT or SURF,” IEEE International Conference on Computer Vision, ICCV 2011, (2011).

[30] Campos, C., Elvira, R., Rodriguez, J., Montiel, M. and Tardós, J. D., “ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM,” IEEE Trans. Robot. Publ. IEEE Robot. Automat. Soc 37(6), 1874–1890 (2021).

[31] Detone, D., Malisiewicz, T. and Rabinovich, A., “SuperPoint: Self-supervised interest point detection and description,” (2017), arXiv: 1712.07629.

[32] Chang, P.-L., Stoyanov, D., Davison, A. J. and Edwards, P. E., “Real-time Dense Stereo Reconstruction Using Convex Optimisation with a Cost-volume for Image-guided Robotic Surgery,” Medical Image Computing and Computer-Assisted Intervention-MICCAI 2013: 16th International Conference, Nagoya, Japan, September 22-26, 2013, Proceedings, Part I 16, (2013) pp. 42–49.

[33] Lin, B., Sun, Y., Sanchez, J. and Qian, X., “Vesselness based Feature Extraction for Endoscopic Image Analysis,” 2014 IEEE 11th International Symposium on Biomedical Imaging (ISBI), (2014) pp. 1295–1298.

[34] Engel, J., Koltun, V. and Cremers, D., “Direct sparse odometry,” (2016): arXiv e-prints.

[35] Zubizarreta, J., Aguinaga, I. and Montiel, J., “Direct sparse mapping,” (2019): arXiv:1904.06577.

[36] Forster, C., Pizzoli, M. and Scaramuzza, D., “SVO: Fast Semi-direct Monocular Visual Odometry,” IEEE International Conference on Robotics & Automation, (2014).

[37] Sarlin, P. E., Detone, D., Malisiewicz, T. and Rabinovich, A., “SuperGlue: Learning Feature Matching With Graph Neural Networks,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2020).

[38] Zhang, S., Zhao, L., Huang, S. and Hao, Q., “A template-based 3D reconstruction of colon structures and textures from stereo colonoscopic images,” IEEE Trans. Med. Robot. Bionics 3(1), 85–95 (2021).

[39] Mur-Artal, R. and Tardós, J., “ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Trans. Robot 33(5), 1255–1262 (2017).

[40] https://github.com/UZ-SLAMLab/ORB_SLAM2_Endoscopy.

[41] Durrant-Whyte, H. and Bailey, T., “Simultaneous localization and mapping: Part I,” IEEE Robot. Automat. Magaz. 13(2), 99–110 (2006).

[42] Bartoli, A., Montiel, J., Lamarca, J. and Hao, Q., “DefSLAM: Tracking and Mapping of Deforming Scenes from Monocular Sequences,” (2019): arXiv: 1908.08918.

[43] Gong, H., Chen, L., Li, C., Zeng, J., Tao, X. and Wang, Y., “Online tracking and relocation based on a new rotation-invariant haar-like statistical descriptor in endoscopic examination,” IEEE Access 8, 101867–101883 (2020).

[44] Song, J., Wang, J., Zhao, L., Huang, S. and Dissanayake, G., “Mis-slam: Real-time large-scale dense deformable slam system in minimal invasive surgery based on heterogeneous computing,” IEEE Robot. Automat. Lett. 3(4), 4068–4075 (2018).

[45] Wei, G., Feng, G., Li, H., Chen, T., Shi, W. and Jiang, Z., “A Novel SLAM Method for Laparoscopic Scene Reconstruction with Feature Patch Tracking,” 2020 International Conference on Virtual Reality and Visualization (ICVRV), (2020) pp. 287–291.

[46] Song, J., Zhu, Q., Lin, J. and Ghaffari, M., “BDIS: Bayesian dense inverse searching method for real-time stereo surgical image matching,” IEEE Trans. Robot. 39(2), 1388–1406 (2022).

[47] Wei, G., Yang, H., Shi, W., Jiang, Z., Chen, T. and Wang, Y., “Laparoscopic Scene Reconstruction based on Multiscale Feature Patch Tracking Method,” International Conference on Electronic Information Engineering and Computer Science (EIECS), (2021) pp. 588–592.

[48] Yadav, R. and Kala, R., “Fusion of visual odometry and place recognition for slam in extreme conditions,” Appl. Intell. 52(10), 11928–11947 (2022).

[49] Bruno, H. M. S. and Colombini, E. L., “LIFT-SLAM: A deep-learning feature-based monocular visual SLAM method,” Neurocomputing 455, 97–110 (2021).

[50] Schmidt, A. and Salcudean, S. E., “Real-time rotated convolutional descriptor for surgical environments,” Medical Image Computing and Computer Assisted Intervention-MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, (2021) pp. 279–289.

[51] Xu, K., Hao, Y., Wang, C. and Xie, L., “AirVO: An illumination-robust point-line visual odometry,” (2022): arXiv preprint arXiv: 2212.07595.

[52] Li, D., Shi, X., Long, Q., Liu, S., Yang, W., Wang, F. and Qiao, F., “DXSLAM: A Robust and Efficient Visual SLAM System with Deep Features,” 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), (2020) pp. 4958–4965.

[53] Liu, X., Zheng, Y., Killeen, B., Ishii, M., Hager, G. D., Taylor, R. H. and Unberath, M., “Extremely Dense Point Correspondences using a Learned Feature Descriptor,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020) pp. 4847–4856.

[54] Liu, X., Li, Z., Ishii, M., Hager, G. D., Taylor, R. H. and Unberath, M., “Sage: Slam with Appearance and Geometry Prior for Endoscopy,” 2022 International Conference on Robotics and Automation (ICRA), (2022) pp. 5587–5593.

[55] Muja, M. and Lowe, D. G., “Fast approximate nearest neighbors with automatic algorithm configuration,” Proceedings of the Fourth International Conference on Computer Vision Theory and Applications, Lisboa, Portugal (February 5-8, 2009).

[56] Barbed, O. L., Chadebecq, F., Morlana, J., Montiel, J. M. and Murillo, A. C., “SuperPoint Features in Endoscopy,” MICCAI Workshop on Imaging Systems for GI Endoscopy, (2022) pp. 45–55.

[57] Sarlin, P.-E., Cadena, C., Siegwart, R. and Dymczyk, M., “From Coarse to Fine: Robust Hierarchical Localization at Large Scale,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019) pp. 12716–12725.

[58] Oliva Maza, L., Steidle, F., Klodmann, J., Strobl, K. and Triebel, R., “An ORB-SLAM3-based approach for surgical navigation in ureteroscopy,” Comput. Methods Biomech. Biomed. Eng. Imag. Visual. 11(4), 1005–1011 (2022).

[59] Zuiderveld, K., “Contrast limited adaptive histogram equalization,” Graphics Gems, 474–485 (1994).

[60] Jang, H., Yoon, S. and Kim, A., “Multi-session Underwater Pose-graph Slam using Inter-session Opti-acoustic Two-view Factor,” 2021 IEEE International Conference on Robotics and Automation (ICRA), (2021) pp. 11668–11674.

[61] Rao, S., “SuperVO: A Monocular Visual Odometry based on Learned Feature Matching with GNN,” 2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE), (2021) pp. 18–26.

[62] Su, Y. and Yu, L., “A dense RGB-D SLAM algorithm based on convolutional neural network of multi-layer image invariant feature,” Meas. Sci. Technol. 33(2), 025402 (2021).

[63] Strasdat, H., Montiel, J. and Davison, A. J., “Real-time Monocular SLAM: Why Filter?,” IEEE International Conference on Robotics and Automation, (2010) pp. 2657–2664.

[64] Szeliski, R., Computer Vision: Algorithms and Applications (Springer Nature, 2022).
