1. Introduction
Visual odometry (VO) enables robots to perceive their surrounding environment and localize themselves, and is thus essential in many applications, such as autonomous driving and augmented reality. VO is generally regarded as a key component of visual Simultaneous Localization And Mapping (v-SLAM), which further incorporates loop closure detection and global optimization modules to alleviate the drift accumulated over long runs [Reference Company-Corcoles, Garcia-Fidalgo and Ortiz1]. Unlike LiDAR- and GPS-based methods, VO estimates the trajectory of a robot by analyzing the sequence of images it captures. Cameras provide abundant texture information at a lower hardware cost, and with the availability of RGB-D cameras, the depth of the environment can be measured efficiently, especially indoors.
In recent years, significant advancements have been made in real-time methods for SLAM and VO. Typical systems [Reference Mur-Artal, Montiel and Tardos2–Reference Engel, Koltun and Cremers6] directly operate on pixels and estimate camera poses through correspondences between pixels. Feature-based methods [Reference Mur-Artal, Montiel and Tardos2–Reference Wang and Lin4] find corresponding feature points and estimate camera poses by solving the PnP problem. Feature points are distinctive in texture-rich environments, but their performance generally degrades where texture is poor. Direct methods [Reference Kerl, Sturm and Cremers5, Reference Engel, Koltun and Cremers6], which operate on raw intensity measurements, are often affected by brightness changes. Deep learning-based approaches [Reference Jia, Pu, Chen, Cheng, Liao and Yang7, Reference Yang, v. Stumberg, Wang and Cremers8] have shown promising results with end-to-end training, but face challenges in generalization and computational efficiency. High-level features such as lines [Reference Taubner, Tschopp, Novkovic, Siegwart and Furrer9–Reference Zhao, Huang, Yan and Dissanayake11] and planes [Reference Yang, Song, Kaess and Scherer12, Reference Yang and Scherer13] describe the environment structurally. Fitted from large numbers of points, planes can effectively suppress the influence of measurement noise, especially in indoor environments. Indoor environments are mostly composed of manufactured objects and structures whose surfaces are typically bounded planes with contours and edges. Planes representing the indoor environment can therefore be exploited to construct constraints for robust pose estimation.
In this paper, we design features at the superpixel level. A superpixel is a set of adjacent pixels with similar characteristics, including similar brightness, similar texture, and low contour energy inside the region [Reference Ren and Malik14]. Superpixels have good properties such as content sensitivity, high information aggregation, and edge preservation. We design SegPatch as a superpixel-based feature for scene representation instead of points and planes. SegPatch exploits the regional information of the superpixel to approximate the minor details of small structures, achieving higher computational efficiency than planes in complex environments. At the same time, SegPatch removes spatial redundancy, making it less affected by measurement noise and more distinctive, thereby offsetting the lack of feature points in poorly textured environments. As a result, SegPatch can robustly characterize environments with various texture densities. Correspondingly, MapPatch is designed for mapping. Its structural parameters are estimated from depth measurements and updated with the local map.
We propose a superpixel-based visual odometry to achieve robust and accurate pose estimation. Using an RGB-D camera, we decompose the RGB image into superpixels and estimate structural parameters from the depth image. SegPatches are then constructed and their corresponding MapPatches are initialized. We propose a novel matching algorithm to establish correspondences between two consecutive frames or between frames and the local map. The similarity between SegPatches is measured by a dedicated distance. To improve computational efficiency, we propose a novel searching method that follows the idea of approximate nearest neighbor search using hierarchical navigable small world graphs (HNSW [Reference Malkov and Yashunin15]) in scale space. By constructing constraints derived from the variation in brightness, epipolar geometry, and normal consistency, the camera pose is finally estimated from all matched correspondences, and the local map is built simultaneously. Figure 1 shows the framework of our proposed VO system. We provide a real-world evaluation on the TUM RGB-D [Reference Sturm, Engelhard, Endres, Burgard and Cremers16] dataset to demonstrate the feasibility of using superpixel-based features in a VO system and to show that the proposed VO adapts well to different environments.
In summary, the contributions of this paper are as follows:
1. We design a superpixel-based feature named SegPatch. Through the approximation of small structures and the elimination of redundant information, SegPatch shows good adaptability and uniqueness in different environments.
2. We propose a novel matching algorithm for SegPatches. The efficient searching strategy and accurate similarity measurement robustly provide credible correspondences for further use.
3. We propose a superpixel-based RGB-D visual odometry and evaluate it on the TUM indoor datasets, showing good adaptability in various environments and achieving results comparable with state-of-the-art VO systems.
The rest of the paper is structured as follows. Section 2 provides background on related work. We first introduce the superpixel-based feature SegPatch in Section 3, then present the matching algorithm and pose estimation method within the VO framework (Section 4), followed by implementation details and experiments in Section 5. Section 6 concludes the paper.
2. Related work
According to their tracking and mapping methods, traditional geometry-based VO approaches can be roughly divided into feature-based and direct methods [Reference Yang, Wang and Cremers17]. Among them, feature-based approaches are typically more robust to illumination changes than direct methods. In this paper, our proposed visual odometry constructs and matches features at the superpixel level. As such, our method is related to the existing VO methods reviewed below.
Point-feature-based methods. A number of point-feature-based methods have been reported; notable examples are ORB-SLAM [Reference Mur-Artal, Montiel and Tardos2], RGBD-SLAM [Reference Endres, Hess, Sturm, Cremers and Burgard18], and DVO [Reference Kerl, Sturm and Cremers5]. Most of these methods have been extended to stereo or RGB-D versions, such as ORB-SLAM2 [Reference Mur-Artal and Tardós19] and DVO for RGB-D cameras [Reference Kerl, Sturm and Cremers20]. Point-feature-based methods generally extract and match highly salient features and minimize the geometric reprojection error. The availability of robust feature detectors and descriptors enables feature-based methods to construct correspondences even under large inter-frame motion [Reference Pretto, Menegatti, Bennewitz, Burgard and Pagello21]. Therefore, point-feature-based methods are often robust in richly textured scenes that involve varying illumination, occlusions, and large viewpoint changes. However, the heavy dependence on texture impairs their robustness in poorly textured environments due to the shortage of reliable feature points.
Plane- and surfel-based methods. To address the problems of point features, planes have been exploited to improve performance in a few systems. Earlier works like [Reference Weingarten and Siegwart22] added planes into the extended Kalman filter state vector. However, the growing size of the dense covariance matrix leads to a large computational cost, so these methods are only applicable to small scenes. In ref. [Reference Lee, Lim, Lee, An and Oh23], a fast plane extraction and matching method for indoor mapping was proposed, and a minimal representation for infinite planes was introduced in ref. [Reference Kaess24]. CPA-SLAM [Reference Ma, Kerl, Stückler and Cremers25] modeled the environment with a global plane model and used the planes for tracking and global graph optimization. KDP-SLAM [Reference Hsiao, Westman, Zhang and Kaess26] then proposed a keyframe-based approach built on dense plane extraction and matching using a projective plane association algorithm. In addition, the authors of [Reference Taguchi, Jian, Ramalingam and Feng27, Reference Zhang, Wang, Qi, Liao and Wei28] used points and planes together and optimized them jointly for higher accuracy and efficiency. Structural properties of indoor scenes, such as the Manhattan world assumption, have been exploited to construct relationships between planes and points [Reference Li, Yunus, Brasch, Navab and Tombari29].
Compared to the methods above that extract planes from point clouds, surfels [Reference Pfister, Zwicker, van Baar and Gross30], as finite planar elements, often provide more accurate surface representation, better surface fitting, and more efficient computation. DPPTAM [Reference Concha and Civera31] modeled the environment with 3D points in high-gradient areas and 3D piecewise planes in low-gradient areas for dense scene reconstruction. The Probabilistic Surfel Map [Reference Yan, Ye and Ren32] was proposed to maintain a globally consistent map with both photometric and geometric uncertainties encoded. UcoSLAM [Reference Muñoz-Salinas and Medina-Carnicer33] combined points and squared planar markers as features, alleviating the relocalization problem in repetitive environments. In SP-SLAM [Reference Cho, Jo and Kim34], feature points and surfels were also optimized together to eliminate the effect of different texture densities on the system. MSC-VO [Reference Company-Corcoles, Garcia-Fidalgo and Ortiz1] introduced structural constraints combined with estimated Manhattan axes and reprojection errors, allowing the method to work in a wider variety of scenarios. In ManhattanSLAM [Reference Yunus, Li and Tombari35], surfels were initialized directly from sparse planes by extracting superpixels, exploiting the structural information of indoor scenes.
Learning-based methods. The great progress of deep learning has stimulated learning-based VO, which predicts the relative poses between images with supervised [Reference Wang, Clark, Wen and Trigoni36] or unsupervised learning [Reference Li, Wang, Long and Gu37]. DeepVO [Reference Wang, Clark, Wen and Trigoni36] employed an RNN to predict camera poses end-to-end from an input image sequence, and UndeepVO [Reference Li, Wang, Long and Gu37] concurrently trained a pose network and a depth network to form a CNN-based VO system. DVSO [Reference Yang, Wang, Stuckler and Cremers38] proposed a virtual stereo term that incorporates the depth estimation from a semi-supervised network into a direct VO pipeline. D $^2$ VO [Reference Jia, Pu, Chen, Cheng, Liao and Yang7] presented a novel monocular visual odometry based on deep learning and direct methods, and D3VO [Reference Yang, v. Stumberg, Wang and Cremers8] exploited deep networks on three levels – deep depth, pose, and uncertainty estimation. However, although learning-based VO has advantages in feature extraction and robustness, it suffers from high computational demands and limited generalization.
In this article, we propose a new VO method in which SegPatch is used as a superpixel-based feature. SegPatch compensates for the shortcomings of points and surfels in representing various environments. Moreover, our superpixel-based method also holds potential for combination with learning-based methods, given its ability to extract informative and compact feature descriptors from images. A detailed explanation of the superpixel-based VO is given in the following sections.
3. Concepts and model formulation
This section explains the fundamental concepts that are necessary to follow the remainder of the paper. Our method is the first VO system to use superpixel-based features; we therefore design the superpixel-based feature SegPatch as its basis.
3.1. Superpixel
Superpixels are sub-regions of an over-segmented image that group pixels with similar information, and their layout is sensitive to variations in texture density. Using the SLIC [Reference Achanta, Shaji, Smith, Lucchi, Fua and Süsstrunk39] algorithm, superpixels are initialized on a regular grid and the pixels are then clustered locally. To keep the central intensity gradient fluctuating within a specific range, we set upper and lower limits for merging and splitting. Iterations are repeated until every pixel is classified stably. An image $I$ is finally decomposed into $M$ superpixels $I=\{S_i, i = 1,\ldots, M\}$ , where $\forall S_i, S_j \in I, i\neq j$ , $S_i \cap S_j = \emptyset$ . For a region $\Omega \subseteq I$ , the higher the texture density in $\Omega$ , the more superpixels are generated, and vice versa. Thus, the distribution of superpixels of different sizes reflects the complexity of the environment. Due to the intra-region similarity and inter-region dissimilarity, superpixels maintain only the major information of the sub-regions, ensuring their effectiveness in various environments.
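As an illustration, the following sketch shows how an image can be decomposed into superpixels and how per-superpixel statistics (barycenter, mean LAB color, mean intensity, and size) can be collected using OpenCV's SLIC implementation. The segmentation parameters are illustrative, and the merge/split limits on the central intensity gradient described above are omitted; this is a minimal sketch, not the exact implementation used in our system.

```python
# Sketch: superpixel decomposition with OpenCV's SLIC (requires opencv-contrib-python).
# The region_size and ruler values are illustrative, not the paper's settings.
import cv2
import numpy as np

def decompose_into_superpixels(bgr, region_size=25, ruler=10.0, iterations=10):
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    slic = cv2.ximgproc.createSuperpixelSLIC(lab, cv2.ximgproc.SLIC, region_size, ruler)
    slic.iterate(iterations)              # local clustering until labels stabilize
    slic.enforceLabelConnectivity(25)     # merge tiny fragments
    labels = slic.getLabels()             # per-pixel superpixel index
    stats = []
    # Assumes every label is non-empty after connectivity enforcement.
    for i in range(slic.getNumberOfSuperpixels()):
        ys, xs = np.nonzero(labels == i)
        stats.append({"barycenter": (xs.mean(), ys.mean()),   # p_i = (u_i, v_i)
                      "mean_lab": lab[ys, xs].mean(axis=0),   # c_i = (l_i, a_i, b_i)
                      "mean_intensity": gray[ys, xs].mean(),  # I(S_i)
                      "size": xs.size})                       # |S_i| = pi * r_i^2
    return labels, stats
```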
3.2. SegPatch and MapPatch
Because of its irregular and unstable shape, a single superpixel is insufficient to provide a robust image descriptor. We propose SegPatch as a superpixel-based feature, extending SuperPatch [Reference Giraud, Ta, Bugeau, Coupe and Papadakis40]. A SegPatch $\textbf{S}_i$ is centered at the superpixel $S_i$ and is combined with its neighboring superpixels $\{S_{i^{\prime}}\}$ such that $\textbf{S}_i = \{ S_{i^{\prime}}| \| c_i - c_{i^{\prime}} \|_2 \leq \mathcal{R} \}$ , where $c_i$ is the spatial barycenter and $\mathcal{R}$ is a parameter based on the scale factor and the superpixel granularity. We use $\mathcal{R}$ to control the number of neighboring superpixels so that every SegPatch has a similar topological size.
The SegPatch is then constructed as a dual descriptor $\textbf{S}_i = \{F_{S_i}, \textbf{X}_{S_i}\}$ , including the inner-feature $F_{S_i}$ and the inter-features $\textbf{X}_{S_i}$ . The inner-feature $F_{S_i} = \{\textbf{p}_i, \textbf{c}_i, r_i, s_i\}$ contains the information of the central superpixel, where $\textbf{p}_i = (u_i, v_i)^T$ is the coordinate of the barycenter, $\textbf{c}_i = (l_i, a_i, b_i)$ is the mean color in the LAB color space, $\pi r_i^2$ represents the size of the central superpixel, and $s_i$ is the scale coefficient. The inter-features $\textbf{X}_{S_i} = \{X_{S_{i^{\prime}}}\}_{i^{\prime}=1,2,\ldots, m}$ describe the relationship between the central superpixel and the neighboring superpixels. For every neighboring element, $X_{S_{i^{\prime}}} = (u_i - u_{i^{\prime}}, v_i - v_{i^{\prime}})$ is a vector pointing toward the central superpixel, where $(u_{i^{\prime}},v_{i^{\prime}})$ is the barycenter of the neighboring superpixel. Let $I(S_i)$ be the mean intensity of the superpixel; the gradient of the SegPatch
is a holistic character to describe the direction of change in a small region.
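The construction of the dual descriptor can be sketched as follows, reusing the per-superpixel statistics gathered above. The neighborhood radius $\mathcal{R}$ is passed in directly, and the gradient computed here (an intensity-difference-weighted sum of the inter-feature vectors) is an illustrative stand-in for the definition given by the equation above rather than its exact form.

```python
# Sketch: assembling a SegPatch {F_S, X_S} from per-superpixel statistics.
# The gradient below is an illustrative stand-in for the paper's own definition.
import numpy as np

def inner_feature(stat, scale):
    return {"p": np.array(stat["barycenter"], dtype=float),   # barycenter p_i
            "c": np.asarray(stat["mean_lab"], dtype=float),   # mean LAB color c_i
            "r": np.sqrt(stat["size"] / np.pi),               # radius r_i
            "s": scale}                                       # scale coefficient s_i

def build_segpatch(i, stats, radius, scale=1.0):
    """stats[k]: dict with 'barycenter', 'mean_lab', 'size', 'mean_intensity'."""
    Fi = inner_feature(stats[i], scale)
    neighbors, grad = [], np.zeros(2)
    for j, s in enumerate(stats):
        if j == i:
            continue
        X = Fi["p"] - np.array(s["barycenter"], dtype=float)  # vector toward the center
        if np.linalg.norm(X) <= radius:                       # neighborhood radius R
            neighbors.append((inner_feature(s, scale), X))
            diff = stats[i]["mean_intensity"] - s["mean_intensity"]
            grad += diff * X / (np.linalg.norm(X) + 1e-9)     # stand-in gradient
    return {"inner": Fi, "neighbors": neighbors, "grad": grad}
```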
In 3D space, we further construct the MapPatch for scene reconstruction. A MapPatch is described as $\Sigma _M = [\{P_i\}_M,\textbf{n}_M,R_M]$ and maintains the indices of its corresponding SegPatches. $\{P_i\}_M$ is the set of control points in the world coordinate frame, where $P_i$ is initialized from the barycenter of SegPatch $\textbf{S}_i$ , that is, $P_i = K^{-1}\textbf{p}_i$ . The SVD method is used to initialize the normal $\textbf{n}_M$ from the depth data. The radius $R_M = 2rz/(f_x+f_y)$ is also initialized so that the projection of the MapPatch covers the corresponding superpixel in the observed frame. When a MapPatch is observed from different perspectives, the number of its control points increases, and the normal $\textbf{n}_M$ is fine-tuned by minimizing a fitting error defined as
where $\bar{P}$ is the mean of the control points $P_i$ and $b$ estimates the bias. $R^2_M$ is always the minimal size from $\bar{P}$ that covers all the corresponding superpixels in different frames.
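The normal initialization and refinement can be sketched as a standard least-squares plane fit via SVD: the normal is the right singular vector associated with the smallest singular value of the centered control points. This is a generic plane-fitting sketch and does not reproduce the exact fitting error and bias term defined above.

```python
# Sketch: estimating a MapPatch normal from its control points with SVD.
# Generic least-squares plane fit; the paper's exact error/bias formulation may differ.
import numpy as np

def fit_mappatch_plane(control_points):
    """control_points: (N, 3) array of 3D points in the world frame."""
    P = np.asarray(control_points, dtype=float)
    P_bar = P.mean(axis=0)                       # mean of the control points
    _, _, Vt = np.linalg.svd(P - P_bar)          # SVD of the centered points
    normal = Vt[-1]                              # direction of least variance
    normal /= np.linalg.norm(normal)
    # Residual plane-fit error (sum of squared point-to-plane distances).
    residual = float(np.sum(((P - P_bar) @ normal) ** 2))
    return normal, P_bar, residual
```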
The SegPatch construction is illustrated in Figure 2. The definitions of SegPatch and MapPatch fully exploit the sensitivity of superpixels to texture density. SegPatch adaptively adjusts its size in areas with different texture densities, ensuring its robustness in various environments. In poorly textured regions, a SegPatch covers a larger area, and its saliency is greatly improved by removing the image redundancy; using structural parameters, the MapPatch models the environment structurally. In areas with rich texture, superpixels degenerate toward individual pixels, and SegPatch, as a superpixel-based feature, remains unique and robust due to the inclusion of local information. Since such SegPatches are small, the accuracy of the MapPatch does not suffer from the approximation; on the contrary, the fine-grained structure of the fragments preserves the accuracy of the reconstruction. Additionally, as a regional feature, SegPatch is also insensitive to image noise and motion blur.
3.3. Scale space and image pyramid
SegPatch needs to be constructed at different scales because searching for correspondences requires comparing images seen at different scales. The scale space is commonly represented by an image pyramid: the image is repeatedly smoothed with a Gaussian filter and then sub-sampled to obtain the next, higher level of the pyramid. The down-sampling procedure reflects the variation of the observation distance. We extract superpixels at each layer of the pyramid and assign each of them to a superpixel at the next higher layer. Superpixels at the same level assigned to the same parent are sibling nodes of each other and children of that parent. Thus, the pyramid is a fully connected hierarchical graph in the form of a multi-tree. The construction of the image pyramid is shown in Figure 3a.
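A minimal sketch of this construction is given below: each level is obtained by Gaussian smoothing and 2x sub-sampling (here via cv2.pyrDown), superpixels are extracted per level with the segmentation sketch from Section 3.1, and each superpixel is linked to the parent at the next higher level that contains its down-scaled barycenter. The number of levels and segmentation parameters are illustrative.

```python
# Sketch: building a superpixel image pyramid and linking children to parents.
# decompose_into_superpixels() is the segmentation sketch from Section 3.1.
import cv2

def build_superpixel_pyramid(bgr, levels=4):
    pyramid = []
    img = bgr
    for level in range(levels):
        labels, stats = decompose_into_superpixels(img)
        pyramid.append({"image": img, "labels": labels, "stats": stats, "parent": {}})
        img = cv2.pyrDown(img)                   # Gaussian smoothing + 2x sub-sampling

    # Link every superpixel to the parent at the next higher (coarser) level
    # whose region contains its down-scaled barycenter.
    for level in range(levels - 1):
        fine, coarse = pyramid[level], pyramid[level + 1]
        h, w = coarse["labels"].shape
        for i, s in enumerate(fine["stats"]):
            u, v = s["barycenter"]
            uu = min(int(u / 2), w - 1)
            vv = min(int(v / 2), h - 1)
            fine["parent"][i] = int(coarse["labels"][vv, uu])
    return pyramid
```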
After constructing the image pyramid and extracting superpixels for multiple images of different scenes and views, we find that the number of superpixels decreases exponentially from lower to higher layers, and that the sizes of superpixels segmented at the same scale follow the same exponential distribution. The spatial distribution of superpixel counts across layers is also uniform. The pyramid can thus distinguish richly textured regions from poorly textured ones.
4. Superpixel-based visual odometry
The structure of the proposed superpixel-based visual odometry is shown in Figure 1. The system starts with RGB-D image preprocessing, in which images are segmented into superpixels, SegPatches are constructed in scale space, and MapPatches are initialized. The searching and matching procedure provides correspondences between two consecutive frames or between frames and the local map. Finally, the camera pose is estimated by minimizing the reprojection errors between SegPatch measurements and observations. The local map is updated in a separate thread. This section describes the key methods of the proposed VO in detail.
4.1. SegPatch comparison, searching, and matching
For two SegPatches, the number of elements and the geometry are generally different, so they are difficult to compare directly. To measure the similarity between two SegPatches $\textbf{S}_i^A$ and $\textbf{S}_j^B$ in different images $I_A$ and $I_B$ , the distance is defined as follows.
where $d$ is the Euclidean distance between the inner-features. The weight $\omega$ represents the relative overlapping area of the neighboring superpixels $S^A_{i^{\prime}}$ and $S^B_{j^{\prime}}$ and follows a symmetric Gaussian distribution
where $x_{i^{\prime}j^{\prime}} = (X_{S_{i^{\prime}}}-X_{S_{j^{\prime}}}) / s_is_j$ is the relative distance between $\textbf{p}_{i^{\prime}}$ and $\textbf{p}_{j^{\prime}}$ , and $\varpi (S^A_{i^{\prime}})$ weights the influence of $S^A_{i^{\prime}}$ according to its distance to $S^A_{i}$ as $\varpi (S^A_{i^{\prime}}) = \exp \left (-|X_{S_{i^{\prime}}}|^2/(s_i\sigma _2)^2\right )$ ; $\varpi (S^B_{j^{\prime}})$ is defined analogously. $s_i$ , $s_j$ are scale coefficients, and $\sigma _1$ , $\sigma _2$ are two control parameters depending on the superpixel segmentation. Depending on the superpixel decomposition scale, $\sigma _1 = \frac{1}{2}\sqrt{N/M}$ for an image with $N$ pixels decomposed into $M$ superpixels is set as half of the average distance between superpixel barycenters. Depending on the SegPatch size, $\sigma _2 = \sqrt{2}\mathcal{R}$ is set to weight the contribution of the closest superpixels.
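The following sketch illustrates how such a weighted distance can be evaluated. Since the full combination of the terms is given by the equation above and not reproduced here, the normalized weighted sum used below (in the style of SuperPatchMatch) and the choice of comparing mean color and radius in $d$ are assumptions made for illustration only.

```python
# Sketch: a SuperPatchMatch-style weighted distance between two SegPatches.
# The exact combination of terms is the paper's own equation; this normalized
# weighted sum is an illustrative assumption, not the definitive formula.
import numpy as np

def inner_distance(Fa, Fb):
    # Illustrative choice: compare mean LAB color and radius only.
    va = np.concatenate([Fa["c"], [Fa["r"]]])
    vb = np.concatenate([Fb["c"], [Fb["r"]]])
    return float(np.linalg.norm(va - vb))

def segpatch_distance(SA, SB, sigma1, sigma2):
    """SA, SB: dicts with 'inner' (central F) and 'neighbors' = [(F_k, X_k), ...]."""
    sa, sb = SA["inner"]["s"], SB["inner"]["s"]
    # Include the central superpixels themselves with zero displacement.
    elems_a = [(SA["inner"], np.zeros(2))] + SA["neighbors"]
    elems_b = [(SB["inner"], np.zeros(2))] + SB["neighbors"]
    num, den = 0.0, 0.0
    for Fa, Xa in elems_a:
        wa = np.exp(-np.dot(Xa, Xa) / (sa * sigma2) ** 2)      # varpi(S^A_{i'})
        for Fb, Xb in elems_b:
            wb = np.exp(-np.dot(Xb, Xb) / (sb * sigma2) ** 2)  # varpi(S^B_{j'})
            x = (Xa - Xb) / (sa * sb)                          # relative displacement x_{i'j'}
            w = np.exp(-np.dot(x, x) / sigma1 ** 2)            # overlap weight omega
            num += w * wa * wb * inner_distance(Fa, Fb)
            den += w * wa * wb
    return num / den if den > 0.0 else float("inf")
```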
The matching task in VO consists of matching between two consecutive frames (2d-2d) and matching between frames and the local map (2d-3d). PatchMatch (PM) [Reference Barnes, Shechtman, Finkelstein and Goldman41] is a classical method to compute pixel-based patch correspondences between two images. As an extension of PM, SuperPatchMatch (SPM) [Reference Giraud, Ta, Bugeau, Coupe and Papadakis40] provides correspondences of irregular structures from superpixel decompositions. These two methods share the key idea that good correspondences can be propagated to adjacent patches within an image. However, the propagation step takes a long time to find the optimal solution.
Instead of processing images in scan order as in PM and SPM, our proposed matching algorithm exploits a graph search method for fast candidate search. We propose a new modification of the HNSW algorithm [Reference Malkov and Yashunin15] and demonstrate that it is applicable. HNSW is an approximate K-nearest neighbor search method based on navigable small-world graphs with controllable hierarchy. The image pyramid satisfies the four requirements for implementing the HNSW algorithm: (1) the image pyramid is a hierarchical graph; (2) the number of nodes (superpixels) decreases exponentially toward higher layers; (3) the highest layer onto which a node is projected is obtained by an exponential-decay probability function based on its position and size; (4) a node can be reached in every layer by descending from the highest layer. Furthermore, assuming that the image sequence is consecutive and the motion is moderate, we set a trust region for candidate selection; a candidate located in the trust region has a higher probability of being the best match. The search proceeds layer by layer from coarse to fine in both scale and position. The HNSW algorithm and our proposed method are shown in Figure 3.
The following steps are then performed to find and refine the correspondences between frames. For the sake of clarity, we assume an $N$ -layer image pyramid with top layer $I_0$ ; the current best match of a SegPatch $\textbf{S}_i^A \in I^A$ in $I^B_k$ is denoted as $\mathcal{M}^B_k(i)$ , storing the distance and the index of the corresponding SegPatch. First, for every SegPatch $\textbf{S}_i^A \in I^A$ , we delineate a trust region in $I_0^B$ according to the position and size of $\textbf{S}_i^A$ . Matching candidates are selected from the trust region. After similarity measurement and comparison, the best match $\mathcal{M}^B_0(i)$ is determined. Then, the children of $\mathcal{M}^B_0(i)$ become the candidates of the next layer. Several other SegPatches in the trust region are also selected randomly to escape possible local minima. This process is repeated until the bottom of the pyramid is reached. Instead of updating the best corresponding SegPatch during traversal, obtaining the best match in every layer individually assures scale stability. Generally, due to the small motion between two consecutive frames, there is little change of scale between two corresponding SegPatches; sometimes, however, most SegPatches scale up or down when the camera moves forwards or backwards. As with feature points, the consistency of the intensity-orientation variation is checked, and incompatible matches are removed at the end.
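A compact sketch of this coarse-to-fine search is given below. It reuses the distance sketch from above; trust_region_candidates() and children_of() are hypothetical helpers standing in for the trust-region test and the pyramid bookkeeping, and each pyramid layer is assumed to carry a list of its SegPatches.

```python
# Sketch: coarse-to-fine 2d-2d search with a trust region, in the spirit of HNSW.
# trust_region_candidates() and children_of() are hypothetical helpers; each layer
# of pyramid_B is assumed to carry a "segpatches" list.
import random

def match_segpatch(query, pyramid_B, sigma1, sigma2, n_random=3):
    """Return the index of the best-matching SegPatch at the bottom layer of pyramid_B."""
    top = len(pyramid_B) - 1
    candidates = trust_region_candidates(query, pyramid_B[top])
    best = None
    for level in range(top, -1, -1):
        layer = pyramid_B[level]
        region = trust_region_candidates(query, layer)
        # A few random picks from the trust region help escape local minima.
        pool = set(candidates) | set(random.sample(region, min(n_random, len(region))))
        if not pool:
            break
        # The best match is determined individually at every layer (scale stability).
        best = min(pool, key=lambda j: segpatch_distance(
            query, layer["segpatches"][j], sigma1, sigma2))
        if level > 0:
            candidates = children_of(layer, best)   # candidates for the next, finer layer
    return best
```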
The 2d-3d matching is used to construct a stronger connection with the local map. It is a guided match because 2d-3d matching is performed when an estimated or predicted camera pose is already available. Following the projection geometry, the MapPatch is reprojected onto the current image. As shown in Figure 4, the projected position and size of the MapPatch may not coincide exactly with the optimal SegPatch, but they are close. Compared to 2d-2d matching, a smaller trust region suffices here to find the nearest SegPatch. Matching candidates are also selected through the image pyramid, and the similarity between the latest corresponding SegPatch and the ones within the trust region is measured. Benefiting from the motion prior, the correspondences are constructed with high probability.
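The reprojection step can be sketched with the pinhole model, inverting the radius relation $R_M = 2rz/(f_x+f_y)$ from Section 3.2 to predict the projected radius in pixels. The world-to-camera pose convention and the data layout below are assumptions of this sketch.

```python
# Sketch: reprojecting a MapPatch into the current frame for guided 2d-3d matching.
# T_cw (4x4 world-to-camera pose) and the pinhole intrinsics K are assumed conventions.
import numpy as np

def project_mappatch(mappatch, T_cw, K):
    """mappatch: dict with 'points' (N,3), 'normal' (3,), 'radius' (scalar R_M)."""
    fx, fy = K[0, 0], K[1, 1]
    P_bar = np.asarray(mappatch["points"]).mean(axis=0)        # representative 3D point
    Pc = T_cw[:3, :3] @ P_bar + T_cw[:3, 3]                    # transform to camera frame
    if Pc[2] <= 0:
        return None                                            # behind the camera
    uv = (K @ (Pc / Pc[2]))[:2]                                 # pixel position of the center
    # Expected radius in pixels, inverting R_M = 2 r z / (f_x + f_y).
    r_px = mappatch["radius"] * (fx + fy) / (2.0 * Pc[2])
    return uv, r_px
```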
The proposed matching algorithm is well suited to these different cases. It is worth noting that our distance calculation does not focus on the difference in size and shape between the central superpixels, but rather on the similarity of the properties and geometry inside the SegPatch. As a result, the accuracy of the similarity measure is not affected by deformation observed from multiple views, motion blur, or variations in texture density. The graph search method based on the image pyramid reduces the computational complexity. In summary, our proposed matching algorithm provides robust correspondences accurately and with a small computational load. Figure 5 shows some of the matches at different texture densities, illustrating the robustness of our method.
4.2. Pose estimation
The camera pose is finally estimated through all these correspondences. Relative motion estimation is ideally performed by maximizing the overlapping area between the matched superpixels in two images.
However, due to the irregular shape of the superpixels, this computation requires the expensive count of overlapping pixels, which cancels the computational advantages of the superpixel representation. For this reason, we simplify the problem to find a suboptimal solution.
Considering two intersecting circles with the same radius $R_\circ$ on the plane, their Intersection over Union (IoU) is a nonlinear function of radius $R_\circ$ and center distance $d_\circ$ :
Letting $k = \frac{d_\circ }{2R_\circ }$ , $\phi _{IOU}(R_\circ, d_\circ )$ can be written as
To simplify the computation, $\phi _{IOU}(k)$ can be further approximated as a quadratic function of $d_\circ$
Numerical results show that our proposed approximation scheme is efficient. More generally, for two circles with different radii ${R_\circ }_1$ and ${R_\circ }_2$ , the IoU is approximated as
As shown in Figure 6, numerical experiments show that when the radii of the two circles are similar, the accuracy of the approximation is within an acceptable range for the subsequent use. Given that there is generally no significant scale variation between consecutive images, corresponding SegPatches tend to have similar sizes. Thus, fitting the complex nonlinear function with a simple quadratic function is a feasible way to obtain a suboptimal IoU solution. Equation (5) is then simplified as
where $\mu _{AB} = 1/(2\max (r_1,r_2))^2$ is a weight parameter depending on the size of the matched SegPatch pair. Thus, the problem approximately reduces to minimizing the relative distance between superpixel coordinates. The resulting weighted least-squares problem can be solved with the Gauss-Newton or Levenberg-Marquardt algorithm.
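For the equal-radius case, the exact IoU admits the closed form $\phi _{IOU}(k) = \big (\arccos k - k\sqrt{1-k^2}\big )/\big (\pi - \arccos k + k\sqrt{1-k^2}\big )$ with $k = d_\circ /(2R_\circ )$ , which can be checked numerically against a quadratic fit as sketched below. The fitted coefficients are illustrative and are not the coefficients used in Equation (5).

```python
# Numerical check: exact IoU of two equal-radius circles vs. a quadratic fit in k = d/(2R).
# The fitted coefficients are illustrative, not the paper's own approximation.
import numpy as np

def iou_equal_circles(k):
    """Exact IoU of two circles of equal radius with center distance d = 2kR, 0 <= k <= 1."""
    inter = np.arccos(k) - k * np.sqrt(1.0 - k * k)   # intersection area / (2 R^2)
    return inter / (np.pi - inter)                    # union area / (2 R^2) is pi - inter

k = np.linspace(0.0, 1.0, 200)
iou = iou_equal_circles(k)
coeffs = np.polyfit(k, iou, 2)                        # quadratic fit in k
max_err = np.max(np.abs(np.polyval(coeffs, k) - iou))
print("fit coefficients:", coeffs, "max abs error:", max_err)
```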
4.3. Local mapping
The local map is updated simultaneously to provide an optimal reconstruction in the surroundings of the camera. Newly observed MapPatches are added. A MapPatch is initialized from one frame but can be observed by others; therefore, it is reprojected onto the other active frames, and 2d-3d correspondences are searched as detailed above. MapPatches observed from multiple views are fused: by gathering the information from SegPatches in different frames, each MapPatch maintains the key information of control points, normal, and size. As the camera moves, invisible MapPatches are removed. The local mapping is also in charge of culling redundant frames. Only the frames active in the current sliding window and their corresponding high-quality MapPatches are preserved in the local map. The global map is then generated by fusing and optimizing the local map. The proposed superpixel-based visual odometry thus proceeds with continuous pose estimation and mapping.
5. Experiments
We evaluate the accuracy and efficiency of the matching algorithm on several sequences from the publicly available TUM RGB-D dataset [Reference Sturm, Engelhard, Endres, Burgard and Cremers16] and evaluate the localization performance against state-of-the-art techniques. Our method remains practicable on these real-world image sequences under realistic conditions such as image noise, motion blur, poor focus, dynamic lighting, and other difficulties. The proposed method was implemented in $\text{C}\texttt{++}$ on a standard Linux computer with four cores at 2.50 GHz and 8 GB of RAM.
5.1. Matching accuracy and efficiency
For each image in a sequence, we construct a 4-layer image pyramid with a scale factor $\beta =2$ . After superpixel segmentation, a SegPatch is generated from every superpixel that has valid depth and a convex contour.
We evaluate the matching accuracy by the absolute distance between the centers of two matched superpixels and their average overlap in two successive images. The degree of overlap is approximately quantified by the relative distance $\gamma = \frac{\|c_1-c_2 \|_{_2}}{r_1}$ , where $\pi r_1^2$ is the size of the central superpixel, $c_1$ is the ideal location of the SegPatch, and $c_2$ is the barycenter of the matched SegPatch. We counted more than 16 thousand groups of 2d-2d matched superpixels in images randomly selected from different sequences. Figure 7 shows the displacement for every matched superpixel. Most matched SegPatches are offset from the ground truth by less than 10 pixels, that is, 2.5 superpixels at the superpixel level, as shown in Figure 7a and Figure 7b, respectively. The average displacement distribution is shown in Figure 7c, demonstrating regional accuracy.
In Figure 8, we show the average relative distance with respect to the size of the trust region. Figure 8a is the direct statistical distribution for all superpixels, and Figure 8b shows the relationship obtained by Gaussian fitting. The results illustrate that a balanced trust region size provides higher matching accuracy. In areas with poor texture, a large trust region prevents the matching algorithm from falling into a local optimum, while in areas with rich texture, a small trust region protects the accuracy of the algorithm from the influence of noise. Therefore, setting the size of the trust region according to the size and scale of the SegPatch helps to provide accurate correspondences efficiently.
As for computational efficiency, our approach uses superpixels as the fundamental processing units for input images, which involves an additional superpixel segmentation step. Considering an image of $MN$ pixels and $K$ extracted features, we analyze the computational complexity of two aspects: feature extraction and matching. On the one hand, for the computational efficiency of feature extraction, we compare the time complexity of our superpixel segmentation and SegPatch extraction with that of traditional feature-based methods. Point-feature-based SLAM methods, such as ORB-SLAM [Reference Mur-Artal, Montiel and Tardos2], exhibit a time complexity of $O(MN)$ for feature extraction. Plane-based feature extraction methods using the AHC [Reference Feng, Taguchi and Kamat42] or ICP [Reference Besl and McKay43] algorithms have higher time complexities of $O((MN)^2)$ and $O((MN)^3)$ , respectively, for plane extraction, and the complexity of plane-based methods rises further in high-texture regions. Our superpixel segmentation and SegPatch extraction have a time complexity of $O(MN+K)$ , so no higher complexity is introduced in the feature extraction process. On the other hand, for the computational efficiency of the matching process, we compare the time complexity of our proposed 2d-2d matching algorithm with that of SPM [Reference Giraud, Ta, Bugeau, Coupe and Papadakis40]. SPM relies on the assumption that when a patch in $I_A$ corresponds to a patch in $I_B$ , the adjacent patches in $I_A$ should also match adjacent patches in $I_B$ . All SuperPatches are processed in scan order to search and propagate through patch-based approximate nearest neighbor (ANN) search. The time complexity of SPM is $O(K^2)$ . Due to the uncertainty of the random assignment initialization and the random search, it takes time for propagation to improve the result, and the running time grows exponentially with the number of iterations. In our algorithm, we build a trust region and search the image pyramid top-down, exploiting the consistency and continuity of the shifts within a single image. This reduces the search time for matching and increases efficiency, bringing the matching time down to $O(K\lg K)$ . Therefore, considering the aforementioned aspects, the overall time complexity of our algorithm is $O(MN+K+K\lg K)$ , indicating its efficiency in terms of computation time.
Here, we concentrate exclusively on comparing the runtime efficiency of methods operating at the superpixel level. The computational cost and the accuracy comparison with SPM are shown in Table I, where the accuracy is measured as the percentage of matched features within 2.5 equal-size superpixels of the ground truth. Computational time and accuracy are given per frame in a single thread, with the same initial region size and ruler used for superpixel extraction. SPM results are obtained with $k=20$ ANN and $R=50$ neighboring pixels; we report both the open-source Matlab version and our $\text{C}\texttt{++}$ reimplementation for a fair comparison. In our method, the matching candidates are selected from the image pyramid within the trust region. In practice, we do not perform matching on every superpixel in the image: to efficiently triangulate and obtain MapPatches after the correspondences are established, only SegPatches with valid depth measurements are processed. A multithreaded implementation is also used to shorten the computation time (around 0.02 s per frame in practice) and ensure real-time operation of the system. The experiments show that our proposed matching algorithm is efficient and accurate.
5.2. Pose estimation accuracy
We choose several representative sequences from the public TUM RGB-D dataset to evaluate our system. $\texttt{fr1/desk}$ , $\texttt{fr1/room}$ , and $\texttt{fr2/desk}$ represent indoor environments with rich texture, while $\texttt{fr3/str_notex_far}$ and $\texttt{fr3/str_notex_near}$ represent poorly textured indoor scenes. We follow the same evaluation protocol as in ref. [Reference Sturm, Engelhard, Endres, Burgard and Cremers16] by computing the maximum drift (deviation from the ground-truth pose) across the whole sequence in translation and rotation (represented as a quaternion) of the camera motion for all trajectories. We compare the estimated camera poses with the ground-truth trajectories by computing the relative trajectory root-mean-square error (RMSE) and the absolute trajectory (AT) RMSE in various challenging scenes. After projecting the local MapPatches onto the current frame using the estimate obtained from the 2d-2d matching results, the pose estimation limits the relative translation error to around 2.16 cm and the rotation error to below 1.05 degrees. For most sequences, we achieve results comparable with most mainstream VO algorithms, meaning that our method can maintain correct camera trajectories in challenging scenes. Table II shows the results. It is worth noting that although our method may not be the best for any single texture density, it shows a good balance in adapting to different environments, as shown in Figure 9. Additional components, such as global bundle adjustment and loop closure, are left for future work to make the system more competitive.
6. Conclusion
We have presented a novel attempt at superpixel-based visual odometry for motion estimation. The design of the superpixel-based feature SegPatch is essential to the scene representation. The searching and matching algorithm constructs reliable SegPatch correspondences with high efficiency. The proposed visual odometry operates robustly under various conditions such as different texture densities, image noise, and motion blur. We have provided an evaluation of the quality of matching and pose estimation on a variety of real-world sequences to show how our method works and to demonstrate its utility for SLAM.
Author contributions
All authors contributed to the study conception, design, and implementation. Material preparation, data collection, and analysis were performed by Meiyi Yang. The original draft of the manuscript was written by Meiyi Yang and reviewed and edited by Junlin Xiong and Youfu Li. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Financial support
This work was supported by National Natural Science Foundation of China under Grant 62273320 and 62173286.
Competing interests
The authors declare no competing interests exist.
Ethical approval
Not applicable.