
Reinforcement learning-based motion control for snake robots in complex environments

Published online by Cambridge University Press:  12 February 2024

Dong Zhang
Affiliation:
College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
Renjie Ju
Affiliation:
College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
Zhengcai Cao*
Affiliation:
College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
*
Corresponding author: Zhengcai Cao; Email: [email protected]

Abstract

Snake robots can move flexibly due to their special bodies and gaits. However, it is difficult to plan their motion in multi-obstacle environments because of their complex models. To solve this problem, this work investigates a reinforcement learning-based motion planning method. To plan feasible paths, a Floyd-moving average algorithm is proposed together with a modified deep Q-learning algorithm, ensuring that the planned paths are smooth and passable for snake robots. An improved path integral algorithm is used to compute gait parameters that drive snake robots along the planned paths. To speed up the training of parameters, a strategy combining serial training, parallel training, and experience replay modules is designed. Moreover, we design a motion planning framework consisting of path planning, path smoothing, and motion planning modules. Various simulations are conducted to validate the effectiveness of the proposed algorithms.

Type
Research Article
Copyright
© The Author(s), 2024. Published by Cambridge University Press

1. Introduction

Inspired by biological snakes, engineered snake robots have slim bodies and multiple degrees of freedom (DOFs). They can move flexibly in narrow and complex environments [Reference Liljebäck, Pettersen, Stavdahl and Gravdahl1]. Due to these features, they have good potential to replace humans in special tasks such as disaster rescue and monitoring [Reference Hopkins and Gupta2]. They have received great attention since the first one was designed by Hirose in 1972 [Reference Hirose3]. Compared with wheeled robots, they are more difficult to control due to their complex mechanisms and special gaits [Reference Tanaka and Tanaka4].

In the past years, most research has focused on the design, modeling, and gait control of snake robots. Path planning and path following are also necessary for snake robots to move autonomously in complex environments [Reference Wang, Yang, Shen and Shao5]. Mathematical models are considered the basis of their path planning and following. In ref. [Reference Singh, Gong and Choset6], a unique mathematical model in which contact forces are mapped on the basis of a viscous friction model is presented. It accurately predicts the behavior of a robot in a cluttered environment. In refs. [Reference Liljebäck, Pettersen, Stavdahl and Gravdahl7, Reference Liljebäck, Pettersen, Stavdahl and Gravdahl8], obstacle-aided locomotion strategies are investigated for modular snake robots in cluttered environments. In ref. [Reference Li, Zhang, Li, Wu, Law, Xu, Song and Zhu9], Li designs an anti-sideslip line-of-sight method-based adaptive path following controller for snake robots. Later, he proposes a novel stable path-following method to fulfill formation control for multiple snake robots [Reference Li, Zhang, Law, Wu and Xu10]. An obstacle-aided locomotion method based on piecewise helixes is proposed to improve the mobility of snake robots in cluttered environments in ref. [Reference Takanashi, Nakajima, Takemori and Tanaka11]. In ref. [Reference Liu and Dai12], a trajectory planning method is proposed for dual robotic fingers manipulating cartons in a complex folding process. These studies have advanced the state of the art of multi-DOF robots despite the complexity of the underlying models.

To simplify these models, a framework that generates paths in a low-dimensional workspace and selects generated gaits in a snake robot’s shape space is proposed in ref. [Reference Hatton, Knepper, Choset, Rollinson, Gong and Galceran13]. Typically, a trajectory optimization method is developed by using $L_p$-norm-based representations to encode collision constraints between robots and obstacle bounding boxes [Reference Alexander, Nak-seung, Erik and Patricio14]. The method fails to deal with complex or unstructured environments. To stabilize constraint manifolds, an analytic smooth feedback control law is proposed in ref. [Reference Rezapour, Pettersen, Liljebäck and Gravdahl15]. A simplified snake robot model is presented for motion control [Reference Fukushima, Yanagiya, Ota, Katsumoto and Matsuno16]. Snake robots can move in multi-obstacle environments by using the above methods based on accurate dynamic models. However, it is difficult to derive such models due to their complex mechanisms. Motion planning of snake robots includes two steps: path planning for their centers of mass (CMs) in the workspace and the corresponding path following. Many path following methods for snake robots have been studied in the past years [Reference Rezapour and Liljebäck17–Reference Liljebäck, Pettersen, Stavdahl and Gravdahl21]. In ref. [Reference Rezapour and Liljebäck17], a guidance-based path following law is designed for them. A dynamic model and a trajectory tracking control law are proposed to track paths without singular configurations [Reference Matsuno and Sato18]. In ref. [Reference Li, Pan, Deng and Hu19], an adaptive path following controller of snake robots is designed based on an improved serpenoid curve. In ref. [Reference Cao, Zhang and Zhou20], an adaptive path following law is proposed and verified for snake robots. However, since snake robots move forward by swinging their bodies, their trajectories are oscillatory curves, and it is difficult to control them to move along desired paths [Reference Liljebäck, Pettersen, Stavdahl and Gravdahl21]. Control parameters must be computed from path following laws in a timely manner, which makes these methods time-consuming and inefficient.

Different from those model-based methods, suitable control parameters can be obtained without any prior experience by using neural networks or reinforcement learning (RL) algorithms. In ref. [Reference Liu and Farimani22], an energy-efficient snake gait policy is found by using a deep reinforcement learning (DRL) algorithm. A deep deterministic policy gradient algorithm is used to tune a gait model and obtain optimal parameters of a central pattern generator [Reference Qiu, Zhang, Lv, Wang, Zhou and Xiong23]. A supervised multi-layered spiking neural network is designed to achieve 3D gaits for snake robots tracking moving objects in ref. [Reference Jiang, Otto, Bing, Huang and Knoll24]. The methods in refs. [Reference Liu and Farimani22–Reference Jiang, Otto, Bing, Huang and Knoll24] are more effective than the traditional empirical tuning process. However, they are not suitable for complex and changing environments. To improve the environmental adaptability of snake robots, a DRL-based framework with a double deep Q-learning technique is proposed to learn the optimal policy for reaching goal points in unknown environments in ref. [Reference Khan, Mahmood, Ullah, Ali and Ullah25]. However, the main variables among the unknown environments are friction and stiffness rather than obstacles. In ref. [Reference Shi, Dear and Kelly26], RL is used to derive efficient and novel gaits for a terrestrial and aquatic 3-link snake robot. The strategy is verified, but the robot mechanism is simpler than that of typical multi-link snake robots.

In ref. [Reference Cao, Xiao, Huang and Zhou27], path following of snake robots is investigated by using approximate dynamic programming and neural networks. A path integral (PI) algorithm is suitable for motion planning due to its effective learning and stable numerical solution [Reference Theodorou, Buchli and Schaal28]. It has been used in learning tasks of different robots [Reference Yamamoto, Ariizumi, Hayakawa and Matsuno29–Reference Fang, Zhu and Guo31]. An extension of a PI policy improvement algorithm is proposed to adjust critical parameters automatically [Reference Yamamoto, Ariizumi, Hayakawa and Matsuno29]. The algorithm is tested on three different types of simulated legged robots. In ref. [Reference Chatterjee, Nachstedt, Worgotter, Tamosiunaite, Manoonpong, Enomoto, Ariizumi and Matsuno30], policy improvement with PI is used to generate goal-directed locomotion of a snake robot with screw-drive units. In ref. [Reference Fang, Zhu and Guo31], PI is used to compute gait parameters of a snake robot. However, factors such as the arrival time and distance are not considered in these studies.

To plan passable paths and follow them without depending on accurate models, this work designs a motion planning framework for snake robots in multi-obstacle environments based on an RL algorithm. To plan a short collision-free path, we use a multi-objective fusion adaptive reward mechanism based on a DRL algorithm to design its loss function. Besides, weights of the neural network are updated by an experience replay technique with a gradient back-propagation method. To control a snake robot to follow the planned paths smoothly, we propose a path smoothing algorithm that combines a Floyd algorithm [Reference Zhao, Wu, Gui, Xu and Xie32] and a moving average (MA) algorithm. Gait parameters of snake robots are computed by a PI algorithm. A parallel training strategy is designed to explore the control parameter space of global paths efficiently.

The remainder of this article is organized as follows. A model and symbols of snake robots are given in Section 2. A DRL-based path planning algorithm and a path smoothing strategy are proposed in Section 3. A goal-oriented motion planning method based on PI is proposed in Section 4. Simulations are conducted in Section 5 to demonstrate successful path planning and movement through multi-obstacle environments. Conclusions are drawn in Section 6.

Table I. Mathematical symbols.

Figure 1. Kinematic parameters of a $n$ -link snake robot.

2. Mathematical model of snake robots

To accurately control snake robots, we derive a kinematic model of a 7-link snake robot. Its parameters and mathematical symbols are shown in Fig. 1 and explained in Table I. The global frame position of the robot’s CM is calculated as:

(1) \begin{equation} \boldsymbol{p}={ \left [ \begin{array}{c} p_{x}\\[4pt] p_{y}\\[4pt] \end{array} \right ]}={ \left [\begin{array}{c} \dfrac{1}{nm}\sum ^{n}_{i=1}mx_{i}\\[14pt] \dfrac{1}{nm}\sum ^{n}_{i=1}my_{i}\\[4pt] \end{array} \right ]}={\frac{1}{n}E \left [ \begin{array}{c} \boldsymbol{X}\\[4pt] \boldsymbol{Y}\\[4pt] \end{array} \right ]}, \end{equation}

where $\boldsymbol{E}$ is a transition matrix defined as:

(2) \begin{equation} E={ \left [ \begin{array}{c@{\quad}c} e^{T} & \textbf{0}_{\textbf{1}\times \boldsymbol{n}}\\[4pt] \textbf{0}_{\textbf{1}\times \boldsymbol{n}} & e^{T} \\[4pt] \end{array} \right ],} \end{equation}

and $\boldsymbol{e}=[1,1,\ldots,1]^{T}\in R^{n}$ . Links must comply with the following constraints:

(3) \begin{equation} x_{i+1}-x_i={ l\cos \theta _i+l\cos \theta _{i+1}}, \end{equation}
(4) \begin{equation} y_{i+1}-y_i={ l\sin \theta _{i}+l\sin \theta _{i+1}}. \end{equation}

Constraints of all links are as follows:

(5) \begin{equation} D\boldsymbol{X}+lA\cos \boldsymbol{\theta }=\boldsymbol{0}, \end{equation}
(6) \begin{equation} D\boldsymbol{Y}+lA\sin \boldsymbol{\theta }=\boldsymbol{0}, \end{equation}

where

(7) \begin{equation} A={ \left [ \begin{array}{c@{\quad}c@{\quad}c@{\quad}c@{\quad}c} 1 & 1 & 0 & \cdots & 0 \\[4pt] 0 & 1 & 1 & \cdots & 0 \\[4pt] \vdots & \vdots & \vdots & \ddots & \vdots \\[4pt] 0 & 0 & \cdots \ & 1 & 1 \end{array} \right ],} \end{equation}
(8) \begin{equation} D={ \left [ \begin{array}{c@{\quad}c@{\quad}c@{\quad}c@{\quad}c} 1 & -1 & 0 & \cdots & 0 \\[4pt] 0 & 1 & -1 & \cdots & 0 \\[4pt] \vdots & \vdots & \vdots & \ddots & \vdots \\[4pt] 0 & 0 & \cdots \ & 1 & -1 \end{array} \right ].} \end{equation}

Velocities of all links can be computed as:

(9) \begin{equation} \begin{split} \dot{\boldsymbol{X}} =lK^T \boldsymbol{S}_{\theta } \dot{\boldsymbol{\theta }}+ \boldsymbol{e}\dot p_x \end{split}, \end{equation}
(10) \begin{equation} \begin{split} \dot{\boldsymbol{Y}}=-lK^T \boldsymbol{C}_{\theta } \dot{\boldsymbol{\theta }}+ \boldsymbol{e}\dot p_y \end{split}, \end{equation}

where $K=A^T(DD^T)^{-1}D$ and $T=\left[D^{T}, \dfrac{1}{n}\boldsymbol{e}\right]^{T}\in R^{n\times n}$ .
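To make the constraint structure concrete, the sketch below assembles $A$, $D$, and $E$ from (7), (8), and (2), and solves the link constraints (5)-(6) together with the CM definition (1) for the link positions. The link half-length, the joint angles, and the CM position in the example are illustrative values, not settings taken from this work.

```python
import numpy as np

def constraint_matrices(n):
    """Build A, D ((n-1) x n) and E (2 x 2n) following Eqs. (7), (8), and (2)."""
    A = np.zeros((n - 1, n))
    D = np.zeros((n - 1, n))
    for i in range(n - 1):
        A[i, i], A[i, i + 1] = 1.0, 1.0    # sums over neighbouring links
        D[i, i], D[i, i + 1] = 1.0, -1.0   # differences over neighbouring links
    e = np.ones(n)
    E = np.block([[e, np.zeros(n)], [np.zeros(n), e]])
    return A, D, E

def link_positions(theta, l, p_cm):
    """Solve Eqs. (5)-(6) together with the CM definition (1) for X and Y."""
    n = theta.size
    A, D, _ = constraint_matrices(n)
    e = np.ones(n)
    M = np.vstack([D, e / n])          # n-1 rigid-link rows plus one CM row
    X = np.linalg.solve(M, np.append(-l * A @ np.cos(theta), p_cm[0]))
    Y = np.linalg.solve(M, np.append(-l * A @ np.sin(theta), p_cm[1]))
    return X, Y

# Illustrative 7-link example with half-length l = 0.1 m (values assumed).
theta = np.linspace(0.0, 0.6, 7)
X, Y = link_positions(theta, l=0.1, p_cm=(0.0, 0.0))
_, _, E = constraint_matrices(7)
print((1.0 / 7) * E @ np.concatenate([X, Y]))   # recovers p_cm via Eq. (1)
```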

3. Path planning of snake robots

Path planning is important for snake robots moving in multi-obstacle environments. Compared with traditional path planning methods, a deep Q network (DQN) adapts better to different environments, since an end-to-end mapping between passable paths and environments can be established. Therefore, we use a DQN-based path planning method to improve snake robots’ adaptability in complex environments.

3.1. Path planning based on DQN

It is well known that RL can be used by agents to maximize their cumulative rewards in unknown environments by exploring and exploiting collected experiences. In RL, exploration conflicts with exploitation. During exploration, an agent experiments with novel strategies that may improve returns only after long exploration. During exploitation, an agent focuses on maximizing rewards through known actions. It is difficult to balance them, especially when an environment contains a huge state space or sparse rewards. An excessively large exploration rate may lead to endless exploration of unrewarding areas and thus sacrifices too much short-term benefit. An excessively large exploitation rate may cause convergence to suboptimal behaviors while ignoring regions of the action space with large benefits.

Exploration and exploitation can be balanced explicitly by using an $\varepsilon$ -greedy policy with an exploration rate $\varepsilon \in (0,1)$ . With probability $\varepsilon$ , the snake robot chooses the action corresponding to the maximum $Q$ value; among actions with the same $Q$ value, the robot selects one at random. With probability $1-\varepsilon$ , the robot randomly selects an action from the action set.
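A minimal sketch of this selection rule follows, with $\varepsilon$ interpreted as the probability of taking the greedy action as described above; the $Q$-value vector is assumed to come from the trained network.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """Action selection as described above: with probability epsilon the action
    with the largest Q value is chosen (ties broken uniformly at random);
    otherwise an action is drawn uniformly from the whole action set."""
    q_values = np.asarray(q_values)
    if rng.random() < epsilon:
        best = np.flatnonzero(q_values == q_values.max())
        return int(rng.choice(best))            # greedy, random tie-break
    return int(rng.integers(q_values.size))     # uniform random exploration

# Example with a hypothetical four-action set (e.g. grid moves in the workspace).
action = epsilon_greedy([0.1, 0.7, 0.7, -0.2], epsilon=0.9)
```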

The final strategy and convergence rate of the agent are determined by a reward function. In this work, the reward function is defined as:

(11) \begin{equation} r= \left \{ \begin{array}{l@{\quad}r} r_{a}, \quad collision\\[4pt] r_{b}, \quad collisionless\\[4pt] r_{c}, \quad others\\[4pt] \end{array}\!\!\!\!, \right. \end{equation}

where $r_{a}$ is a negative constant, while $r_{b}$ and $r_{c}$ are positive constants.
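A minimal sketch of (11) is given below. How collisions and goal arrival are detected is implementation-specific, and the numerical constants are illustrative assumptions rather than the values used in this work; "collisionless" is read here as reaching the goal point without collision.

```python
def step_reward(collided, reached_goal, r_a=-10.0, r_b=10.0, r_c=1.0):
    """Reward of Eq. (11): r_a penalizes a collision, r_b rewards a
    collision-free arrival at the goal, and r_c is returned otherwise.
    The constant values are illustrative, not taken from the paper."""
    if collided:
        return r_a
    if reached_goal:
        return r_b
    return r_c
```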

The pseudocode of the proposed DQN algorithm is shown in Algorithm 1. The algorithm has two key modules. The first is a biology-inspired experience replay mechanism used to randomize the data. With this mechanism, correlations in the observation sequence are reduced and changes in the data distribution are smoothed. The rewards and states obtained from each interaction are stored and used to update $Q_{\theta }$ .

Algorithm 1: The DQN algorithm

With the experience replay mechanism, the error between $Q_{\theta }$ and $\tilde{Q}$ is used to update the parameters $\boldsymbol{\omega }$ of the neural network via gradient back-propagation. An approximate $Q$ value and a specific strategy are obtained once $\boldsymbol{\omega }$ converges.

For the second module, an iterative strategy is used to update action values toward target values periodically. Correlations between them are reduced by this strategy. A snake robot selects and executes action $a_{t}$ from the action set based on the historical experiences stored in a replay buffer and a specific strategy in the current state $s_{t}$ . After executing action $a_{t}$ , the robot switches to a new state $s_{t+1}$ and obtains a reward $r_{t+1}$ .

The snake robot repeats a process of exploration, learning, decision, and utilization and gains experience by interacting with environments. The experience replay technique is used to update its knowledge, that is, stored transitions serve as historical experience for selecting the next action. The flow diagram of the algorithm is shown in Fig. 2. Its parameters are explained in Table II.
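A compact PyTorch-style sketch of one training step combining the two modules above (uniform replay sampling and a periodically synchronized target network) is shown below. The network architecture, optimizer, and hyperparameters are assumptions rather than the settings used in this work.

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

class ReplayBuffer:
    """Fixed-size experience buffer; uniform sampling breaks the correlation
    between consecutive transitions, as described above."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buf, batch_size)
        s, a, r, s_next, done = (torch.tensor(np.asarray(x), dtype=torch.float32)
                                 for x in zip(*batch))
        return s, a.long(), r, s_next, done

def make_q_net(state_dim, n_actions, hidden=64):
    """Small fully connected Q network (architecture assumed, not the paper's)."""
    return nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_actions))

def dqn_update(q_net, target_net, buffer, optimizer, batch_size=32, gamma=0.99):
    """One gradient step on the squared error between Q_theta and the target Q~."""
    s, a, r, s_next, done = buffer.sample(batch_size)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)                       # Q_theta(s, a)
    with torch.no_grad():
        q_target = r + gamma * (1 - done) * target_net(s_next).max(1).values  # Q~
    loss = nn.functional.mse_loss(q_sa, q_target)
    optimizer.zero_grad()
    loss.backward()            # gradient back-propagation updates the weights omega
    optimizer.step()
    return loss.item()

# Every fixed number of updates the target network is synchronized:
#     target_net.load_state_dict(q_net.state_dict())
```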

3.2. Path smoothing

Due to the simplified action space, the generated paths contain many redundant inflection points, and the planned path points are not smooth enough for snake robots to follow. To solve this problem, we adopt a Floyd-MA algorithm to smooth the planned paths.

Table II. Parameters of the DQN-based path planning algorithm.

Figure 2. The DQN-based path planning algorithm of snake robots.

The Floyd algorithm is used to select proper path points, and MA is used to smooth and interpolate the path between two neighboring path points. Points on the quadratic curve are selected as filtered results. The new value of each center point is computed by averaging its neighboring points, that is,

(12) \begin{equation} y_{s}(i)=\frac{1}{2n_{p}+1}\left(y\left(i+n_{p}\right)+y\left(i+n_{p}-1\right)+\ldots +y\left(i-n_{p}\right)\right), \end{equation}

where $n_{p}$ is the number of neighboring path points on each side of point $i$ . Pseudocode of the Floyd method is shown in Algorithm 2. Its parameters are listed in Table III.
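As a rough illustration, the centered moving average of (12) can be applied to the Floyd-filtered waypoints as in the sketch below. Keeping the first and last points fixed is an added assumption so that the start and goal do not drift; the window half-width is an illustrative setting.

```python
import numpy as np

def moving_average_smooth(points, n_p=2):
    """Apply the centered moving average of Eq. (12) to a polyline.

    points: (N, 2) array of Floyd-filtered path points; n_p neighbouring points
    on each side of point i are averaged (window length 2*n_p + 1). The end
    points are left unchanged so the start and goal positions are preserved.
    """
    points = np.asarray(points, dtype=float)
    smoothed = points.copy()
    for i in range(n_p, len(points) - n_p):
        smoothed[i] = points[i - n_p:i + n_p + 1].mean(axis=0)
    return smoothed

# Illustrative waypoints with a sharp corner; n_p = 2 is an assumed setting.
path = np.array([[0, 0], [1, 0], [2, 0], [2, 1], [2, 2], [3, 2], [4, 2]])
print(moving_average_smooth(path, n_p=2))
```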

4. PI-based motion planning of snake robots

Motion planning of multi-DOF snake robots can be defined as a stochastic optimal control problem. Inspired by ref. [Reference Fang, Zhu and Guo31], we use a goal-oriented PI-based motion planning method to control a robot to move along the planned paths in complex environments. To plan goal-oriented motion according to the generated path points, the relationship between stochastic optimal control and PI is analyzed based on the Bellman principle of optimality and a Hamilton–Jacobi–Bellman (HJB) equation. Different from ref. [Reference Fang, Zhu and Guo31], the control difficulty and accuracy of the planned paths are considered in this work. Besides, a parallel training strategy is built to explore the control parameter space of the global path efficiently.

Algorithm 2: The Floyd algorithm

Table III. Parameters of the path smoothing algorithm.

4.1. Stochastic optimal control

Stochastic optimal control aims to optimize goal performance. To evaluate the performance of a trajectory $\tau _{i}$ between $t_{i}$ and $t_{N}$ , a reward function is defined as:

(13) \begin{equation} \begin{split} J(\tau _{i})=\phi [\boldsymbol{x}(t_{N}),t_{N}]+\int ^{t_{N}}_{t_{i}}L[\boldsymbol{x}(t),\boldsymbol{u}(t),t]dt, \end{split} \end{equation}

where $\boldsymbol{x}(t)$ and $\boldsymbol{u}(t)$ are the state and control vector at time $t$ , respectively. Parameters $\phi [\cdot ]$ and $L[\cdot ]$ are a terminal reward and an immediate reward at time $t$ , respectively. The minimum cost function in stochastic optimal control is as follows:

(14) \begin{equation} \begin{split} V_{t_{i}}=\min \limits _{u_{t_{i}}:t_{N}}E_{\tau _{i}}[J(\tau _{i})], \end{split} \end{equation}

where $E_{\tau _{i}}[\cdot ]$ is the expectation over all trajectories starting from state $\boldsymbol{x}_{t_{i}}$ . The stochastic dynamic system is considered as:

(15) \begin{equation} \begin{split} \dot{\boldsymbol{x}_{t}}=\boldsymbol{f}(\boldsymbol{x}_{t},t)+\boldsymbol{G}(\boldsymbol{x}_{t})(\boldsymbol{u}_{t}+\boldsymbol{\varepsilon }_{t})=\boldsymbol{f}_{t}+\boldsymbol{G}_{t}(\boldsymbol{u}_{t}+\boldsymbol{\varepsilon }_{t}), \end{split} \end{equation}

where $\boldsymbol{x}_{t}\in R^{n\times 1}$ is a state vector of the system, $\boldsymbol{G}_{t}=\boldsymbol{G}(\boldsymbol{x}_{t})\in R^{n\times p}$ is a control matrix, $\boldsymbol{f}_{\boldsymbol{t}}=\boldsymbol{f}(x_{\boldsymbol{t}},t)\in R^{n\times 1}$ is the passive dynamics, $\boldsymbol{u}_{t}\in R^{p\times 1}$ is a control vector, and $\boldsymbol{\varepsilon }_{t}\in R^{p\times 1}$ is Gaussian noise with zero mean and variance $\Sigma _{\varepsilon }$ . The immediate reward is defined as:

(16) \begin{equation} \begin{split} L_{t}=L[\boldsymbol{x}(t),\boldsymbol{u}(t),t]=q_{t}+\frac{1}{2}\boldsymbol{u}_{t}^{T}\boldsymbol{R}\boldsymbol{u}_{t}, \end{split} \end{equation}

where $q_{t}=q(\boldsymbol{x}_{t},t)$ is an arbitrary state-dependent cost function, and $\boldsymbol{R}$ is a positive semi-definite weight matrix of the quadratic control cost. The HJB equation of this stochastic optimal control problem is defined as:

(17) \begin{equation} \begin{split} -\partial _{t}V_{t}=\min (L_{t}+(\boldsymbol{\nabla }_{x}V_{t})^{T}\boldsymbol{F}_{t})+\frac{1}{2}trace\left(\left(\boldsymbol{\nabla }_{xx}V_{t}\right)\boldsymbol{G}_{t}\boldsymbol{\Sigma _{\varepsilon }}\boldsymbol{G}_{t}^{T}\right), \end{split} \end{equation}

where $\boldsymbol{\nabla }_{x}$ and $\boldsymbol{\nabla }_{xx}$ are the Jacobian matrix and Hessian matrix, respectively. Substituting the optimal control into (17), we can get a second-order nonlinear partial differential equation (PDE):

(18) \begin{align} -\partial _{t}V_{t} & = q_{t}+(\boldsymbol{\nabla }_{x}V_{t})^{T}f_{t}-\frac{1}{2}(\boldsymbol{\nabla }_{x}V_{t})^{T}\boldsymbol{G}_{t}\boldsymbol{R}^{-1}\boldsymbol{G}_{t}^{T}(\boldsymbol{\nabla }_{x}V_{t})\nonumber\\[4pt]& \quad +\frac{1}{2} trace\left(\left(\boldsymbol{\nabla }_{xx}V_{t}\right)\boldsymbol{G}_{t}\boldsymbol{\Sigma _{\varepsilon }}\boldsymbol{G}_{t}^{T}\right). \end{align}

4.2. Transformation of HJB into a linear PDE

Due to the complexity of the HJB equation, an exponential transformation of the value function is introduced as $V_{t}=-\lambda \log \Psi _{t}$ . A second-order linear PDE can be obtained as:

(19) \begin{equation} -\partial _{t}\Psi _{t}=-\frac{1}{\lambda }q_{t}\Psi _{t}+\boldsymbol{f}_{t}^{T}(\boldsymbol{\nabla }_{x}\Psi _{t})+\frac{1}{2}trace\left(\left(\boldsymbol{\nabla }_{xx}\Psi _{t}\right)\boldsymbol{G}_{t}\Sigma _{\varepsilon }\boldsymbol{G}_{t}^{T}\right),\\[4pt] \end{equation}

where $\Psi _{t_{N}}=\exp ({-}\frac{1}{\lambda }\phi _{t_{N}})$ is the boundary condition, and the analytic solution can be obtained according to the Feynman–Kac theorem:

(20) \begin{equation} \Psi _{t_{i}}=E\left(\Psi _{t_{N}}\exp \left({-}\int _{t_{i}}^{t_{N}}\frac{1}{\lambda }q_{t}dt\right)\right)=E\left(\exp \left({-}\frac{1}{\lambda }\phi _{t_{N}}-\frac{1}{\lambda }\int _{t_{i}}^{t_{N}}q_{t}dt\right)\right).\\[4pt] \end{equation}

In this way, the stochastic optimal control problem is transformed into an approximate PI problem:

(21) \begin{equation} \Psi _{t_{i}}=\lim \limits _{dt\rightarrow 0}\int p(\boldsymbol{\tau}_{\boldsymbol{i}}|\boldsymbol{x}_{i})\exp \left[-\frac{1}{\lambda }\left(\phi _{t_{N}}+\sum \limits _{j=i}^{N-1}q_{t_{j}}dt\right)\right]d\boldsymbol{\tau}_{\boldsymbol{i}}, \end{equation}

where $\boldsymbol{\tau}_{i}=[\boldsymbol{x}_{t_{i}},\boldsymbol{x}_{t_{i+1}},\ldots,\boldsymbol{x}_{t_{N}}]^{T}$ is a sample path which starts from $\boldsymbol{x}_{t_{i}}$ . The probability of $\boldsymbol{\tau}_{\boldsymbol{i}}$ is as follows:

(22) \begin{equation} p(\boldsymbol{\tau}_{\boldsymbol{i}}|\boldsymbol{x}_{i})=\frac{\exp \left({-}\frac{1}{\lambda }S(\boldsymbol{\tau}_{\boldsymbol{i}})\right)}{\int \exp \left({-}\frac{1}{\lambda }S(\boldsymbol{\tau}_{\boldsymbol{i}})\right)d\boldsymbol{\tau}_{\boldsymbol{i}}}, \end{equation}

where $S(\boldsymbol{\tau}_{\boldsymbol{i}})$ denotes the accumulated cost of path $\boldsymbol{\tau}_{\boldsymbol{i}}$ .

The optimal controls can be simplified as:

(23) \begin{equation} \begin{split} \boldsymbol{u}_{t_{i}}=\int p(\boldsymbol{\tau}_{i}|\boldsymbol{x}_{i})\boldsymbol{u}_{L}(\boldsymbol{\tau }_{i})d\boldsymbol{\tau }_{i}. \end{split} \end{equation}

However, the above analytic solutions depend on the system model $\boldsymbol{f}_{t}$ . Using the PI algorithm, the probability of trajectory $\tau _{i,k}$ , that is, the $k$ -th path generated randomly in the start state $\boldsymbol{x}_{t_{i}}$ , is updated as:

(24) \begin{equation} P(\tau _{i,k})=\frac{\exp \left({-}\frac{1}{\lambda }S(\tau _{i,k})\right)}{\sum \limits _{j=1}^{K}\exp \left({-}\frac{1}{\lambda }S\left(\tau _{i,j}\right)\right)}. \end{equation}

The control vector can be iterated as:

(25) \begin{equation} \begin{split} \boldsymbol{u}=\boldsymbol{u}+\sum \limits _{k=1}^{K}P(\tau _{i,k})\frac{\boldsymbol{R}^{-1}\boldsymbol{G}_{t_{i},k}\boldsymbol{G}_{t_{i},k}^{T}}{\boldsymbol{G}_{t_{i},k}^{T}\boldsymbol{R}^{-1}\boldsymbol{G}_{t_{i},k}}\boldsymbol{\varepsilon }_{t_{i},k}. \end{split} \end{equation}

The objective is to find $\boldsymbol{u}_{i}$ that minimizes the cost function corresponding to the trajectory $\tau _{i}$ generated by the control variable:

(26) \begin{equation} \begin{split} V=\min \limits _{u}E_{\tau _{i}}[R(\tau _{i})]=\min \limits _{u}\int R(\tau _{i})P(\tau _{i})d\tau _{i} \end{split}. \end{equation}

The probability of trajectory $\tau _{i}$ is computed as:

(27) \begin{equation} \begin{split} P(\tau _{i})=\frac{S(\tau _{i})}{\sum \limits _{j=1}^{K}S(\tau _{j})}, \end{split} \end{equation}

where $K$ is the number of generated trajectories and $S(\tau _{i})$ is computed as:

(28) \begin{equation} S(\tau _{i})=\exp \left({-}\frac{R(\tau _{i})-\min (\boldsymbol{R})}{\max (\boldsymbol{R})-\min (\boldsymbol{R})}\right), \end{equation}

where $\boldsymbol{R}=[R(\tau _{1}), R(\tau _{2}), \cdots, R(\tau _{K})]$ . The control vector $\boldsymbol{u}^{\ast }$ of the next iteration is as follows:

(29) \begin{equation} \boldsymbol{u}^{\ast }=\sum \limits _{i=1}^{K}P(\tau _{i})\boldsymbol{u}_{i}. \end{equation}
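A minimal numeric sketch of the update rule (27)-(29) is given below. The small constant added to the normalization denominator guards against identical costs and is not part of the formulation above; the example values are purely illustrative.

```python
import numpy as np

def pi_control_update(costs, controls):
    """Probability-weighted control update following Eqs. (27)-(29).

    costs:    length-K array of loss values R(tau_i) for the sampled paths
    controls: (K, d) array of the candidate control vectors u_i
    Returns the weighted control vector u* of Eq. (29).
    """
    R = np.asarray(costs, dtype=float)
    # Eq. (28): exponentiated, min-max normalized cost (low cost -> large weight).
    S = np.exp(-(R - R.min()) / (R.max() - R.min() + 1e-12))
    P = S / S.sum()                    # Eq. (27): path probabilities
    return P @ np.asarray(controls)    # Eq. (29): probability-weighted combination

# Illustrative values for K = 5 sampled paths in a 4-dimensional control space.
rng = np.random.default_rng(0)
print(pi_control_update(costs=[1.8, 0.9, 1.2, 2.5, 1.0],
                        controls=rng.normal(size=(5, 4))))
```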

4.3. Policy improvements with PI

A compound serpenoid curve template is proposed based on research on biological snakes in ref. [Reference Dai, Travers, Dear, Gong and Choset33]. For the lateral undulation gait, snake robots move forward by swinging their yaw joints. Joint angles are defined as:

(30) \begin{equation} \phi _{i,ref} = \alpha _{h}\sin (\omega _{h} t+(i-1)\beta _{h})+\gamma _{h}, \end{equation}

where $\phi _{i,ref}$ is the reference yaw angle of joint $i$ at time $t$ , and $\alpha _{h}$ , $\omega _{h}$ , $\beta _{h}$ , and $\gamma _{h}$ are the amplitude, angular frequency, phase difference, and phase offset, respectively. Gait parameters of the robot can be expressed as a vector $\boldsymbol{U}=[\alpha _{h},\omega _{h},\beta _{h},\gamma _{h}]^{T}$ . Values of the initial and the $\xi$ -th iteration of the gait parameters are denoted as $\boldsymbol{U}_{\textbf{0}}=[\alpha _{0}, \omega _{0}, \beta _{0}, \gamma _{0}]^{T}$ and $\boldsymbol{U}_{\boldsymbol{\xi}}=[\alpha _{\xi },\omega _{\xi },\beta _{\xi },\gamma _{\xi }]^{T}$ , respectively. Each training iteration generates $K$ paths randomly. Path $\tau _{i}$ is determined by $\boldsymbol{u}_{\boldsymbol{i}}=[\varepsilon _{\alpha _{h_{i}}},\varepsilon _{\omega _{h_{i}}},\varepsilon _{\beta _{h_{i}}},\varepsilon _{\gamma _{h_{i}}}]^{T}$ , where $\varepsilon$ obeys a normal distribution with zero mean. The corresponding loss function $R(\tau _{i})$ is expressed as:

(31) \begin{equation} R(\tau _{i})=\sqrt{(x_{0}-x_{g})^{2}+(y_{0}-y_{g})^{2}}. \end{equation}
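To make the gait parameterization concrete, the sketch below evaluates the joint references of (30) and the distance loss of (31). Six yaw joints are assumed for the 7-link robot of Section 2, and $(x_{0}, y_{0})$ is read here as the CM position at the end of a rollout, which is an interpretation rather than a definition given above.

```python
import numpy as np

def joint_references(t, U, n_joints=6):
    """Reference yaw angles of Eq. (30) at time t for joints i = 1..n_joints.

    U = [alpha_h, omega_h, beta_h, gamma_h]; six yaw joints are assumed for
    the 7-link robot of Section 2."""
    alpha, omega, beta, gamma = U
    i = np.arange(1, n_joints + 1)
    return alpha * np.sin(omega * t + (i - 1) * beta) + gamma

def distance_loss(cm_final, goal):
    """Loss of Eq. (31): Euclidean distance between the final CM and the goal."""
    return float(np.hypot(cm_final[0] - goal[0], cm_final[1] - goal[1]))

# Example using the converged first-stage parameters reported in Section 5.2.
U1 = [0.6129, 7.2016, 0.7238, 0.0367]
print(joint_references(t=0.5, U=U1))
print(distance_loss(cm_final=(1.8, 0.9), goal=(2.0, 1.0)))  # illustrative positions
```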

During the goal-oriented movement of multiple path points, the snake robot may fail to track unsmooth paths. In order to make the robot complete the goal-oriented motion, we design a new loss function that combines the deviation of trajectory and moving difficulty:

(32) \begin{equation} R(\tau _{i})=\sqrt{(x_{0}-x_{g})^{2}+(y_{0}-y_{g})^{2}}+\left|\overline{\theta }-\arctan \frac{y_{g+1}-y_{g}}{x_{g+1}-x_{g}}\right|, \end{equation}

where $\bar{\theta }=\frac{1}{N}\sum ^{N}_{i=1}\theta _{i}$ is the robot’s forward direction, with $N$ the number of links. Parameters of path $i$ in the $(\xi +1)$ -th training iteration are as follows:

(33) \begin{equation} {\boldsymbol{U}_{\boldsymbol{\xi} +\textbf{1}}^{\boldsymbol{i}}}=\boldsymbol{U}_{\boldsymbol\xi }+ \boldsymbol{u}_{\boldsymbol{i}}=\left[\alpha _{\xi }+\varepsilon _{\alpha _{i}},\omega _{\xi }+\varepsilon _{\omega _{i}},\beta _{\xi }+\varepsilon _{\beta _{i}},\gamma _{\xi }+\varepsilon _{\gamma _{i}}\right]^{T}.\\[4pt] \end{equation}

Gait parameters after $\xi +1$ iterations are updated as:

(34) \begin{equation} \begin{split} \boldsymbol{U}_{\boldsymbol\xi +\textbf{1}}= \boldsymbol{U}_{\boldsymbol\xi}+ \boldsymbol{u}^{\boldsymbol\ast}. \end{split} \end{equation}

The variance of $\varepsilon$ is related to the loss function value produced by $\boldsymbol{U}_{\boldsymbol\xi}$ after $\xi$ iterations, and the standard deviation in the $(\xi +1)$ -th training iteration is

(35) \begin{equation} \boldsymbol{\sigma }=\left[\sigma _{\alpha },\sigma _{\omega },\sigma _{\beta },\sigma _{\gamma }\right]^{T}=\left[f_{\alpha }(r), f_{\omega }(r), f_{\beta }(r), f_{\gamma }(r)\right]^{T}. \end{equation}
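Putting (33)-(35) together, one training pass over a single path segment could look like the following sketch. The rollout evaluation, the number of sampled paths $K$, and the iteration count are placeholders for the simulator loop and the settings in Table IV; in the parallel phase each path segment would run this loop independently.

```python
import numpy as np

def train_gait_segment(U0, rollout_loss, sigma_schedule, K=10, iters=50, rng=None):
    """Train gait parameters for one path segment with the PI rule above.

    U0:             initial gait parameters [alpha, omega, beta, gamma]
    rollout_loss:   function U -> R(tau), e.g. Eq. (32) evaluated in the simulator
    sigma_schedule: function r -> sigma vector, the exploration law of Eq. (35)
    Each iteration samples K perturbed parameter sets (Eq. (33)), weights them
    with Eqs. (27)-(28), and applies the update of Eqs. (29) and (34).
    """
    rng = rng or np.random.default_rng()
    U = np.asarray(U0, dtype=float)
    r = rollout_loss(U)
    for _ in range(iters):
        sigma = sigma_schedule(r)                        # Eq. (35)
        eps = rng.normal(0.0, sigma, size=(K, U.size))   # zero-mean perturbations
        candidates = U + eps                             # Eq. (33)
        R = np.array([rollout_loss(Ui) for Ui in candidates])
        S = np.exp(-(R - R.min()) / (R.max() - R.min() + 1e-12))  # Eq. (28)
        P = S / S.sum()                                  # Eq. (27)
        U = U + P @ eps                                  # Eqs. (29) and (34)
        r = rollout_loss(U)
    return U
```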

4.4. Closed-loop control framework

To realize the autonomous movement of snake robots in complex environments, this work designs a closed-loop control framework. As shown in Fig. 3, it is composed of path planning, path smoothing, and motion planning modules.

In path planning, actions are selected by a neural network. The robot gains experience by continuously interacting with environments. Weights of the network are updated by a back-propagation algorithm based on an experience replay technique. In path smoothing, the path points generated by the trained strategy are filtered and adjusted by the proposed Floyd-MA path smoothing algorithm.

In motion planning, the resulting reliable path points are used to obtain gait parameters, which are trained in parallel by the PI algorithm to achieve goal-oriented motion. Its pseudocode is shown in Algorithm 3.

Algorithm 3 The Goal-oriented Motion Planning Algorithm Based on PI Method

Figure 3. Global framework for motion planning of snake robots.

5. Simulation

To test the proposed DQN-based path planning and PI-based motion planning algorithms, this work conducts extensive simulations in CoppeliaSim and PyCharm.

5.1. Path planning of snake robots

A virtual snake robot prototype is built in CoppeliaSim, constructed in the same way as in our previous work [Reference Cao, Zhang and Zhou20]. To generate anisotropic friction between the robot and the ground, a pair of passive wheels is mounted on each link. Communication and control of snake robots in CoppeliaSim are implemented through a remote API.
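As a rough illustration of the remote API workflow, the sketch below uses the legacy Python bindings (the `sim` module distributed with CoppeliaSim) to stream serpenoid joint references to the simulated robot. The joint names, port, loop timing, and the choice of gait parameters are assumptions about the scene, not details given in this work.

```python
import math
import time
import sim  # legacy CoppeliaSim remote API bindings (sim.py, simConst.py, remoteApi library)

client = sim.simxStart('127.0.0.1', 19997, True, True, 5000, 5)
assert client != -1, 'CoppeliaSim remote API server not reachable'

# Handles of the yaw joints; the joint names are assumptions about the scene.
joints = []
for i in range(1, 7):
    _, h = sim.simxGetObjectHandle(client, f'snake_joint_{i}', sim.simx_opmode_blocking)
    joints.append(h)

# Stream the serpenoid references of Eq. (30); the parameters are the converged
# first-stage values of Section 5.2, used here purely for illustration.
alpha, omega, beta, gamma = 0.6129, 7.2016, 0.7238, 0.0367
t = 0.0
while t < 10.0:
    for i, h in enumerate(joints):
        phi = alpha * math.sin(omega * t + i * beta) + gamma
        sim.simxSetJointTargetPosition(client, h, phi, sim.simx_opmode_oneshot)
    time.sleep(0.05)
    t += 0.05

sim.simxFinish(client)
```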

Obstacles are represented by rectangles with arbitrary poses. They are dilated on each side by the maximum gait amplitude set in the optimization problem. Parameters of the neural network and the path smoothing algorithm are listed in Tables II and III, respectively. Simulation results are exhibited to show the process of motion planning.

As shown in Fig. 4(a), obstacles (red and yellow) are dilated by safety margins (green). The start position of a snake robot is the lower right corner. The target position is the upper left corner. As shown in Fig. 4(b), a passable path in a complex environment is generated by using DQN, but there exist collinear path points and excessive curvature changes between neighboring points. As shown in Fig. 4(c), redundant collinear path points and points with excessive curvature changes are deleted by the Floyd algorithm. As shown in Fig. 4(d), MA improves the path smoothness and reduces the difficulty of path following.

Table IV. Parameters of the PI motion planning algorithm.

Figure 4. Path planning of a snake robot. (a) Simulation environment; (b) path planning; (c) path point filtering; and (d) path smoothing.

Figure 5. Performances of PI in goal-oriented motion of the snake robot. (a) The first stage. (b) The second stage. (c) The third stage.

Figure 6. Values of gait control parameters. (a) The first stage. (b) The second stage. (c) The third stage.

5.2. Motion planning of snake robots

To verify the feasibility of the proposed PI-based motion planning algorithm, we conduct extensive simulations in CoppeliaSim. The PI method is used to train three path segments in parallel as prior knowledge, and then the gait parameter spaces on the global path are explored serially to realize an end-to-end mapping between the planned path points and gait parameters. Parameters used in motion planning are listed in Table IV. The standard deviation is computed as:

(36) \begin{equation} \sigma =f(r)=\left \{ \begin{array}{l@{\quad}r} \sigma _{0}, \quad \quad \,\,\,\, r\geq 0.5 \\[4pt] r\sigma _{0}/2, \quad 0.2\leq r\lt 0.5\\[4pt] r\sigma _{0}/4, \quad 0.05\leq r\lt 0.2\\[4pt] r\sigma _{0}/8, \quad r\lt 0.05 \end{array}. \right. \end{equation}
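A direct transcription of this schedule is shown below; it can serve as the exploration law passed to the segment-training sketch in Section 4.3. The initial value $\sigma_{0}$ is presumed to come from Table IV and is supplied by the caller.

```python
import numpy as np

def sigma_schedule(r, sigma0):
    """Exploration standard deviation of Eq. (36): kept large while the loss r
    is large and shrunk as the gait parameters approach a good solution."""
    sigma0 = np.asarray(sigma0, dtype=float)
    if r >= 0.5:
        return sigma0
    if r >= 0.2:
        return r * sigma0 / 2
    if r >= 0.05:
        return r * sigma0 / 4
    return r * sigma0 / 8
```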

As shown in Fig. 5(a)-(c), the loss functions of the first to the third stage converge to stable states. Fig. 6(a)-(c) shows the corresponding curves of the gait equation parameters as the iteration count increases. These curves converge quickly from their initial states to stable values. After training, the gait parameters in the first stage converge to $\boldsymbol{U}^{(1)}=[0.6129, 7.2016, 0.7238, 0.0367]^{T}$ as shown in Fig. 6(a). The gait parameters in the second stage converge to $\boldsymbol{U}^{(2)}=[0.5869, 6.5121, 0.7915, 0.0227]^{T}$ as shown in Fig. 6(b). The gait parameters in the third stage converge to $\boldsymbol{U}^{(3)}=[0.5914, 6.5964, 0.8163, 0.0381]^{T}$ as shown in Fig. 6(c).

The trajectory of the robot’s CM is shown in Fig. 7. It shows that the snake robot moves smoothly toward the target in a multi-obstacle environment when controlled by the proposed path planning and motion planning methods.

Figure 7. A snake robot moves in a multi-obstacle environment controlled by the PI-based motion planning algorithm.

6. Conclusion

This work proposes an RL-based motion planning framework for snake robots. A DQN method is used to plan a passable path in multi-obstacle environments. A multi-objective fusion adaptive reward mechanism is designed for the method, and the weights of the DQN are updated based on an experience replay technique. A Floyd-MA path smoothing algorithm is proposed to smooth the planned path. To control snake robots to move along the planned paths, a PI algorithm is used to compute gait parameters. Several local paths are trained in parallel to gather experience based on the PI method and to minimize the global deviation from the planned path. Simulation results show that the proposed motion planning algorithm can control snake robots to move autonomously in multi-obstacle environments.

Our future work intends to investigate motion control methods for snake robots that combine path planning, path following, and environmental perception [Reference Cao, Li, Zhang, Zhou and Abusorrah34, Reference Xie, Zhang, Wang, Zhou, Cao, Hu and Abusorrah35] in complex environments with dynamic obstacles.

Author contributions

Dong Zhang, Renjie Ju, and Zhengcai Cao conceived and designed the study and wrote the article.

Financial support

This work is supported by the National Natural Science Foundation of China (52105005), the China Postdoctoral Science Foundation (2021M690320), and State Key Laboratory of Robotics and System (SKLRS-2022-KF-15).

Competing interests

The authors declare no competing interests exist.

Ethical approval

The authors declare none.

References

Liljebäck, P., Pettersen, K. Y., Stavdahl, Ø. and Gravdahl, J. T., "A review on modelling, implementation, and control of snake robots," Robot Auton Syst 60(1), 29–40 (2012).
Hopkins, J. K. and Gupta, S. K., "Design and modeling of a new drive system and exaggerated rectilinear-gait for a snake-inspired robot," J Mech Robot 6(2), 021001 (2014).
Hirose, S., Biologically Inspired Robots: Snake-Like Locomotors and Manipulators (Oxford University Press, England, 1993).
Tanaka, M. and Tanaka, K., "Control of a snake robot for ascending and descending steps," IEEE Trans Robot 31(2), 511–520 (2015).
Wang, G., Yang, W., Shen, Y. and Shao, H., "Adaptive Path Following of Snake Robot on Ground With Unknown and Varied Friction Coefficients," In: IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain (IEEE, 2018) pp. 7583–7588.
Singh, A., Gong, C. and Choset, H., "Modelling and Path Planning of Snake Robot in Cluttered Environment," In: 2018 International Conference on Reconfigurable Mechanisms and Robots, Delft, Netherlands (IEEE, 2018) pp. 1–6.
Liljebäck, P., Pettersen, K. Y., Stavdahl, Ø. and Gravdahl, J. T., "Hybrid modelling and control of obstacle-aided snake robot locomotion," IEEE Trans Robot 26(5), 781–799 (2010).
Liljebäck, P., Pettersen, K. Y., Stavdahl, Ø. and Gravdahl, J. T., Snake Robots: Modelling, Mechatronics, and Control: Advances in Industrial Control (Springer Verlag, London, 2012) pp. 8–17.
Li, D., Zhang, B., Li, P., Wu, E. Q., Law, R., Xu, X., Song, A. and Zhu, L., "Parameter estimation and anti-sideslip line-of-sight method-based adaptive path-following controller for a multi-joint snake robot," IEEE Trans Syst Man Cyber Syst 53(8), 4776–4788 (2023).
Li, D., Zhang, B., Law, R., Wu, E. Q. and Xu, X., "Error-constrained formation path-following method with disturbance elimination for multi-snake robots," IEEE Trans Ind Electron 71(5), 4987–4998 (2024). doi: 10.1109/TIE.2023.3288202.
Takanashi, T., Nakajima, M., Takemori, T. and Tanaka, M., "Obstacle-aided locomotion of a snake robot using piecewise helixes," IEEE Robot Auto Lett 7(4), 10542–10549 (2022).
Liu, H. and Dai, J., "An approach to carton-folding trajectory planning using dual robotic fingers," Robot Auton Syst 42(1), 47–63 (2003).
Hatton, R., Knepper, R., Choset, H., Rollinson, D., Gong, C. and Galceran, E., "Snakes on a Plan: Toward Combining Planning and Control," In: 2013 IEEE International Conference on Robotics and Automation, Karlsruhe, Germany (IEEE, 2013) pp. 5174–5181.
Alexander, H. C., Nak-seung, P. H., Erik, I. V. and Patricio, A. V., "Optimal Trajectory Planning and Feedback Control of Lateral Undulation in Snake-Like Robots," In: Annual American Control Conference (ACC), Milwaukee, WI, USA (IEEE, 2018) pp. 2114–2120.
Rezapour, E., Pettersen, K. Y., Liljebäck, P. and Gravdahl, J. T., "Path Following Control of Planar Snake Robots Using Virtual Holonomic Constraints," In: IEEE International Conference on Robotics and Biomimetics, Shenzhen, China (IEEE, 2013) pp. 530–537.
Fukushima, H., Yanagiya, T., Ota, Y., Katsumoto, M. and Matsuno, F., "Model predictive path-following control of snake robots using an averaged model," IEEE Trans Contr Syst Tech 29(6), 2444–2456 (2021).
Rezapour, E. and Liljebäck, P., "Path following control of a planar snake robot with an exponentially stabilizing joint control law," IFAC Proceed Vol 46(10), 28–35 (2013).
Matsuno, F. and Sato, H., "Trajectory Tracking Control of Snake Robots Based on Dynamic Model," In: IEEE International Conference on Robotics and Automation, Barcelona, Spain (IEEE, 2005) pp. 3029–3034.
Li, D., Pan, Z., Deng, H. and Hu, L., "Adaptive path following controller of a multijoint snake robot based on the improved serpenoid curve," IEEE Trans Ind Electron 69(4), 3831–3842 (2022).
Cao, Z., Zhang, D. and Zhou, M., "Direction control and adaptive path following of 3-D snake-like robot motion," IEEE Trans Cyber 52(10), 10980–10987 (2022).
Liljebäck, P., Pettersen, K. Y., Stavdahl, Ø. and Gravdahl, J. T., "Controllability and stability analysis of planar snake robot locomotion," IEEE Trans Automat Contr 56(6), 1365–1380 (2011).
Liu, Y. and Farimani, A., "An energy-saving snake locomotion gait policy using deep reinforcement learning," arXiv preprint (2021). doi: 10.48550/arXiv.2103.04511.
Qiu, K., Zhang, H., Lv, Y., Wang, Y., Zhou, C. and Xiong, R., "Reinforcement Learning of Serpentine Locomotion for a Snake Robot," In: IEEE International Conference on Real-time Computing and Robotics, Xining, China (IEEE, 2021) pp. 468–473.
Jiang, Z., Otto, R., Bing, Z., Huang, K. and Knoll, A., "Target Tracking Control of a Wheel-Less Snake Robot Based on a Supervised Multi-Layered SNN," In: IEEE/RAS International Conference on Humanoid Robots, Las Vegas, NV, USA (IEEE, 2020) pp. 7124–7130.
Khan, S., Mahmood, T., Ullah, S., Ali, K. and Ullah, A., "Motion Planning for a Snake Robot Using Double Deep Q-Learning," In: International Conference on Artificial Intelligence, Islamabad, Pakistan (IEEE, 2021) pp. 264–230.
Shi, J., Dear, T. and Kelly, S., "Deep reinforcement learning for snake robot locomotion," IFAC-PapersOnLine 53(2), 9688–9695 (2020).
Cao, Z., Xiao, Q., Huang, R. and Zhou, M., "Robust neuro-optimal control of underactuated snake robots with experience replay," IEEE Trans Neur Net Lear Syst 29(1), 208–217 (2018).
Theodorou, E., Buchli, J. and Schaal, S., "A generalized path integral control approach to reinforcement learning," J Mach Learn Res 11(104), 3137–3181 (2010).
Yamamoto, K., Ariizumi, R., Hayakawa, T. and Matsuno, F., "Path integral policy improvement with population adaptation," IEEE Trans Cyber 52(1), 312–322 (2022).
Chatterjee, S., Nachstedt, T., Worgotter, F., Tamosiunaite, M., Manoonpong, P., Enomoto, Y., Ariizumi, R. and Matsuno, F., "Reinforcement Learning Approach to Generate Goal-Directed Locomotion of a Snake-like Robot with Screw-Drive Units," In: International Conference on Robotics in Alpe-Adria-Danube Region, Smolenice, Slovakia (IEEE, 2014) pp. 1–7.
Fang, Y., Zhu, W. and Guo, X., "Target directed locomotion of a snake like robot based on path integral reinforcement learning," Patt Recog Art Intell 32(1), 1–9 (2019).
Zhao, Y., Wu, G., Gui, F., Xu, C. and Xie, Z., "Optimal coordination path selecting method for conduction transformation based on floyd algorithm," Procedia Comput Sci 162, 227–234 (2019).
Dai, J., Travers, M., Dear, T., Gong, C. and Choset, H., "Robot-Inspired Biology: The Compound-Wave Control Template," In: IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA (IEEE, 2015) pp. 5879–5884.
Cao, Z., Li, J., Zhang, D., Zhou, M. and Abusorrah, A., "A multi-object tracking algorithm with center-based feature extraction and occlusion handling," IEEE Trans Intell Transp 24(4), 4464–4473 (2023).
Xie, H., Zhang, D., Wang, J., Zhou, M., Cao, Z., Hu, X. and Abusorrah, A., "Semi-direct multimap SLAM system for real-time sparse 3-D map reconstruction," IEEE Trans Instrum Meas 72(4502013), 1–13 (2023).