1. Introduction
In recent years, there has been growing research interest in agile, high-speed mobile robots designed for rugged or narrow terrain [Reference Rubio, Valero and Llopis-Albert1–Reference Huang, Zhang, Ri, Xiong, Li and Kang4]. Among these, bicycle robots have emerged as a promising platform due to their ability to achieve high-speed locomotion and agile manoeuvres on varied terrains. Reaction wheel bicycle robots (RWBR) are bicycle robots that rely on reaction wheels as auxiliary balancing mechanisms. Compared with other auxiliary balancing mechanisms, such as control moment gyroscopes [Reference Beznos, Formal’sky, Gurfinkel, Jicharev, Lensky, Savitsky and Tchesalin5, Reference Chen, Chu and Zhang6] and mass pendulums [Reference Keo and Yamakita7, Reference He, Deng, Wang, Sun, Sun and Chen8], reaction wheels offer advantages such as simple mechanical design and rapid response [Reference Kanjanawanishkul9, Reference Wang, Cui, Lai, Yang, Chen, Zheng, Zhang and Jiang10].
Previous studies have investigated various strategies for RWBR balancing control. A proportional-integral-derivative (PID) controller was designed to stabilise the roll angle [Reference Kim, An, Yoo and Lee11]. A linear quadratic regulator (LQR) was used to achieve balancing control based on a linearisation around the equilibrium point [Reference Xiong, Huang, Gu, Pan, Liu, Li and Wang12]. The control of RWBR presents significant challenges, particularly in dealing with inherent uncertainties and disturbances. Traditional control methods often struggle to address these complexities effectively, leading to suboptimal performance and limited adaptability. To address these problems, various robust control strategies have been proposed to balance the RWBR, such as robust LQR [Reference Owczarkowski, Horla and Zietkiewicz13] and disturbance observers [Reference Jeong and Chwa14]. Sliding mode control (SMC) has an excellent ability to deal with uncertainties [Reference Tuan and Ha15–Reference Behera, Bandyopadhyay, Cucuzzella, Ferrara and Yu17] and has been developed for balancing control of RWBR [Reference Guo, Liao and Wei18–Reference Chen, Yan, Wang, Shao, Kurniawan and Wang20]. However, the robustness of the sliding mode controller to uncertainties typically comes at the cost of conservative control performance. This trade-off between robustness and control performance remains an open problem.
Many researchers have been striving to combine SMC with other methods to tackle this challenge, such as fuzzy control [Reference Guo, Liao and Wei18], adaptive control [Reference Chen, Liu, Wang, Hu, Zheng, Ye and Zhang21] and reinforcement learning [Reference Zhu, Deng, Zheng, Zheng, Liang and Liu22–Reference Huo, Yu, Liu and Sha24]. A fuzzy sliding mode controller was designed to deal with impulse disturbance and system uncertainty in [Reference Guo, Liao and Wei18], but the determination of fuzzy rules was rather complicated. In [Reference Chen, Liu, Wang, Hu, Zheng, Ye and Zhang21], an adaptive sliding mode controller was proposed, which dynamically adjusts the parameters of the sliding mode controller to optimise the control performance. However, this approach only makes monotonic adjustments in certain scenarios, which may lead to excessively high gains and more severe chattering. Our previous work confirmed that reinforcement learning can improve the control performance of SMC online [Reference Zhu, Deng, Zheng, Zheng, Liang and Liu22, Reference Zhu, Deng, Zheng, Zheng, Chen, Liang and Liu23], but this combination cannot provide a sufficient theoretical stability guarantee.
The adaptive dynamic programming (ADP) algorithm, a reinforcement learning technique, has been used to address various optimal control problems [Reference Guo, Lin, Jiang, Song and Gan25–Reference Liu, Xue, Zhao, Luo and Wei29]. It not only improves control performance while maintaining robustness but also provides a theoretical stability guarantee. A linear controller with an offline ADP algorithm was proposed to balance a bicycle robot in [Reference Guo, Lin, Jiang, Song and Gan25]. An online ADP algorithm was studied for the optimal control problem with known dynamics in [Reference Vamvoudakis and Lewis28]. Ref. [Reference Ma, Zhang, Xu, Yang and Wu26] proposed a method that uses ADP to adjust a sliding mode controller online to optimise the trajectory tracking of mobile robots. However, its online optimisation was based on predicting the states of the nominal model, which greatly limits its applicability under uncertainty. In order to utilise online data directly for ADP solutions, researchers have conducted a significant amount of work, which has led to the development of two main methods. One uses a model fitted from online data for online prediction [Reference Bhasin, Kamalapurkar, Johnson, Vamvoudakis, Lewis and Dixon30]. The other uses online data directly to optimise the controller, including integral reinforcement learning [Reference Vamvoudakis, Vrabie and Lewis31] and robust adaptive dynamic programming (RADP) [Reference Zhu and Zhao27].
To address the above problems, we introduce RADP to optimise terminal sliding mode control (TSMC) online for balancing control of the RWBR. First, the nonlinear dynamics of the RWBR with uncertainties and disturbances are established and the terminal sliding mode controller is designed. Then, the problem of optimising the TSMC with stability constraints is formulated. An online actor-critic-based RADP algorithm is proposed to solve the resulting optimal control problem. The stability and convergence of the proposed control strategy are proven. Algorithm comparisons in simulation demonstrate the advantages of the proposed control strategy, and prototype experiments further validate it. The main contributions of this paper are summarised as follows.
• An online actor-critic-based RADP algorithm with robust self-learning terminal sliding mode control (RS-TSMC) is proposed to optimise the control performance while maintaining the robustness of the balancing controller for the RWBR. The optimisation process is directly based on data collected online, without the need for system dynamics.
• The controller optimisation problem is transformed into solving the Hamilton–Jacobi–Bellman (HJB) equation, and the system output generated by ADP is constrained according to the range of TSMC parameters. Compared to [Reference Ma, Zhang, Xu, Yang and Wu26], this mechanism improves the conditions for solving the constrained HJB equation, providing a more flexible and adaptable strategy for designing control strategies.
• Experimental studies conducted on a simulation platform and on a prototype RWBR, compared with several recently proposed control strategies, show the effectiveness of the algorithm proposed in this paper.
The rest of the paper is organised as follows. The dynamics of the RWBR and the problem formulation are given in Section 2. The online self-learning sliding mode control strategy is proposed, with stability and convergence proofs, in Section 3. In Section 4, various simulation experiments are performed, and the experimental results for an RWBR prototype are presented. The conclusion is given in Section 5. Videos of the simulation and of the RWBR prototype experiments are available at https://github.com/ZhuXianjinGitHub/RSTSMC (accessed on 30 August 2024).
Throughout the paper, $\left \| \cdot \right \|$ denotes the Euclidean norm, $ \mathrm{diag}\left \{ \cdot \right \}$ represents a diagonal matrix, and $ \otimes$ denotes the Kronecker product.
2. Problem formulation
In this section, the dynamic model of RWBR with uncertainty and disturbance is derived. We also introduce the feedback transformation. In addition, a TSMC is designed. Furthermore, the online optimisation problem for this controller is presented.
2.1. Dynamics model of RWBR
Figure 1 presents the prototype of the RWBR, and Figure 2 shows the notation. As shown in Figure 2, the RWBR consists of five parts: a rear wheel, body frame, reaction wheel, handlebar and front wheel (denoted $R$ , $B$ , $W$ , $H$ and $F$ , respectively). The details of the notation are given in Table I.
Following [Reference Zhu, Deng, Zheng, Zheng, Chen, Liang and Liu23], the roll dynamics of the RWBR is presented as follows:
where $J=m_1l_{1}^{2}+m_2l_{2}^{2}+I_1+I_2$ , $M=m_1I_1+m_2I_2$ , $d_1$ and $d_2$ represent unmodelled dynamics and uncertainty.
To make full use of the known dynamics of the system, the dynamic parameters are divided into nominal and uncertain parts.
where $J_N$ , ${I_2}_N$ and $M_N$ are the nominal parameter values, $\overline{\varDelta J}$ , $\overline{\varDelta I_2}$ and $\overline{\varDelta M}$ are the upper bounds of the uncertainties $\varDelta J$ , $\varDelta I_2$ and $\varDelta M$ .
Further, equation (1) can be re-written as
where $d_{1N}=d_1+\varDelta Mg\sin\!(\varphi) -\varDelta J\ddot{\varphi }-\varDelta I_2\ddot{\theta }$ and $d_{2N}=d_2-\varDelta I_2\ddot{\varphi }-\varDelta I_2\ddot{\theta }$ .
2.2. Design of TSMC controller
For the controller design, we first define $\varphi _d$ as the reference roll angle. The signals $\varphi _d$ , $\dot{\varphi }_d$ and $\ddot{\varphi }_d$ can be obtained as shown in our previous work [Reference Zhu, Deng, Zheng, Zheng, Chen, Liang and Liu23]. Based on the Olfati–Saber transformation mentioned in [Reference Spong, Corke and Lozano33], the following state variables and feedback transformation are defined.
where $x_1=\varphi -\varphi _d$ , $x_2=\dot{\varphi }-\dot{\varphi }_d$ , $u=\frac{I_{2N}}{\left ( J_N-I_{2N} \right )}M_Ng\sin \left ( x_1 \right ) -\ddot{\varphi }_d-\frac{I_{2N}}{\left ( J_N-I_{2N} \right )}\tau$ and $d^*=\frac{I_{2N}}{\left ( J_N-I_{2N} \right )}\left ( d_{1N}-d_{2N} \right )$ .
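For illustration, the following minimal sketch shows how the error states and the inverse of this feedback transformation could be computed, assuming the nominal parameter values listed in Section 4.1 and $g=9.81\,\mathrm{m/s^2}$; it is a sketch under these assumptions, not the paper's implementation.

```python
import numpy as np

# Nominal parameter values from Section 4.1; g = 9.81 m/s^2 is assumed
J_N, I2_N, M_N, g = 0.0368, 0.0035, 0.2544, 9.81

def error_states(phi, dphi, phi_d, dphi_d):
    """Tracking-error states x1 = phi - phi_d and x2 = dphi - dphi_d."""
    return phi - phi_d, dphi - dphi_d

def virtual_input_to_torque(u, x1, ddphi_d):
    """Invert the feedback transformation: given the virtual input u, recover the
    reaction-wheel torque tau from u = a*M_N*g*sin(x1) - ddphi_d - a*tau,
    where a = I2_N / (J_N - I2_N)."""
    a = I2_N / (J_N - I2_N)
    return M_N * g * np.sin(x1) - (u + ddphi_d) / a
```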
Assumption 1. Assume that $d_1$ and $d_2$ are bounded. It follows that $d_{1N}$ and $d_{2N}$ are bounded, and hence it can easily be shown that $d^*$ is bounded. Consider $\left | d^* \right |\lt L$ , where $L$ is an unknown constant.
The sliding mode surface $s$ , the equivalent control $u_{eq}$ and the reaching control $u_r$ of the TSMC are designed according to [Reference Yu, Yu and Zhihong32]. The fractional-order terminal attractor replaces the sign term of the classical sliding mode controller, which helps attenuate chattering.
where $\alpha _i\gt 0$ , $\beta _i\gt 0$ , $q_i$ and $p_i$ $\left ( q_i\lt p_i \right )$ $\left ( i=0,1 \right )$ are positive odd integers.
By selecting appropriate gains, the system will converge to a sufficiently small neighbourhood of the equilibrium in finite time. According to [Reference Yu, Yu and Zhihong32], with $\beta _1=\frac{L}{\left | s^{q_1/p_1} \right |}+\gamma$ and $\gamma \gt 0$ , the sliding mode variable will reach the neighbourhood $\left | s \right |\lt \left ( \frac{L}{\beta _1} \right ) ^{p_1/q_1}$ of the equilibrium in finite time $t_s$ .
Then, defining $\xi _s=\left | \left ( \frac{L}{\beta _1} \right ) ^{p_1/q_1} \right |\lt L'$ , the system state $x_1$ will converge to the sufficiently small neighbourhood $\left | x_1 \right |\lt \left ( \frac{L' }{\beta _0} \right ) ^{p_0/q_0}$ of the system equilibrium in finite time $t_{x_1}$ , with $\beta _0=\frac{L' }{\left | x_{1}^{q_0/p_0} \right |}+\gamma '$ and $\gamma ' \gt 0$ .
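As an illustration of this control structure, a minimal sketch of a fast terminal sliding mode law in the spirit of [Reference Yu, Yu and Zhihong32] is given below; the exact surface and control expressions used in this paper are given by the equations above, so the structure, parameter names and gains here are assumptions for illustration only.

```python
import numpy as np

def frac_pow(x, q, p):
    """Sign-preserving fractional power x^(q/p) for odd integers q < p."""
    return np.sign(x) * np.abs(x) ** (q / p)

def tsmc(x1, x2, k, eps=1e-4):
    """Sketch of a fast terminal sliding mode law (assumed structure):
    surface s, equivalent control u_eq and fractional-order reaching control u_r."""
    s = x2 + k["alpha0"] * x1 + k["beta0"] * frac_pow(x1, k["q0"], k["p0"])
    # equivalent control: cancels the surface dynamics along x2; the fractional
    # derivative term is guarded near x1 = 0 to avoid the usual TSM singularity
    dfrac = (k["q0"] / k["p0"]) * np.abs(x1) ** (k["q0"] / k["p0"] - 1) if abs(x1) > eps else 0.0
    u_eq = -(k["alpha0"] + k["beta0"] * dfrac) * x2
    # reaching control: fractional-order terminal attractor instead of sign(s)
    u_r = -k["alpha1"] * s - k["beta1"] * frac_pow(s, k["q1"], k["p1"])
    return u_eq + u_r, s

# illustrative gains only (q/p with odd q < p)
gains = dict(alpha0=5.0, beta0=1.0, q0=5, p0=7, alpha1=10.0, beta1=2.0, q1=5, p1=7)
u_tsmc, s = tsmc(x1=0.05, x2=-0.1, k=gains)
```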
Remark 1. The parameters $\alpha _1$ and $\beta _1$ influence the reaching process of the sliding mode variable. Larger parameters reduce the convergence time and improve the robustness of the controller to uncertainties, but they increase the burden on the actuator and make the control performance more conservative. In this paper, RADP is introduced to tune the parameters $\alpha _1$ and $\beta _1$ of the TSMC controller (5) online through the constrained adjustments $\kappa =\left [ \varDelta \alpha _1,\varDelta \beta _1 \right ] ^T$ . The main motivation is to improve the control performance while maintaining stability and robustness.
Assumption 2. Assume $\kappa \in \mathcal{K} =\left \{ \kappa \mid \kappa _{i\min }\leqslant \kappa _{i}\leqslant \kappa _{i\max },\ i=1,2 \right \}$ . The set $\mathcal{K}$ is chosen to guarantee finite-time convergence. $\mathcal{K}$ and $L$ can generally be obtained through experiments, and the corresponding stability proof is given in [Reference Ma, Zhang, Xu, Yang and Wu26].
3. Online robust self-learning TSMC
In this section, an online robust self-learning TSMC for RWBR is proposed to improve the control performance and retain the robustness. First, the optimal control problems with stability constraints are formulated. Then, an online actor-critic-based RADP algorithm is designed to approximate the HJB solutions.
Define $u_{adp}$ as the self-learning part of the control; the output of the controller is as follows:
where $\zeta =\left [ s,s^{q_1/p_1} \right ] ^T$ .
Substituting (9) into (4), the system can be written as
where $X=\left [ \begin{array}{c}x_1\\[3pt] x_2\\[3pt] \end{array} \right ]$ , $A=\left [ \begin{matrix}0& 1\\[3pt] 0& 0\\[3pt] \end{matrix} \right ]$ , $B=\left [ \begin{array}{c}0\\[3pt] 1\\[3pt] \end{array} \right ]$ and $D=\left [ \begin{array}{c}0\\[3pt] d^*\\[3pt] \end{array} \right ]$ .
The optimal control problem is solved by minimising the value function $V_c$ to obtain the optimal policy function $u$ . $V_c$ is defined as
where $Q$ is a symmetric positive definite matrix and $r$ is a positive constant. Taking the derivative of (11) along the trajectory of (10), the following Hamiltonian function can be obtained
where $V_{cX}=\frac{\partial V_c}{\partial X}$ . Define $V_{c}^{*}=\underset{U\prime }{\min }\left ( V_c \right )$ to denote the optimal value function, which satisfies
where $V_{cX}^{*}=\frac{\partial V_{c}^{*}}{\partial X}$ . Assuming the minimum of (13) exists and is unique, the optimal control policy $u_{adp}^{*}=\underset{u_{adp}}{\arg \min }\left \{ H \right \}$ can be obtained from $\frac{\partial H}{\partial u_{adp}}=0$ , which is described as
Traditionally, it is difficult to solve (15) directly. The policy iteration algorithm [Reference Sutton RS34] is adopted in traditional ADP to solve it iteratively by the following two steps:
a) given $u^{\left ( i \right )}$ , solve for $V_{c}^{\left ( i \right )}$ using
b) update the control policy using
where $i=1,2,\cdots$ denotes the iteration index. As $i\rightarrow \infty$ , $V_{c}^{\left ( i \right )}\rightarrow V_{c}^{*}$ and $u_{adp}^{\left ( i \right )}\rightarrow u_{adp}^{*}$ .
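For intuition, the following minimal sketch illustrates these two steps for the linear-quadratic special case of (10)–(11), in the form of the classical Kleinman iteration; the weights $Q$ , $r$ and the initial gain are illustrative assumptions, not the values used in this paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Transformed error dynamics (10) and illustrative quadratic cost weights
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
r = 1.0

K = np.array([[1.0, 1.0]])                     # initial stabilising gain, u = -K X
for i in range(20):
    Acl = A - B @ K
    # policy evaluation, cf. step a): solve Acl^T P + P Acl = -(Q + r K^T K)
    P = solve_continuous_lyapunov(Acl.T, -(Q + r * K.T @ K))
    # policy improvement, cf. step b): u^(i+1) = -(1/r) B^T P X
    K_new = (1.0 / r) * B.T @ P
    if np.linalg.norm(K_new - K) < 1e-9:
        break
    K = K_new
```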
It can be seen that the system dynamics are needed in (16) to obtain $\dot{X}$ . When the nominal model deviates from the actual system, optimisation based on the nominal model may be degraded. In this paper, RADP [Reference Zhu and Zhao27] is used to solve the optimal control problem using only data sampled online.
Consider an arbitrary control input $u=u_{tsmc}+u_s$ and differentiate the value function $V_{c}^{\left ( i \right )}$ .
Integrating (18) over an arbitrary interval yields
The closed-loop stability of the system is ensured by (9). $V_{c}^{\left ( i \right )}$ and the improved policy $u_{adp}^{\left ( i+1 \right )}$ can be obtained in a single calculation, without knowledge of the system dynamics.
The value function and the policy function are approximated by neural networks (NNs),
Substituting these into (19) gives
Using the gradient descent method, the updating laws for the weights of the critic NN and the actor NN are as follows:
where $\eta \left ( t \right ) =2\int _{t-T}^t{\left ( \left ( ru_s \right ) \otimes \varphi \left ( x \right ) \right ) d\tau }-\int _{t-T}^t{\left ( \varphi \left ( x \right ) \otimes r \right ) \otimes \varphi \left ( x \right ) d\tau }\mathbf{vec}\left ( \hat{W}_{a}^{T} \right )$ and $m_s = \left(\phi(x_t) -\phi(x_{t-T})\right)^T \left ( \phi \left ( x_t \right ) -\phi \left ( x_{t-T} \right ) \right ) +\eta ^T\eta +1$ ; $m_s$ is used for normalisation.
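The following minimal sketch indicates how such a normalised gradient-descent step could be organised once the regressors have been assembled from data sampled on $[t-T,\ t]$; the learning rates and the exact residual definition are assumptions, not the paper's update laws.

```python
import numpy as np

def radp_update(Wc, Wa, phi_t, phi_tT, eta, residual, lam1=0.5, lam2=0.5):
    """One normalised gradient-descent step for the critic and actor weights.
    phi_t, phi_tT : critic activation vectors phi(x_t) and phi(x_{t-T})
    eta           : integral regressor eta(t) built from samples on [t-T, t]
    residual      : scalar residual of the integral equation for the current weights
    The stacked regressor mirrors rho(t) = [phi(x_t) - phi(x_{t-T}); eta(t)] used in
    the appendix, and m_s normalises the step size."""
    dphi = phi_t - phi_tT
    m_s = dphi @ dphi + eta @ eta + 1.0            # normalisation term m_s
    Wc_new = Wc - lam1 * (dphi / m_s) * residual   # critic step along the residual gradient
    Wa_new = Wa - lam2 * (eta / m_s) * residual    # actor step along the residual gradient
    return Wc_new, Wa_new
```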
Remark 2. The differences between the RS-TSMC proposed in this paper, the self-TSMC (S-TSMC) in [Reference Ma, Zhang, Xu, Yang and Wu26] and the robust-TSMC (R-TSMC) in [Reference Zhu and Zhao27] are as follows. First, the optimisation process in S-TSMC is based on state prediction with the nominal model, which is not conducive to online application of the algorithm. To address this problem, this paper employs an iterative form of RADP to optimise the TSMC using online data. Second, the optimisation in S-TSMC is performed directly on the sliding variable $s$ , which is not exactly equivalent to optimisation over the state $X$ , and the optimisation objective in R-TSMC considers only the $u_{adp}$ part rather than the overall controller output; in RS-TSMC, the optimisation is based directly on the system state and the controller output. Third, the optimisation in S-TSMC is performed provided that the constrained HJB equations have a solution, whereas R-TSMC does not consider the constraints; RS-TSMC first solves the unconstrained HJB problem and subsequently constrains the controller output.
The proposed control strategy schemes are illustrated in Algorithm 1 and Figure 3. The stability and the convergence of the proposed control strategy are given in the Appendix.
4. Simulations and experiments
4.1. Simulations
In order to demonstrate the effectiveness of the RS-TSMC controller proposed in this paper, two cases are built in the simulation platform shown in Figure 4, which was developed in our previous work [Reference Zhu, Deng, Zheng, Zheng, Chen, Liang and Liu23]. Two recently developed methods, S-TSMC [Reference Ma, Zhang, Xu, Yang and Wu26] and R-TSMC [Reference Zhu and Zhao27], are used for comparison. The other simulation factors are the same except for the distinctions mentioned in Remark 2. The RWBR is placed on a curved pavement with white noise. The nominal parameters of the RWBR are $J_N=0.0368$ , ${I_2}_N=0.0035$ and $M_N=0.2544$ . The true parameters used for simulation are $J=0.033$ , $I_2=0.0040$ and $M=0.2742$ . The control period of the controllers is 0.01 s. The other parameters are given as follows:
The activation functions of the critic NN and the actor NN are considered as
An overturning moment $d_2$ is added to the RWBR system. In Case 1,
where $j=\left [ 1, 3, 7, 11, 13, 15 \right ]$ . In Case 2,
To clearly demonstrate the superiority of the proposed method, the criterion $V_c$ defined in (11) is used to quantitatively evaluate the performance, as shown in Table II. As seen in this table, RS-TSMC reduces the criterion by $39.79\%$ in Case 1 and by $15.91\%$ in Case 2 compared with TSMC. Its criterion is also lower than those of the other two recently developed methods (R-TSMC, S-TSMC), which implies that the proposed method achieves better control performance with less control effort. Details of the simulations for the two cases are discussed below.
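For reference, a minimal sketch of how such a criterion could be accumulated from logged trajectories is given below, assuming the quadratic running cost implied by the weights $Q$ and $r$ in (11); the exact integrand and weight values are those of the paper and are not reproduced here.

```python
import numpy as np

def performance_criterion(X_log, u_log, dt, Q, r):
    """Accumulate a quadratic cost X^T Q X + r*u^2 over a logged trajectory
    sampled with period dt (assumed form of the criterion in (11))."""
    return sum((X @ Q @ X + r * u ** 2) * dt for X, u in zip(X_log, u_log))

# hypothetical usage: compare two controllers on the same scenario
# Vc_tsmc   = performance_criterion(X_tsmc, u_tsmc, 0.01, np.eye(2), 1.0)
# Vc_rstsmc = performance_criterion(X_rstsmc, u_rstsmc, 0.01, np.eye(2), 1.0)
# reduction = 100 * (Vc_tsmc - Vc_rstsmc) / Vc_tsmc   # percentage reduction, as in Table II
```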
The simulation results of Case 1 are demonstrated in Figures 5 and 6. Figure 5 gives the norms $\left \| \hat{W}_c \right \|$ and $\left \| \hat{W}_a \right \|$ with respect to time under RS-TSMC. As shown in Figure 5, $\left \| \hat{W}_c \right \|$ converges after 12 s, and $\left \| \hat{W}_a \right \|$ converges after 20 s. Figure 6 gives the states, the control output and $V_c$ of the four methods. As can be seen, the proposed method has the smallest value of $V_c$ among the four controllers. In summary, the control performance of the proposed method (RS-TSMC) outperforms the other three methods, which illustrates its superiority.
Figure 7 gives the norms $\left \| \hat{W}_c \right \|$ and $\left \| \hat{W}_a \right \|$ with respect to time under RS-TSMC in Case 2. The pulse perturbation has a significant effect on $\left \| \hat{W}_c \right \|$ at 10 s. $\left \| \hat{W}_a \right \|$ shows regular changes with the pulse disturbance, indicating the regulating effect of the online learning algorithm on the controller output. Figure 8 illustrates the simulation results of Case 2. Similarly, we can conclude that better control performance is achieved with less control effort by the proposed method in this case.
4.2. Experiments
In this subsection, the RWBR prototype is used to verify the effectiveness of the proposed controller. We present the experimental results of the proposed RS-TSMC controller, together with TSMC, R-TSMC and S-TSMC for performance comparison, as shown in Figure 9. In the experimental studies, the TSMC algorithm runs on an ESP32 control board at 50 Hz and the optimising algorithm runs on a PC at 25 Hz. Wireless data transmission between the ESP32 and the PC is achieved via the UDP communication protocol. Swinging the handlebars is used to generate disturbances for the roll angle control. The other settings are the same as in the simulations.
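As a rough sketch of this split architecture, the PC-side loop could be organised as below; the port number, packet layout and the helper optimiser_step are hypothetical, and only the 50 Hz/25 Hz division of labour over UDP is taken from the description above.

```python
import socket
import struct

def optimiser_step(x1, x2, s, tau):
    """Placeholder for one RADP iteration; returns the TSMC parameter
    adjustments (delta_alpha1, delta_beta1), to be clipped to the set K."""
    return 0.0, 0.0

# PC-side optimiser (about 25 Hz): the ESP32 applies the TSMC locally at 50 Hz,
# streams state packets over UDP, and receives constrained parameter adjustments.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 9000))
sock.settimeout(0.1)

while True:
    try:
        data, addr = sock.recvfrom(64)
    except socket.timeout:
        continue
    x1, x2, s, tau = struct.unpack("<4f", data)           # roll error, rate, sliding variable, torque
    d_alpha1, d_beta1 = optimiser_step(x1, x2, s, tau)
    sock.sendto(struct.pack("<2f", d_alpha1, d_beta1), addr)
```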
Figure 9 demonstrates the experimental results. Within the first 10 s, the $V_c$ of the three optimisation algorithms is slightly higher than that of TSMC, which can also be seen from the curves of $x_1$ , $x_2$ and $\tau$ . The reasons may be as follows: 1) experimental factors such as the initial roll angle and initial roll angular velocity of the RWBR are not completely consistent across experiments; 2) the processing power of the RWBR and the PC is limited. With the iterative optimisation of the controller, it is only after 15 s that the three optimisation algorithms gradually outperform TSMC; the main reason is that the control rate on the RWBR prototype is much lower than that in the simulation. In addition, it is not difficult to see that RS-TSMC outperforms the other two optimisation algorithms throughout almost the entire experiment. The proposed controller (RS-TSMC) reduces the criterion by 21.79 $\%$ relative to TSMC, while R-TSMC and S-TSMC reduce it by about 10 $\%$ . The experimental results further validate the effectiveness and feasibility of the proposed control strategy.
5. Conclusions
This paper proposes an online RS-TSMC with a stability guarantee for balancing control of RWBR under uncertainties, which improves the balancing control performance by optimising the constrained output of the TSMC. Robust adaptive dynamic programming (RADP) is used to optimise the TSMC based only on data sampled online, without the system dynamics. The constraint on the parameters of the sliding mode controller is used to derive the constraint on the control output at each time step, which maintains the stability of the closed-loop system. Experimental studies conducted on a simulation platform and on a prototype RWBR, in comparison with several recently proposed control strategies, show the effectiveness of the proposed algorithm.
Author contributions
Conceptualization and methodology, X.Z. (Xianjin Zhu); software, X.Z. (Xianjin Zhu); validation, W.X., Q.Z., Y.D.; writing – original draft preparation, X.Z. (Xianjin Zhu); writing – review and editing, W.X. and Z.C.; visualisation, Q.Z.; supervision, Y.L.; project administration, Y.L. and B.L.; funding acquisition, Z.C. and Y.D. All authors have read and agreed to the published version of the manuscript.
Financial support
This research was funded by the National Natural Science Foundation of China (62203252, 52205008).
Competing interests
The authors declare that no conflicts of interest exist.
Ethical approval
Not applicable.
Appendix
Define the errors $\tilde{W}_c=W_c-\hat{W}_c$ and $\tilde{W}_a=W_a-\hat{W}_a$ , where $W_c$ and $W_a$ represent the ideal coefficients of $V_{c}^{*}$ and $u_{adp}^{*}$ , and $\varepsilon _c$ and $\varepsilon _a$ are the approximation errors.
According to (19),
Then, substituting $\tilde{W}_c=W_c-\hat{W}_c$ and $\tilde{W}_a=W_a-\hat{W}_a$ into (22) gives
where $\varepsilon _{HJB}=-\left [ \varepsilon _c\left ( t \right ) -\varepsilon _c\left ( t-T \right ) \right ] -\int _{t-T}^t{\left ( 2r\varepsilon _a\left ( u_s-W_{a}^{T}\varphi \left ( X \right ) \right ) -r\varepsilon _{a}^{2} \right ) d\tau }$ .
Define the Lyapunov candidate $L_y=\frac{1}{2\lambda _1}\tilde{W}_{c}^{T}\tilde{W}_c+\frac{1}{2\lambda _2}\tilde{W}_{a}^{T}\tilde{W}_a$ ; its time derivative satisfies
where $\rho \left ( t \right ) =\left [ \phi ^T\left ( x_t \right ) -\phi ^T\left ( x_{t-T} \right ), \eta ^T\left ( t \right ) \right ] ^T$ and $\tilde{W}=\left [ \tilde{W}_{c}^{T},\tilde{W}_{a}^{T} \right ] ^T$ .
Therefore, $\dot{L}_y\leqslant 0$ if $\left \| \frac{\rho \left ( t \right )}{m_s\left ( t \right )}\tilde{W} \right \| \gt \left \| \frac{\varepsilon _{HJB}}{m_s\left ( t \right )} \right \|$ , since $\left \| m_s\left ( t \right ) \right \| \gt 1$ . This provides an effective practical bound on $\left \| \rho \left ( t \right ) \tilde{W} \right \|$ , since $L_y$ decreases. According to Lemma 2 in [Reference Vamvoudakis and Lewis28], $\tilde{W}_c$ and $\tilde{W}_a$ are uniformly ultimately bounded.