
Adaptive reinforcement learning control for a class of missiles with aerodynamic uncertainties and unmodeled dynamics

Published online by Cambridge University Press:  06 July 2023

X. Ning
Affiliation:
National Key Laboratory of Aerospace Flight Dynamics, Northwestern Polytechnical University, Xi’an, China Science and Technology on Electromechanical Dynamic Control Laboratory, Xi’an, China School of Astronautics, Northwestern Polytechnical University, Xi’an, China
S. Cao
Affiliation:
National Key Laboratory of Aerospace Flight Dynamics, Northwestern Polytechnical University, Xi’an, China Science and Technology on Electromechanical Dynamic Control Laboratory, Xi’an, China Xi’an Institute of Electromechanical Information Technology, Xi’an, China
B. Han
Affiliation:
Xi’an Aeronautics Computing Technique Research Institute, Xi’an, China
Z. Wang*
Affiliation:
National Key Laboratory of Aerospace Flight Dynamics, Northwestern Polytechnical University, Xi’an, China Research Center for Unmanned System Strategy Development, Northwestern Polytechnical University, Xi’an, China Northwest Institute of Mechanical and Electrical Engineering, Xianyang, China Unmanned System Research Institute, Northwestern Polytechnical University, Xi’an, China
Y. Yin
Affiliation:
National Key Laboratory of Aerospace Flight Dynamics, Northwestern Polytechnical University, Xi’an, China School of Astronautics, Northwestern Polytechnical University, Xi’an, China
*
Corresponding author: Z. Wang; Email: [email protected]

Abstract

In this paper, a super-twisting disturbance observer (STDO)-based adaptive reinforcement learning control scheme is proposed for the straight air compound missile system with aerodynamic uncertainties and unmodeled dynamics. Firstly, a neural network (NN)-based adaptive reinforcement learning control scheme with an actor-critic design is investigated to deal with the tracking problem of the straight air compound system. The actor NN is utilised to cope with the unmodeled dynamics, and the critic NN is utilised to approximate the cost function, which depends on the control input and the tracking error. In other words, the actor NN performs the tracking control, while the critic NN evaluates the tracking performance and gives feedback to the actor NN. Moreover, with the aid of the STDO, the control signal fluctuation caused by the mismatched disturbance can be effectively suppressed. Based on the proposed adaptive laws and the Lyapunov direct method, the uniform ultimate boundedness of the straight air compound system is proved. Finally, numerical simulations are carried out to demonstrate the feasibility and superiority of the proposed reinforcement learning-based STDO control algorithm.

Type
Research Article
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of Royal Aeronautical Society

Nomenclature

$\alpha $ , $\beta $

angle of attack and sideslip angle

${\omega _z}$ , ${\omega _y}$

angular velocities

${J_z}$ , ${J_y}$

rotational inertia

${\delta _z}$ , ${\delta _y}$

output signals of the elevator and rudder

m

mass of the missile

V

velocity of the missile

S

reference area of the missile

L

reference length of the missile

$C_{\left( \cdot \right)}^{\left( \cdot \right)}$

coefficients of the aerodynamic forces

$m_{\left( \cdot \right)}^{\left( \cdot \right)}$

coefficients of the aerodynamic moments

${d_i}\!\left( {i = {\omega _z},{\omega _y}} \right)$

unknown disturbances

${\chi _i}\!\left( {i = z,y} \right)$

uncertainties caused by the unmodeled dynamics

$\eta \!\left( t \right)$

unmodeled dynamics

${y_d}\!\left( t \right)$

desired tracking signal

${x_{2c}}\!\left( t \right)$

inner loop virtual signal

${e_1}\!\left( t \right)$ , ${e_2}\!\left( t \right)$

tracking errors

${k_0}$ , ${k_1}$ , ${k_2}$

positive control gains

$r\!\left( t \right)$

dynamic auxiliary signal

${\rm{\Delta }}f$

system unknown nonlinear term

${W_a}$ , ${W_c}$

weights of actor NN and critic NN

${{\rm{\Phi }}_a}\!\left( {{Z_a}} \right)$ , ${{\rm{\Phi }}_c}\!\left( {{Z_c}} \right)$

activation functions of actor NN and critic NN

${\varepsilon _{{W_a}}}$ , ${\varepsilon _{{W_c}}}$

estimation errors of actor NN and critic NN

${\varepsilon _v}$

unknown upper bound of the total disturbance $D\!\left( t \right)$

$u\!\left( t \right)$

designed controller

$J\!\left( t \right)$

integral penalty function

${e_c}\!\left( t \right)$

error variable of critic NN

${E_c}\!\left( t \right)$

error objective function of critic NN

$\hat \cdot $

estimation value of $ \cdot $

$\tilde \cdot $

estimation error of $ \cdot $ and $\tilde \cdot = \hat \cdot - \cdot $

1.0 Introduction

Missiles are a class of weapons equipped with guidance and control equipment to achieve precision flight and strike missions. Their long range, high accuracy, great power and strong defense penetration capability make missiles an important research area. Recently, the high-precision control of missiles has become an important research topic and has been extensively studied. In order to tackle this problem, several approaches have been proposed [Reference Antonios and Brian1]. For example, a robust control scheme based on quaternion feedback is proposed for the attitude control problem of the missile [Reference Song, Kim, Kim and Nam2]. In Ref. [Reference Yang, Chen and Li3], a robust control method based on a disturbance observer is proposed, which not only ensures the robustness of the nonlinear system but also handles the mismatched disturbance by means of the observer, and it has been successfully applied to nonlinear missile systems with various uncertainties and external disturbances. In addition to robust control, sliding mode control is also favoured by scholars because of its unique advantages. Considering that the attitude of the missile is affected by rapid and large parameter variations and partial instability in the boost phase, a multi-sliding-surface attitude controller based on high-order sliding mode and traditional sliding mode is proposed to control the attitude of the missile [Reference Lee, Kim and Moon4]. Although sliding mode control has many advantages, it is prone to chattering, which can harm the system. In light of this situation, two novel smooth sliding mode control methods were proposed, which successfully achieved fast finite-time convergence of the system [Reference Zhou and Yang5].
Backstepping decomposes a complex nonlinear system into several subsystems and is therefore able to deal with mismatched uncertainties. Introducing the backstepping method into the missile control system makes the system more flexible and robust. By utilising the backstepping method, the guidance and control law is divided into a guidance loop and a control loop for design, and a state observer is used for online estimation and compensation of the aerodynamic parameter changes in the model [Reference Shao and Wang6]. The above-mentioned works adopt a single control method; in comparison, a compound control method can improve the performance of the system more effectively. Therefore, by combining the backstepping and sliding mode control methods, an attitude controller is devised for a rotating missile with two moving masses inside [Reference Guo, Yang and Zhao7]. Moreover, in order to suppress the peaking phenomenon and chattering of the backstepping sliding mode controller, filtering technology is introduced [Reference Wang, Zhang, Wang, Peng and Yang8].

However, traditional aerodynamically controlled missiles suffer from a long overload response time. At present, by adding a direct lateral force to form the aerodynamic/reaction-jet compound control system, the dynamic performance of the missile control system can be improved. In Ref. [Reference Fan, Li, Yang and Zhang9], a robust controller for the aerodynamic/reaction-jet compound control missile is designed, and the parameter space method is used to design the equivalent steering gear system so that robustness is ensured when the jet factor changes. Moreover, by combining robust trail tracking control and dynamic inverse control, a blended robust control method is devised to deal with the blended attitude control with lateral thrust and aerodynamic force [Reference Shao, Zhang and Cao10]. To realise the coordinated use of direct force and aerodynamic force as well as fast and accurate overload tracking, a compound control strategy based on fixed-time convergent sliding mode control theory and dynamic control allocation technology is proposed [Reference Liu, Li, Guo, Wang and Wang11]. For the case of parameter perturbation and large external disturbances, a nonsingular fast terminal sliding mode control method is proposed, which improves the convergence speed of the system [Reference Zhao, Liao, Duan and Zhang12]. In Refs. [Reference Xu and Zhou13, Reference Zhang14], the backstepping method is used to design a virtual control law for the control problem of the compound missile.

The above controllers are designed based on modern control theory; nevertheless, modern control theory alone cannot solve many problems in missile engineering applications. With the development of artificial intelligence technology, many scholars have tried to combine artificial intelligence with control theory to solve the missile control problem. Fuzzy control theory is introduced into the control design of the compound missile, and the overload command is tracked by a fuzzy controller [Reference Shi, Ma, Zhang and Lin15]. Moreover, the combination of an artificial intelligence integral controller and a fuzzy controller helps to improve the stability of the missile guidance system [Reference Luo and Zhang16]. In Ref. [Reference Liu, Qu and Liu17], variable universe fuzzy control is introduced to address the influence of aerodynamic parameters on the missile control system. Parallel processing, distributed storage, high fault tolerance and nonlinear operation make artificial neural networks particularly popular. The neural network reference model method is used to design the equivalent steering gear of the compound missile [Reference Fan and Yang18]. Genetic algorithms have great advantages in optimising control systems. For instance, in Ref. [Reference Zhou, Peng and Li19], the gain matrix of the missile controller is optimised by a genetic algorithm, and simulation results show that the optimisation effect of this method cannot be achieved by traditional optimisation methods. With the rapid development of computer technology, adaptive technology has attracted more and more attention from scholars. The combination of adaptive control and intelligent control algorithms is an important direction of missile control research. In Ref. [Reference Dong, Chen, Song and Cao20], fuzzy adaptive proportional–integral–derivative (PID) control is designed to solve the problem that a PID controller cannot adjust its parameters.
Based on the robust adaptive controller, a fuzzy adaptive disturbance observer is used to compensate the disturbances in the linear velocity and angular velocity dynamics of the missile [Reference Chwa21]. The control performance of hypersonic missile in cruise phase is improved by the combination of fuzzy control and adaptive sliding mode control [Reference Yang, Fang, Chai and Wu22]. In Ref. [Reference Cai, Xing, Zhang and Shen23], a robust adaptive neural network state feedback control based on backstepping is proposed for missile systems with unknown parameters and unknown delay inputs, and an approximator based on neural network is used to compensate the uncertainty caused by unknown delay. For the missile with random disturbances and non-affine aerodynamic characteristics, the neural network is used to deal with the non-affine aerodynamic characteristics in the system, and the adaptive term is used to solve the problem of unknown target manoeuver [Reference Chen24]. An improved adaptive genetic algorithm is proposed to solve the nonlinear integer programming model of large-scale missile firepower allocation. Compared with the traditional genetic algorithm, the crossover probability and mutation probability automatically adjusted by the adaptive rule significantly improve the search ability of the algorithm, so as to improve the accuracy of the model [Reference Zhang, Chen and Hao25]. In view of the large jet interference of missile with lateral jets and aerodynamic surfaces, the control allocation algorithm is designed by using adaptive genetic algorithm to meet the real-time requirements of the algorithm; then the variable universe adaptive fuzzy control is used to design the ignition algorithm of attitude control engine, which overcomes the influence of jet interference and solves the problem of low precision of conventional fuzzy control [Reference Shi, Ma and Wang26].

Motivated by the above discussions, an STDO-based adaptive reinforcement learning control method is proposed for the straight air compound missile system with aerodynamic uncertainties and unmodeled dynamics. The main contributions of this paper can be summarised as follows.

  • To deal with the tracking problem of the straight air compound system, adaptive control with an actor-critic design is investigated in this paper: the critic part is used to obtain the cost function that evaluates the tracking performance, and the actor part generates the control policy of the actuator according to the results from the critic part.

  • To improve the control performance, reinforcement learning and neural networks are adopted in the actor-critic design: the critic neural network and the actor neural network are utilised to approximate the cost function and cope with the unmodeled dynamics, respectively.

  • Considering the negative impact of the control signal fluctuation caused by the disturbances of the straight air compound system, the STDO is used to suppress it.

2.0 Problem formulation and preliminaries

2.1 Problem statement

Ignoring the roll channel, the attitude dynamic model of a direct force and aerodynamic force compound missile can be modeled as

(1) \begin{align}\begin{array}{l} \dot \alpha = {\omega _z} - \dfrac{{QS\!\left( {C_y^\alpha \alpha + C_y^{{\delta _z}}{\delta _z}} \right)}}{{mV}} - \dfrac{{{F_y}}}{{mV}}\\[5pt] \dot \beta = {\omega _y} + \dfrac{{QS\!\left( {C_z^\beta \beta + C_z^{{\delta _y}}{\delta _y}} \right)}}{{mV}} + \dfrac{{{F_z}}}{{mV}}\\[5pt] {{\dot \omega }_z} = \dfrac{{QSL}}{{{J_z}}}\!\left( {m_z^\alpha \alpha + m_z^{{\delta _z}}{\delta _z} + m_z^{{{\bar \omega }_z}}{{\bar \omega }_z}} \right) + \dfrac{{{l_z}}}{{{J_z}}}{F_y} + {d_{{\omega _z}}} + {\chi _z}\!\left( {{\omega _z},\eta } \right)\\[5pt] {{\dot \omega }_y} = \dfrac{{QSL}}{{{J_z}}}\!\left( {m_y^\beta \beta + m_y^{{\delta _y}}{\delta _y} + m_y^{{{\bar \omega }_y}}{{\bar \omega }_y}} \right) + \dfrac{{{l_y}}}{{{J_y}}}{F_z} + {d_{{\omega _y}}} + {\chi _y}\!\left( {{\omega _y},\eta } \right)\end{array} \end{align}

where $\alpha $ is the angle of attack, $\beta $ is the sideslip angle, and ${\omega _z}$ , ${\omega _y}$ are the angular velocities. ${J_z}$ , ${J_y}$ denote the rotational inertia. $V$ and $m$ are the velocity and the mass of the missile, respectively. $Q$ is the dynamic pressure. $S$ and $L$ represent the reference area and length. ${\delta _z}$ , ${\delta _y}$ are the output signals of the elevator and rudder, and ${F_y}$ , ${F_z}$ are the direct lateral forces, which act with moment arms ${l_z}$ , ${l_y}$ . $C_{\left( \cdot \right)}^{\left( \cdot \right)}$ denote the coefficients of the aerodynamic forces, while $m_{\!\left( \cdot \right)}^{\!\left( \cdot \right)}$ represent the coefficients of the aerodynamic moments. ${d_i}\!\left( {i = {\omega _z},{\omega _y}} \right)$ are the unknown disturbances. $\eta $ is the unmodeled dynamics, and ${\chi _i}\!\left( {i = z,y} \right)$ represents the uncertainties caused by the unmodeled dynamics.

Then, by defining

(2) \begin{align}\begin{array}{l} {x_1}\!\left( t \right) = {\left[\! {\begin{array}{*{20}{c}} \alpha &\beta \end{array}} \right]^T},{x_2}\!\left( t \right) = {\left[\! {\begin{array}{*{20}{c}} {{\omega _z}}&{{\omega _y}} \end{array}} \right]^T},\\[9pt] {d_1}\!\left( t \right) = \left[ {\begin{array}{*{20}{c}} { - \dfrac{{QS\!\left( {C_y^\alpha \alpha + C_y^{{\delta _z}}{\delta _z}} \right)}}{{mV}} - \dfrac{{{F_y}}}{{mV}}}\\[9pt] {\dfrac{{QS\!\left( {C_z^\beta \beta + C_z^{{\delta _y}}{\delta _y}} \right)}}{{mV}} + \dfrac{{{F_z}}}{{mV}}} \end{array}} \right],{d_2}\!\left( t \right) = \left[ {\begin{array}{*{20}{c}} {\dfrac{{QSL}}{{{J_z}}}\!\left( {m_z^\alpha \alpha {\rm{ + }}m_z^{{{\bar \omega }_z}}{{\bar \omega }_z}} \right) + {d_{{\omega _z}}}}\\[9pt] {\dfrac{{QSL}}{{{J_z}}}\!\left( {m_y^\beta \beta {\rm{ + }}m_y^{{{\bar \omega }_y}}{{\bar \omega }_y}} \right) + {d_{{\omega _y}}}} \end{array}} \right]\\[9pt] B = \left[ {\begin{array}{*{20}{c}} {\dfrac{{QSL}}{{{J_z}}}m_z^{{\delta _z}}}&0&{\dfrac{{{l_z}}}{{{J_z}}}}&0\\[9pt] 0&{\dfrac{{QSL}}{{{J_z}}}m_y^{{\delta _y}}}&0&{\dfrac{{{l_y}}}{{{J_y}}}} \end{array}} \right],u\!\left( t \right) = {\left[ {\begin{array}{*{20}{c}} {{\delta _z}}&{{\delta _y}}&{{F_y}}&{{F_z}} \end{array}} \right]^T}\end{array} \end{align}

the equivalent model of Equation (1) can be given as

(3) \begin{align}\begin{array}{l}{{\dot x}_1}\!\left( t \right) = {x_2}\!\left( t \right) + {d_1}\!\left( t \right)\\[5pt] {{\dot x}_2}\!\left( t \right) = Bu\!\left( t \right) + {d_2}\!\left( t \right) + \chi \!\left( {{x_2}\!\left( t \right),\eta \!\left( t \right)} \right)\end{array}\end{align}

Thus, the design objective is to develop a reinforcement learning-based STDO control scheme that maintains the desired trajectory tracking for the straight air compound missile system given in Equation (3) subject to aerodynamic uncertainties and unmodeled dynamics.
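As a minimal sketch of how Equation (3) can be evaluated numerically, the following Python fragment assembles the input matrix $B$ of Equation (2) and computes the right-hand side of Equation (3). The function names `build_B`, `matvec` and `state_derivative` are illustrative assumptions rather than part of the original model, and the disturbance terms are simply passed in as known vectors.

```python
def build_B(Q, S, L, Jz, Jy, mz_dz, my_dy, lz, ly):
    """Input matrix B of Equation (2) as a 2x4 nested list.

    The factor QSL/Jz is used in both rows, exactly as printed in Equation (2).
    """
    return [
        [Q * S * L / Jz * mz_dz, 0.0, lz / Jz, 0.0],
        [0.0, Q * S * L / Jz * my_dy, 0.0, ly / Jy],
    ]


def matvec(M, v):
    """Plain matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def state_derivative(x2, u, B, d1, d2, chi):
    """Right-hand side of Equation (3): returns (x1_dot, x2_dot).

    x2  : [omega_z, omega_y]
    u   : [delta_z, delta_y, F_y, F_z]
    d1, d2, chi : disturbance and uncertainty vectors (assumed known here)
    """
    x1_dot = [a + b for a, b in zip(x2, d1)]
    Bu = matvec(B, u)
    x2_dot = [a + b + c for a, b, c in zip(Bu, d2, chi)]
    return x1_dot, x2_dot
```

In a simulation, `state_derivative` would be called inside an integration loop, with `d1`, `d2` and `chi` generated by whatever disturbance model is being tested.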

2.2 Assumptions and lemmas

The following assumptions and lemmas are necessary.

Assumption 1. The disturbance moments caused by structural uncertainty are bounded, that is, there exist constants ${\bar d_1}$ , ${\bar d_2}$ such that $\|{d_1}\!\left( t \right)\| \le {\bar d_1}$ , $\|{d_2}\!\left( t \right)\| \le {\bar d_2}$ .

Assumption 2. The desired tracking signal of the system ${y_d}\!\left( t \right)$ is smooth and twice differentiable.

Lemma 1. For any constant $\varepsilon \gt 0$ and vector $\xi \in {R^n}$ , we have

(4) \begin{align}\!\left\| \xi \right\| < \frac{{{\xi ^T}\xi }}{{\sqrt {{\xi ^T}\xi + {\varepsilon ^2}} }} + \varepsilon \end{align}

Lemma 2. [Reference Polycarpou and Ioannou27] Given any constant $\varepsilon \gt 0$ and any variable $z \in R$ , the following inequality holds

(5) \begin{align}0 \le \!\left| z \right| - z\tanh \!\left( {\frac{z}{\varepsilon }} \right) \le \kappa \varepsilon \end{align}

where $\kappa $ is a constant satisfying $\kappa = {{\rm{e}}^{ - \!\left( {\kappa + 1} \right)}}$ , i.e. $\kappa = 0.2785$ .
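Both lemmas can be checked numerically. The sketch below is only a sanity check on sample points, not a proof: it evaluates the gap in Equation (4), solves $\kappa = {{\rm{e}}^{ - \left( {\kappa + 1} \right)}}$ by bisection, and evaluates the middle term of Equation (5). The function names are illustrative assumptions.

```python
import math


def lemma1_gap(xi, eps):
    """Right-hand side minus left-hand side of Equation (4).

    A positive value means the inequality holds for this xi and eps.
    """
    sq = sum(v * v for v in xi)
    return sq / math.sqrt(sq + eps ** 2) + eps - math.sqrt(sq)


def solve_kappa(tol=1e-12):
    """Bisection for the root of k - exp(-(k + 1)) on the interval [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid - math.exp(-(mid + 1.0)) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lemma2_gap(z, eps):
    """Middle term of Equation (5): |z| - z * tanh(z / eps)."""
    return abs(z) - z * math.tanh(z / eps)
```

Calling `solve_kappa()` reproduces the value of $\kappa$ quoted in Lemma 2 to four decimal places, and `lemma2_gap(z, eps)` can be compared against `solve_kappa() * eps` over a grid of `z` values.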

3.0 STDO-based adaptive reinforcement learning control

In this section, as shown in Fig. 1, an STDO-based adaptive reinforcement learning control method is proposed.

Figure 1. The structure of the proposed reinforcement learning-based STDO control algorithm.

Defining the desired output signal as ${y_d}\!\left( t \right)$ and the inner loop virtual signal as ${x_{2c}}\!\left( t \right)$ , the tracking errors of ${x_1}\!\left( t \right)$ and ${x_2}\!\left( t \right)$ can be expressed as

(6) \begin{align}\begin{array}{l}{e_1}\!\left( t \right) = {x_1}\!\left( t \right) - {y_d}\!\left( t \right)\\[5pt] {e_2}\!\left( t \right) = {x_2}\!\left( t \right) - {x_{2c}}\!\left( t \right)\end{array}\end{align}

Combining with Equation (3), one has

(7) \begin{align}\begin{array}{l}{{\dot e}_1}\!\left( t \right) = {x_{2c}}\!\left( t \right) + {e_2}\!\left( t \right) + {d_1}\!\left( t \right) - {{\dot y}_d}\!\left( t \right)\\[5pt] {{\dot e}_2}\!\left( t \right) = Bu\!\left( t \right) + {d_2}\!\left( t \right) + \chi \!\left( {{x_2}\!\left( t \right),\eta \!\left( t \right)} \right) - {{\dot x}_{2c}}\!\left( t \right)\end{array}\end{align}

Then, we can design the inner loop virtual signal as

(8) \begin{align} {x_{2c}}\!\left( t \right) = - {k_0}\int_0^t {{e_1}\!\left( \tau \right)d\tau } - {k_1}{e_1}\!\left( t \right) - \hat{d}_{1}\!\left( t \right) + \dot{y}_{d}\!\left( t \right) \end{align}

where ${\hat{d}_1}\!\left( t \right)$ is the adaptive estimate of ${d_1}\!\left( t \right)$ , ${k_0}$ and ${k_1}$ are the control gains.
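The effect of Equation (8) can be seen by substituting it into the first line of Equation (7): the terms ${\dot y_d}$ cancel and only the estimation error ${\hat d_1} - {d_1}$ remains as a disturbance. The following sketch evaluates both expressions componentwise; the function names and the argument `e0` (standing for the integral of ${e_1}$ ) are illustrative assumptions.

```python
def virtual_signal(e0, e1, d1_hat, yd_dot, k0, k1):
    """Inner loop virtual signal x_2c of Equation (8).

    e0 is the running integral of e1; all vector arguments have equal length.
    """
    return [-k0 * a - k1 * b - c + d
            for a, b, c, d in zip(e0, e1, d1_hat, yd_dot)]


def e1_dot(x2c, e2, d1, yd_dot):
    """First line of Equation (7): e1_dot = x_2c + e2 + d1 - yd_dot."""
    return [a + b + c - d for a, b, c, d in zip(x2c, e2, d1, yd_dot)]


def e1_dot_closed_loop(e0, e1, e2, d1_hat, d1, k0, k1):
    """Result of the substitution: -k0*e0 - k1*e1 + e2 - (d1_hat - d1)."""
    return [-k0 * a - k1 * b + c - (dh - d)
            for a, b, c, dh, d in zip(e0, e1, e2, d1_hat, d1)]
```

For any consistent set of inputs, `e1_dot(virtual_signal(...), ...)` and `e1_dot_closed_loop(...)` should agree up to rounding.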

In order to compensate and suppress the influence of the unknown mismatched disturbance ${d_1}\!\left( t \right)$ , a second-order STDO is designed as follows

(9) \begin{align} \begin{array}{l}\dot{\hat d}_1\!\left( t \right) = - {K_{{d_1}}}\!\left( {{{\hat d}_1}\!\left( t \right) - {P_1}} \right)\\[5pt] {P_1} = - {K_{{P_1}}}\dfrac{{{{\hat x}_1} - {x_1}}}{{{{\!\left\| {{{\hat x}_1} - {x_1}} \right\|}^{\frac{1}{2}}}}} + {P_2}\\[5pt] {{\dot P}_2} = - {K_{{P_2}}}\dfrac{{{{\hat x}_1} - {x_1}}}{{\!\left\| {{{\hat x}_1} - {x_1}} \right\|}}\\[5pt] \dot {\hat x}_1 \!\left( t \right) = {x_2}\!\left( t \right) + {{\hat d}_1}\!\left( t \right)\end{array} \end{align}
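For illustration, one explicit Euler step of Equation (9) for a single scalar channel can be sketched as follows, so that the norm reduces to an absolute value. The step size `dt`, the function name `stdo_step` and the guard that sets the switching terms to zero when ${\hat x_1} = {x_1}$ are assumptions added for the sketch. It makes no claim about convergence for any particular choice of gains.

```python
import math


def stdo_step(state, x1, x2, K_d1, K_P1, K_P2, dt):
    """One explicit Euler step of the observer in Equation (9), scalar case.

    state = (d1_hat, P2, x1_hat). Returns (new_state, P1).
    """
    d1_hat, P2, x1_hat = state
    err = x1_hat - x1
    mag = abs(err)
    if mag > 0.0:
        root_term = err / math.sqrt(mag)  # err / |err|^(1/2)
        sign_term = err / mag             # err / |err|
    else:
        # Assumption: avoid division by zero when the estimate is exact.
        root_term = 0.0
        sign_term = 0.0
    P1 = -K_P1 * root_term + P2
    d1_hat_new = d1_hat + dt * (-K_d1 * (d1_hat - P1))
    P2_new = P2 + dt * (-K_P2 * sign_term)
    x1_hat_new = x1_hat + dt * (x2 + d1_hat)
    return (d1_hat_new, P2_new, x1_hat_new), P1
```

Because the right-hand side is discontinuous at zero error, a small `dt` is advisable when this step is used in a simulation loop.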

The dynamic signal $r\!\left( t \right)$ is introduced, which is defined by

(10) \begin{align}\dot r\!\left( t \right) = - {\gamma _0}r\!\left( t \right) + \rho \!\left( {{x_1}\!\left( t \right),{x_2}\!\left( t \right)} \right),r\!\left( 0 \right) = {r_0}\end{align}

where ${\gamma _0} \in \left( {0,{\rm{\;\;}}{\gamma _1}} \right)$ .

The coupling uncertainty is assumed to satisfy the following inequality

(11) \begin{align}e_2^T\chi \!\left( {{x_2}\!\left( t \right),\eta \!\left( t \right)} \right) \le \!\left\| {e_2^T\!\left( t \right)} \right\|\!\left( {{\varphi _1}\!\left( {{x_2}\!\left( t \right)} \right) + {\varphi _2}\!\left( {\eta \!\left( t \right)} \right)} \right)\end{align}

According to Lemma 1 and Young’s inequality, Equation (11) can be rewritten as

(12) \begin{align}\begin{array}{l}\!\left\| {e_2^T\!\left( t \right)} \right\|{\varphi _1}\!\left( {{x_2}\!\left( t \right)} \right) \le e_2^T\!\left( t \right){{\bar \varphi }_1}\!\left( {{e_2}\!\left( t \right),{x_2}\!\left( t \right)} \right) + {\varepsilon _1}\\[5pt] \!\left\| {e_2^T\!\left( t \right)} \right\|{\varphi _2}\!\left( {\eta \!\left( t \right)} \right) \le e_2^T\!\left( t \right){{\bar \varphi }_2}\!\left( {{e_2}\!\left( t \right),r\!\left( t \right)} \right) + {\varepsilon _2} + \dfrac{1}{4}e_2^T\!\left( t \right){e_2}\!\left( t \right) + {\varepsilon _3}\end{array}\end{align}

where ${\varepsilon _1},{\rm{\;\;}}{\varepsilon _2}\gt 0$ are arbitrary constants,

(13) \begin{align}\begin{array}{l}{{\bar \varphi }_1}\!\left( {{e_2}\!\left( t \right),{x_2}\!\left( t \right)} \right) = \dfrac{{{\varphi _1}\!\left( {{x_2}\!\left( t \right)} \right)e_2^T\!\left( t \right){\varphi _1}\!\left( {{x_2}\!\left( t \right)} \right)}}{{\sqrt {{{\left[ {e_2^T\!\left( t \right){\varphi _1}\!\left( {{x_2}\!\left( t \right)} \right)} \right]}^2} + \varepsilon _1^2} }}\\[5pt] {{\bar \varphi }_2}\!\left( {{e_2}\!\left( t \right),r\!\left( t \right)} \right) = \dfrac{{{\varphi _2} \circ \alpha _1^{ - 1}\!\left( {2r\!\left( t \right)} \right)e_2^T\!\left( t \right){\varphi _2} \circ \alpha _1^{ - 1}\!\left( {2r\!\left( t \right)} \right)}}{{\sqrt {{{\left[ {e_2^T\!\left( t \right){\varphi _2} \circ \alpha _1^{ - 1}\!\left( {2r\!\left( t \right)} \right)} \right]}^2} + \varepsilon _2^2} }}\\[5pt] {\varepsilon _3} = {\left[ {{\varphi _2} \circ \alpha _1^{ - 1}\!\left( {2{\varepsilon _r}} \right)} \right]^2}\end{array}\end{align}

Next, we define

(14) \begin{align}\Delta f = {\bar \varphi _1}\!\left( {{e_2}\!\left( t \right),{x_2}\!\left( t \right)} \right) + {\bar \varphi _2}\!\left( {{e_2}\!\left( t \right),r\!\left( t \right)} \right)\end{align}

Since ${\bar \varphi _1}\!\left( {{e_2}\!\left( t \right),{x_2}\!\left( t \right)} \right)$ and ${\bar \varphi _2}\!\left( {{e_2}\!\left( t \right),r\!\left( t \right)} \right)$ are functions that change irregularly during the control process, the actor NN is introduced to approximate the unknown nonlinear term ${\rm{\Delta }}f$ . The actor NN structures of the ideal term ${\rm{\Delta }}f$ and its estimate ${\rm{\Delta }}\hat{f}$ are designed as follows

(15) \begin{align}\begin{array}{l}\Delta f = W_a^T{\Phi _a}\!\left( {{Z_a}} \right) + {\varepsilon _{{W_a}}}\\[5pt] \Delta {{\hat f}_{}} = \hat W_a^T{\Phi _a}\!\left( {{Z_a}} \right)\end{array}\end{align}

where ${W_a},{\hat W_a} \in {{\rm{R}}^{{p_1} \times n}},{{\rm{\Phi }}_a}\!\left( {{Z_a}} \right) \in {{\rm{R}}^{{p_1} \times 1}},{{\rm{\Phi }}_a}\!\left( {{Z_a}} \right) = {e^{ - \frac{{{{\!\left( {{Z_a} - \mu } \right)}^2}}}{{2{\sigma ^2}}}}},{\rm{\;\;}}{Z_a} = {\left[ {{e_2}\!\left( t \right),{x_2}\!\left( t \right),r\!\left( t \right)} \right]^T}$ , and there exists an upper bound of the estimation error such that $\|{\varepsilon _{{W_a}}}\| \le {\bar \varepsilon _{{W_a}}}$ . Thus, we can obtain that

(16) \begin{align} e_2^T\!\left( t \right)\chi \!\left( {{x_2}\!\left( t \right),\eta \!\left( t \right)} \right) \le e_2^TW_a^T{\Phi _a}\!\left( {{Z_a}} \right) + e_2^T{\varepsilon _{{W_a}}} + \frac{1}{4}e_2^T\!\left( t \right){e_2}\!\left( t \right) + \sum\limits_{i = 1}^3 {{\varepsilon _i}} \end{align}
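The actor NN estimate $\hat W_a^T{\Phi _a}\!\left( {{Z_a}} \right)$ in Equation (15) can be sketched as a Gaussian radial basis function network. The sketch assumes that each hidden node has its own centre and that $({Z_a} - \mu )^2$ denotes the squared Euclidean distance to that centre; the source does not spell out either point, and the function names are illustrative.

```python
import math


def gaussian_features(Z, centres, sigma):
    """Gaussian activations, one per hidden node.

    centres is a list of p1 centre vectors with the same length as Z.
    """
    return [
        math.exp(-sum((z - c) ** 2 for z, c in zip(Z, mu)) / (2.0 * sigma ** 2))
        for mu in centres
    ]


def actor_output(W_hat, phi):
    """Compute W_hat^T * phi for W_hat of shape p1 x n (nested list)."""
    n = len(W_hat[0])
    return [sum(W_hat[j][i] * phi[j] for j in range(len(phi)))
            for i in range(n)]
```

A node whose centre coincides with the input has activation 1, and activations decay with distance at a rate set by `sigma`.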

Then, the matched disturbance ${d_2}\!\left( t \right)$ and actor NN estimation error ${\varepsilon _{{W_a}}}$ of the controlled system need to be considered and compensated. Firstly, the total disturbance $D\!\left( t \right)$ can be constructed in the following form:

(17) \begin{align}D\!\left( t \right) = {d_2}\!\left( t \right) + {\varepsilon _{{W_a}}}\end{align}

Thanks to Assumption 1 and the boundedness of ${\varepsilon _{{W_a}}}$ , the following inequality is satisfied

(18) \begin{align}\!\left| {D\!\left( t \right)} \right| = \left| {{d_2}\!\left( t \right) + {\varepsilon _{{W_a}}}} \right| \le {\varepsilon _v}\end{align}

where ${\varepsilon _v}$ is an unknown positive constant. According to Lemma 2, we can easily obtain

(19) \begin{align}e_2^T\!\left( t \right)D\!\left( t \right) \le \!\left| {{e_2}\!\left( t \right)} \right|{\varepsilon _v} \le {\varepsilon _v}e_2^T\!\left( t \right)\tanh \!\left( {\frac{{{e_2}\!\left( t \right)}}{\alpha }} \right) + \kappa \alpha {\varepsilon _v}\end{align}

Based on the above analysis, the outer loop controller can be designed as

(20) \begin{align}u\!\left( t \right) = {B^{ - 1}}\!\left( \begin{array}{l}- {k_2}{e_2}\!\left( t \right) - {e_1}\!\left( t \right) - {\varphi _\rho }\!\left( {t,{e_2}\!\left( t \right)} \right) - {{\hat \varepsilon }_v}\tanh \!\left( {\dfrac{{{e_2}\!\left( t \right)}}{\alpha }} \right)\\[5pt] - \hat W_a^T{\Phi _a}\!\left( {{Z_a}} \right) - \dfrac{1}{4}{e_2}\!\left( t \right) + {{\dot x}_{2c}}\!\left( t \right)\end{array} \right)\end{align}

where ${k_2}$ is the control gain.
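Since $B$ in Equation (2) has two rows and four columns, ${B^{ - 1}}$ in Equation (20) cannot be an ordinary matrix inverse. The sketch below assumes it is read as the right pseudo-inverse ${B^T}{\left( {B{B^T}} \right)^{ - 1}}$ , which is an interpretation and not something stated in the source. The term ${\varphi _\rho }$ is passed in as a precomputed vector, and all function names are illustrative.

```python
import math


def right_pinv_apply(B, v):
    """Return B^T (B B^T)^{-1} v for a 2 x m matrix B of full row rank."""
    a = sum(x * x for x in B[0])
    b = sum(x * y for x, y in zip(B[0], B[1]))
    d = sum(y * y for y in B[1])
    det = a * d - b * b
    w0 = (d * v[0] - b * v[1]) / det
    w1 = (-b * v[0] + a * v[1]) / det
    return [x * w0 + y * w1 for x, y in zip(B[0], B[1])]


def control_law(B, e1, e2, x2c_dot, actor_out, eps_v_hat, phi_rho, k2, alpha):
    """Controller of Equation (20) with B^{-1} read as a right pseudo-inverse.

    actor_out is the vector W_a_hat^T * Phi_a(Z_a); alpha is the tanh scaling.
    """
    v = [
        -k2 * e2[i] - e1[i] - phi_rho[i]
        - eps_v_hat * math.tanh(e2[i] / alpha)
        - actor_out[i] - 0.25 * e2[i] + x2c_dot[i]
        for i in range(2)
    ]
    return right_pinv_apply(B, v)
```

With this reading, $Bu$ reproduces the bracketed term of Equation (20) exactly, and among all inputs that do so the pseudo-inverse selects the one with the smallest Euclidean norm. Other control allocation rules would be equally compatible with the equation.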

The critic NN, which appraises the control performance and provides feedback to the actor NN, is now introduced in detail. Firstly, we define the integral penalty function of the controlled system as follows

(21) \begin{align} J\!\left( t \right) = \int_t^\infty {\!\left[ {e_1^T\!\left( \tau \right)Q{e_1}\!\left( \tau \right) + {u^T}\!\left( \tau \right)Ru\!\left( \tau \right)} \right]d\tau } \end{align}

Then, we approximate the penalty function $J\!\left( t \right)$ by designing the critic NN

(22) \begin{align}\hat J\!\left( t \right) = {\hat W_{\rm{c}}}^T{\Phi _c}\!\left( {{Z_c}} \right)\end{align}

where ${\hat W_c} \in {{\rm{R}}^{{p_2} \times n}},{{\rm{\Phi }}_c}\!\left( {{Z_c}} \right) \in {{\rm{R}}^{{p_2} \times 1}},{{\rm{\Phi }}_c}\!\left( {{Z_c}} \right) = {e^{ - \frac{{{{\!\left( {{Z_c} - \mu } \right)}^2}}}{{2{\sigma ^2}}}}},{\rm{\;\;}}{Z_c} = {e_1}\!\left( t \right)$ .

Constructing the residual mean square error function of the critic NN structure, one has

(23) \begin{align}\begin{array}{l}{e_c}\!\left( t \right) = e_1^T\!\left( t \right)Q{e_1}\!\left( t \right) + {u^T}\!\left( t \right)Ru\!\left( t \right) + \hat W_c^T\,\nabla {\Phi _c}\,{{\dot x}_1}\!\left( t \right)\\[5pt] {E_c}\!\left( t \right) = \dfrac{1}{2}e_c^T\!\left( t \right){e_c}\!\left( t \right)\end{array}\end{align}

where $\nabla {{\rm{\Phi }}_c} = \partial {{\rm{\Phi }}_c}\!\left( {{x_1}} \right)/\partial {x_1}$ and $\nabla {{\rm{\Phi }}_c} \in {{\rm{R}}^{{p_2} \times n}}$ . The weight of the critic NN is updated to minimise ${E_c}\!\left( t \right)$ ; thus, the update law of the critic network weight is obtained according to the gradient descent method

(24) \begin{align} {{\dot {\hat W}}_c} &= - {\Gamma _{W_c}}{\lambda _{W_c}}{e_c} - {\Gamma _{W_c}}{\lambda _{W_c}}{{\hat W}_c}\nonumber\\[5pt] &= - {\Gamma _{{W_c}}}\!\left[ {{\lambda _{{W_c}}}\!\left( {\lambda _{{W_c}}^T{{\hat W}_c} + e_1^T\!\left( t \right)Q{e_1}\!\left( t \right) + {u^T}\!\left( t \right)Ru\!\left( t \right)} \right)} \right] - {\Gamma _{{W_c}}}{\lambda _{{W_c}}}{{\hat W}_c} \end{align}

where ${\lambda _{{W_c}}} = \nabla {{\rm{\Phi }}_c}{\dot x_1}\!\left( t \right),{\rm{\;\;}}{{\rm{\Gamma }}_{{W_c}}},{\rm{\;\;}}{\lambda _{{W_c}}}\gt 0$ .
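A discrete-time sketch of the gradient descent update in Equation (24) is given below for a scalar critic output. Two simplifications are assumptions of the sketch: the regressor ${\lambda _{{W_c}}}$ is passed in as a precomputed vector `lam`, and the last term of Equation (24) is implemented as a leakage term with a separate positive scalar gain `leak`, because the source uses the same symbol for both roles. The names `critic_step`, `gamma`, `leak` and `dt` are illustrative.

```python
def critic_step(W_c, lam, running_cost, gamma, leak, dt):
    """One explicit Euler step of the critic weight update.

    W_c          : current critic weights (list of length p2)
    lam          : regressor grad(Phi_c) * x1_dot (list of length p2)
    running_cost : e1^T Q e1 + u^T R u evaluated at the current time
    Returns (new_weights, e_c), where e_c is the residual of Equation (23).
    """
    e_c = running_cost + sum(l * w for l, w in zip(lam, W_c))
    W_new = [
        w + dt * (-gamma * l * e_c - gamma * leak * w)
        for w, l in zip(W_c, lam)
    ]
    return W_new, e_c
```

With `leak = 0` the step is plain gradient descent on ${E_c}$ , so for a sufficiently small `dt` the residual evaluated with the new weights is smaller in magnitude than before the step.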

Finally, the adaptive laws for ${\hat W_a},{\hat W_c},{\hat \varepsilon _v}$ are listed as follows

(25) \begin{align} \dot{\hat W}_a &= {\Gamma _{{W_a}}}{\Phi _a}\!\left( {{Z_a}} \right)\!\left( {e_2^T\!\left( t \right) + \hat J\Omega _{}^T} \right) - {\Gamma _{{W_a}}}{\lambda _{{W_a}}}{{\hat W}_a}\nonumber\\[5pt] \dot{\hat W}_c &= - {\Gamma _{{W_c}}}\!\left[ {{\lambda _{{W_c}}}(\lambda _{{W_c}}^T{{\hat W}_c} + e_1^TQ{e_1} + {u^T}Ru)} \right] - {\Gamma _{{W_c}}}{\lambda _{{W_c}}}{{\hat W}_c}\nonumber\\[5pt] \dot {\hat \varepsilon }_v &= e_2^T\!\left( t \right)\tanh \!\left( {\frac{{{e_2}\!\left( t \right)}}{\alpha }} \right) - {\lambda _\varepsilon }{{\hat \varepsilon }_v} \end{align}
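The third law in Equation (25) is the simplest of the three and can be sketched as a single explicit Euler step. The step size `dt` and the function name are assumptions of the sketch; the laws for ${\hat W_a}$ and ${\hat W_c}$ are not reproduced here.

```python
import math


def eps_v_hat_step(eps_v_hat, e2, alpha, lam_eps, dt):
    """One explicit Euler step of the eps_v_hat law in Equation (25).

    The drive term e2^T tanh(e2 / alpha) is non-negative, and the leakage
    term -lam_eps * eps_v_hat pulls the estimate back towards zero.
    """
    drive = sum(e * math.tanh(e / alpha) for e in e2)
    return eps_v_hat + dt * (drive - lam_eps * eps_v_hat)
```

For a constant drive term the iteration settles at `drive / lam_eps`, which shows how the leakage gain trades estimate size against responsiveness.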

For the sake of analysis, we define ${{\tilde{\ast}}} = {{\hat{\ast}}} - {{\ast}}$ to represent the estimation error of the unknown variable ${\rm{*}}$ .

4.0 Stability analysis

Defining ${e_0}\!\left( t \right) = \int_0^t {{e_1}\!\left( s \right)ds}$ and combining Equations (7), (8) and (20), the closed-loop error dynamics can be obtained as

(26) \begin{align}\begin{aligned}{{\dot e}_0}\!\left( t \right) &= {e_1}\!\left( t \right)\\[5pt] {{\dot e}_1}\!\left( t \right) &= - {k_0}{e_0}\!\left( t \right) - {k_1}{e_1}\!\left( t \right) + {e_2}\!\left( t \right) - {{\tilde d}_1}\!\left( t \right)\\[5pt] {{\dot e}_2}\!\left( t \right) &= - {k_2}{e_2}\!\left( t \right) - {e_1}\!\left( t \right) + {d_2}\!\left( t \right) + \chi \!\left( {{x_2}\!\left( t \right),\eta \!\left( t \right)} \right)\\[5pt] &- \hat W_a^T{\Phi _a}\!\left( {{Z_a}} \right) - {{\hat \varepsilon }_v}\tanh \!\left( {\frac{{{e_2}\!\left( t \right)}}{\alpha }} \right) - \frac{1}{4}{e_2}\!\left( t \right) - {\varphi _\rho }\!\left( {t,{e_2}\!\left( t \right)} \right)\end{aligned}\end{align}

The purpose of this paper is to construct an efficient controller that ensures the stability of the closed-loop error dynamics described in Equation (26). The stability of the straight air compound missile system with the proposed control scheme is established by the following theorem.

Theorem 1. Consider the straight air compound missile system described in Equation (3). Suppose that Assumptions 1 and 2 are satisfied. If the inner loop control law and the outer loop control law are given by Equations (8) and (20), the STDO is designed as Equation (9), and the adaptive laws are designed as Equation (25), then the closed-loop control system is stable in the presence of unmodeled dynamics and all the signals are bounded.

Proof. The Lyapunov function $V$ is selected as

(27) \begin{align}V &= {V_1} + {V_2}\nonumber\\[5pt] {V_1} &= \frac{1}{2}e_0^T\!\left( t \right){e_0}\!\left( t \right) + \frac{1}{2}e_1^T\!\left( t \right){e_1}\!\left( t \right)\\[5pt] {V_2} &= \frac{1}{2}e_2^T\!\left( t \right){e_2}\!\left( t \right) + \frac{1}{2}\textrm{Tr}\!\left( {\tilde W_a^T\Gamma _{{W_a}}^{ - 1}{{\tilde W}_a}} \right) + \frac{1}{2}\textrm{Tr}\!\left( {\tilde W_c^T\Gamma _{{W_c}}^{ - 1}{{\tilde W}_c}} \right) + \frac{1}{2}{{\tilde \varepsilon }^T}_v{{\tilde \varepsilon }_v} + \frac{{r\!\left( t \right)}}{{{\Gamma _r}}}{\rm{ + }}{J^ * }\!\left( {{x_1}} \right)\nonumber\end{align}

Taking the time derivative of both sides of Equation (27), we obtain

(28) \begin{align} \dot V &= {{\dot V}_1} + {{\dot V}_2}\nonumber\\[5pt] {{\dot V}_1} &= e_0^T\!\left( t \right){{\dot e}_0}\!\left( t \right) + e_1^T\!\left( t \right){{\dot e}_1}\!\left( t \right)\\[5pt] {\dot V}_2 &= e_2^T \!\left( t \right){{\dot e}_2} \!\left( t \right) + \textrm{Tr} \!\left( {{\tilde{W}}_a^T \Gamma _{{W_a}}^{-1} {\dot{\tilde W}}_a} \right) + \textrm{Tr}\!\left( {\tilde W_c^T\Gamma _{{W_c}}^{ - 1}{\dot{\tilde W}}_c} \right) + {{\tilde \varepsilon }^T}_v {\dot {\tilde \varepsilon }_v} - \frac{\gamma _0} {\Gamma_r} r\!\left( t \right){\rm{ + }}\frac{\rho \!\left( t \right)} {\Gamma _r}{\rm{+}} J{_x^ {*T}}{{\dot x}_1}\nonumber \end{align}

Substituting Equation (26) into the ${\dot V_1}$ term of Equation (28), one has

(29) \begin{align}{\dot V_1} = e_0^T\!\left( t \right){e_1}\!\left( t \right) - {k_0}e_1^T\!\left( t \right){e_0}\!\left( t \right) - {k_1}e_1^T\!\left( t \right){e_1}\!\left( t \right) + e_1^T\!\left( t \right){e_2}\!\left( t \right) - e_1^T\!\left( t \right){\tilde d_1}\!\left( t \right)\end{align}

Defining ${\bar e_1} = {\left[ {e_0^T\!\left( t \right),e_1^T\!\left( t \right)} \right]^T}$ and utilising the following inequality

(30) \begin{align}e_1^T\!\left( t \right){{\tilde d}_1}\!\left( t \right) &\le \frac{1}{2}e_1^T\!\left( t \right)e_1^{}\!\left( t \right) + \frac{1}{2}\tilde d_1^T\!\left( t \right){{\tilde d}_1}\!\left( t \right)\nonumber\\[5pt] &\le \frac{1}{2}e_1^T\!\left( t \right)e_1^{}\!\left( t \right) + \frac{1}{2}\varepsilon _d^2\end{align}

Equation (29) can be rewritten as

(31) \begin{align}\begin{array}{l}{{\dot V}_1} \le - \bar e_1^T\!\left( t \right)A{{\bar e}_1}\!\left( t \right) + e_1^T\!\left( t \right){e_2}\!\left( t \right) + \dfrac{1}{2}\varepsilon _d^2\\[5pt] A = \left[ {\begin{array}{*{20}{c}} 0&{ - 1}\\[5pt] {{k_0}}&{ - \dfrac{1}{2} + {k_1}} \end{array}} \right]\end{array}\end{align}

where we assume that $\|{\tilde d_1}\!\left( t \right)\| \le {\varepsilon _d}$ holds and ${\varepsilon _d}$ is a positive constant.
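The step from Equation (29) to Equation (31) can be checked numerically. The following sketch evaluates a scalar version of both sides for random samples that respect $|\tilde d_1| \le \varepsilon_d$ and counts how often the bound is violated. The sampling ranges, the value of $\varepsilon_d$ and the helper names are assumptions for illustration; the gains are those listed in the simulation section.

```python
import random


def v1_dot(e0, e1, e2, d1_err, k0, k1):
    """Scalar version of Equation (29)."""
    return e0 * e1 - k0 * e1 * e0 - k1 * e1 * e1 + e1 * e2 - e1 * d1_err


def v1_dot_bound(e0, e1, e2, eps_d, k0, k1):
    """Scalar version of the right-hand side of Equation (31)."""
    # ebar^T A ebar with A = [[0, -1], [k0, k1 - 1/2]] and ebar = [e0, e1]^T
    quad = -e0 * e1 + k0 * e1 * e0 + (k1 - 0.5) * e1 * e1
    return -quad + e1 * e2 + 0.5 * eps_d ** 2


def count_violations(n_samples=10000, k0=3.0, k1=6.0, eps_d=0.5, seed=0):
    """Count samples for which (29) exceeds the bound (31)."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(n_samples):
        e0, e1, e2 = (rng.uniform(-5.0, 5.0) for _ in range(3))
        d1_err = rng.uniform(-eps_d, eps_d)  # respects |d1_err| <= eps_d
        lhs = v1_dot(e0, e1, e2, d1_err, k0, k1)
        rhs = v1_dot_bound(e0, e1, e2, eps_d, k0, k1)
        if lhs > rhs + 1e-9:
            violations += 1
    return violations


violations = count_violations()
```

The gap between the two sides is at least $\frac{1}{2}(e_1 + \tilde d_1)^2$, so no violations are expected.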

Using Equation (16) and the ${\dot e_2}\!\left( t \right)$ term of Equation (26), we can get the following inequality

(32) \begin{align} \begin{aligned}e_2^T\!\left( t \right){{\dot e}_2}\!\left( t \right) &\le - {k_2}e_2^T\!\left( t \right){e_2}\!\left( t \right) - e_2^T\!\left( t \right){e_1}\!\left( t \right) - e_2^T\!\left( t \right){\varphi _\rho }\!\left( {t,{e_2}\!\left( t \right)} \right) - e_2^T\!\left( t \right){{\hat \varepsilon }_v}\tanh \!\left( {\frac{{{e_2}\!\left( t \right)}}{\alpha }} \right)\\[5pt] &- e_2^T\!\left( t \right)\tilde W_a^T{\Phi _a}\!\left( {{Z_a}} \right) + e_2^T\!\left( t \right)\!\left( {{d_2}\!\left( t \right) + {\varepsilon _{{W_a}}}} \right) + \sum\limits_{i = 1}^3 {{\varepsilon _i}}\end{aligned} \end{align}

where

(33) \begin{align}&e_2^T\!\left( t \right)\!\left( {{d_2}\!\left( t \right) + {\varepsilon _{{W_a}}}} \right) - e_2^T\!\left( t \right){{\hat \varepsilon }_v}\tanh \!\left( {\frac{{{e_2}\!\left( t \right)}}{\alpha }} \right)\nonumber\\[5pt] &= e_2^T\!\left( t \right)\!\left( {D\!\left( t \right)} \right) - e_2^T\!\left( t \right){{\hat \varepsilon }_v}\tanh \!\left( {\frac{{{e_2}\!\left( t \right)}}{\alpha }} \right)\\[5pt] &\le e_2^T\!\left( t \right){\varepsilon _v}\tanh \!\left( {\frac{{{e_2}\!\left( t \right)}}{\alpha }} \right) + \kappa \alpha {\varepsilon _v} - e_2^T\!\left( t \right){{\hat \varepsilon }_v}\tanh \!\left( {\frac{{{e_2}\!\left( t \right)}}{\alpha }} \right)\nonumber\\[5pt] &= - e_2^T\!\left( t \right){{\tilde \varepsilon }_v}\tanh \!\left( {\frac{{{e_2}\!\left( t \right)}}{\alpha }} \right) + \kappa \alpha {\varepsilon _v}\nonumber\end{align}

Thus, the following inequality can be readily obtained

(34) \begin{align} e_2^T\!\left( t \right){{\dot e}_2}\!\left( t \right) &\le - {k_2}e_2^T\!\left( t \right){e_2}\!\left( t \right) - e_2^T\!\left( t \right){e_1}\!\left( t \right) - e_2^T\!\left( t \right){\varphi _\rho }\!\left( {t,{e_2}\!\left( t \right)} \right)\nonumber\\[5pt] &- e_2^T\!\left( t \right){{\tilde \varepsilon }_v}\tanh \!\left( {\frac{{{e_2}\!\left( t \right)}}{\alpha }} \right) - e_2^T\!\left( t \right)\tilde W_a^T{\Phi _a}\!\left( {{Z_a}} \right) + \sum\limits_{i = 1}^3 {{\varepsilon _i}} + \kappa \alpha {\varepsilon _v} \end{align}
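The $\kappa\alpha\varepsilon_v$ term in Equations (33) and (34) comes from the standard inequality $0 \le |x| - x\tanh(x/\alpha) \le \kappa\alpha$. The sketch below checks the scalar form of this inequality on a grid, assuming that $\kappa$ takes its usual value of 0.2785; the grid resolution and the tested values of $\alpha$ are arbitrary choices for illustration.

```python
import math

KAPPA = 0.2785  # assumed value of kappa (the usual constant for this inequality)


def tanh_gap(x, alpha):
    """|x| - x * tanh(x / alpha), the quantity bounded by kappa * alpha."""
    return abs(x) - x * math.tanh(x / alpha)


def max_gap(alpha, n=20001, span=20.0):
    """Largest gap found on a uniform grid over [-span * alpha, span * alpha]."""
    worst = 0.0
    for i in range(n):
        x = (-span + 2.0 * span * i / (n - 1)) * alpha
        worst = max(worst, tanh_gap(x, alpha))
    return worst


gaps = {alpha: max_gap(alpha) for alpha in (0.01, 0.1, 1.0)}
```

The largest gap scales linearly with $\alpha$, so a smaller $\alpha$ tightens the bound at the cost of a steeper $\tanh$ term in the control law.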

Substituting Equation (34) into the ${\dot V_2}$ term of Equation (28), one has

(35) \begin{align}{{\dot V}_2} &\le - {k_2}e_2^T\!\left( t \right){e_2}\!\left( t \right) - e_2^T\!\left( t \right){e_1}\!\left( t \right) - e_2^T\!\left( t \right){\varphi _\rho }\!\left( {t,{e_2}\!\left( t \right)} \right) - e_2^T\!\left( t \right){{\tilde \varepsilon }_v}\tanh \!\left( {\frac{{{e_2}\!\left( t \right)}}{\alpha }} \right) - e_2^T\!\left( t \right)\tilde W_a^T{\Phi _a}\!\left( {{Z_a}} \right)\nonumber\\[5pt] &+ \textrm{Tr}\!\left( {\tilde W_a^T \Gamma _{{W_a}}^{-1} \dot{\tilde W}_a} \right) + \textrm{Tr}\!\left( {\tilde W_c^T\Gamma _{{W_c}}^{ - 1}\dot{\tilde W}_c} \right) + {{\tilde \varepsilon }^T}_v {{\dot {\tilde \varepsilon} }_v} - \frac{{{\gamma _0}}}{{{\Gamma _r}}}r\!\left( t \right){\rm{ + }}\frac{{\rho \!\left( t \right)}}{{{\Gamma _r}}} + J{_x^ {*T}}{{\dot x}_1} + \sum\limits_{i = 1}^3 {{\varepsilon _i}} + \kappa \alpha {\varepsilon _v} \end{align}

For any vector $\xi \in \mathbb{R}^n$, we define

(36) \begin{align}\mathrm{Tanh}\!\left( {\xi \!\left( t \right)} \right) = {\left[ {\tanh {\xi _1}\!\left( t \right),\tanh {\xi _2}\!\left( t \right), \cdots ,\tanh {\xi _n}\!\left( t \right)} \right]^T}\end{align}

Therefore, the following formula holds

(37) \begin{align}&\frac{{\rho \!\left( t \right)}}{{{\Gamma _r}}} = \frac{{\rho \!\left( t \right)}}{{{\Gamma _r}}}\!\left( {1 - 16\textrm{Tanh}^T \!\left({\frac{{e_2}\!\left( t \right)}{{{\varepsilon _\rho }}}}\right)\textrm{Tanh}\!\left({\frac{{e_2}\!\left( t \right)}{{{\varepsilon _\rho }}}}\right)} \right) + e_2^T\!\left( t \right){\varphi _\rho }\!\left( {t,{e_2}\!\left( t \right)} \right)\nonumber\\[5pt] &{\varphi _\rho }\!\left( {t,{e_2}\!\left( t \right)} \right) = \frac{{16{e_2}\!\left( t \right)\rho \!\left( t \right)}}{{{\Gamma _r}e_2^T\!\left( t \right){e_2}\!\left( t \right)}}\textrm{Tanh}^T \!\left({\frac{{e_2}\!\left( t \right)}{{{\varepsilon _\rho }}}}\right)\textrm{Tanh}\!\left({\frac{{e_2}\!\left( t \right)}{{{\varepsilon _\rho }}}}\right)\end{align}
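The decomposition in Equation (37) can be verified directly. The following sketch implements the element-wise $\mathrm{Tanh}$ of Equation (36) and the term $\varphi_\rho$, then splits $\rho/\Gamma_r$ into the two parts of Equation (37). The values $\Gamma_r = 120$ and $\varepsilon_\rho = 0.1$ are those listed in the simulation section; the sample values of $e_2$ and $\rho$ and the handling of $e_2 = 0$ are assumptions made for this example.

```python
import math


def tanh_vec(xi):
    """Element-wise Tanh of Equation (36)."""
    return [math.tanh(x) for x in xi]


def phi_rho(e2, rho, gamma_r, eps_rho):
    """The term phi_rho(t, e2) defined in Equation (37)."""
    th = tanh_vec([x / eps_rho for x in e2])
    th_sq = sum(v * v for v in th)
    norm_sq = sum(x * x for x in e2)
    if norm_sq == 0.0:
        # The formula is undefined at e2 = 0; zero is used in this sketch
        # because the expression tends to zero as e2 tends to zero.
        return [0.0 for _ in e2]
    scale = 16.0 * rho * th_sq / (gamma_r * norm_sq)
    return [scale * x for x in e2]


def split_terms(e2, rho, gamma_r, eps_rho):
    """Return the two parts of the right-hand side of Equation (37)."""
    th = tanh_vec([x / eps_rho for x in e2])
    th_sq = sum(v * v for v in th)
    residual = rho / gamma_r * (1.0 - 16.0 * th_sq)
    cross = sum(x * p for x, p in zip(e2, phi_rho(e2, rho, gamma_r, eps_rho)))
    return residual, cross


GAMMA_R, EPS_RHO = 120.0, 0.1   # values from the simulation section
E2, RHO = [0.05, -0.02], 1.5    # hypothetical sample point
residual, cross = split_terms(E2, RHO, GAMMA_R, EPS_RHO)
```

At this sample point the residual part is negative, because $\mathrm{Tanh}^T\mathrm{Tanh}$ exceeds $1/16$; the residual is positive only in a small neighbourhood of $e_2 = 0$.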

Then, combining Equations (31), (35)–(37), we have

(38) \begin{align}\dot V &= {{\dot V}_1} + {{\dot V}_2}\nonumber \\[5pt] &\le - \bar e_1^T\!\left( t \right)A{{\bar e}_1}\!\left( t \right) - {k_2}e_2^T\!\left( t \right){e_2}\!\left( t \right) - e_2^T\!\left( t \right){{\tilde \varepsilon }_v}\tanh \!\left( {\frac{{{e_2}\!\left( t \right)}}{\alpha }} \right) - e_2^T\tilde W_a^T{\Phi _a}\!\left( {{Z_a}} \right)\nonumber \\[5pt] &+ \textrm{Tr} \!\left( {\tilde W_a^T\Gamma _{{W_a}}^{ - 1} {{\dot {\tilde W}}_a}} \right) + \textrm{Tr}\!\left( {\tilde W_c^T\Gamma _{{W_c}}^{ - 1} {{\dot {\tilde W}}_c}} \right) + {{\tilde \varepsilon }^T}_v {{\dot{\tilde \varepsilon}}_v} - \frac{{{\gamma _0}}}{{{\Gamma _r}}}r\!\left( t \right) + J{_x^{*T} }{{\dot x}_1}\\[5pt] &+ \frac{{\rho \!\left( t \right)}}{{{\Gamma _r}}}\!\left( {1 - 16\textrm{Tanh}^T \!\left({\frac{{e_2}\!\left( t \right)}{{{\varepsilon _\rho }}}}\right)\textrm{Tanh}\!\left({\frac{{e_2}\!\left( t \right)}{{{\varepsilon _\rho }}}}\right)} \right) + \frac{1}{2}\varepsilon _d^2 + \sum\limits_{i = 1}^3 {{\varepsilon _i}} + \kappa \alpha {\varepsilon _v}\nonumber \end{align}

By using the adaptive laws in Equation (25) and considering the following inequalities

(39) \begin{align} - {\tilde \varepsilon ^T}_v {\hat \varepsilon _v} \le - \frac{1}{2}{\tilde \varepsilon ^2}_v + \frac{1}{2}\varepsilon _v^2 \end{align}

then, it can be concluded that

(40) \begin{align}\dot V &\le - \bar e_1^T\!\left( t \right)A{{\bar e}_1}\!\left( t \right) - {k_2}e_2^T\!\left( t \right){e_2}\!\left( t \right) - \frac{{{\lambda _\varepsilon }}}{2}{{\tilde \varepsilon }^2}_v + \textrm{Tr}\!\left( {\tilde W_a^T\!\left( {{\Phi _a}\hat J\Omega _{}^T - {\lambda _{{W_a}}}{{\hat W}_a}} \right)} \right)\nonumber \\[5pt] &+ \textrm{Tr}\!\left( {\tilde W_c^T\!\left( { - {\lambda _{{W_c}}}(\lambda _{{W_c}}^T{{\hat W}_c} + {\varepsilon _c}) - {\lambda _{{W_c}}}{{\hat W}_c}} \right)} \right) + J{_x^{*T}}{{\dot x}_1} - \frac{{{\gamma _0}}}{{{\Gamma _r}}}r\!\left( t \right) + \frac{1}{2}\varepsilon _d^2\\[5pt] &+ \sum\limits_{i = 1}^3 {{\varepsilon _i}} + \frac{{{\lambda _\varepsilon }}}{2}\varepsilon _v^2 + \frac{{\rho \!\left( t \right)}}{{{\Gamma _r}}}\!\left( {1 - 16\textrm{Tanh}^T \!\left({\frac{{e_2}\!\left( t \right)}{{{\varepsilon _\rho }}}}\right)\textrm{Tanh} \!\left({\frac{{e_2}\!\left( t \right)}{{{\varepsilon _\rho }}}}\right)} \right)\nonumber \end{align}
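Inequality (39) follows from writing $\hat\varepsilon_v = \tilde\varepsilon_v + \varepsilon_v$ and completing the square: the difference between its two sides equals $\frac{1}{2}\hat\varepsilon_v^2$. The sketch below confirms this for scalar values on a small grid; the function names and the grid are hypothetical choices for illustration.

```python
def lhs_39(eps_hat, eps):
    """Left-hand side of Inequality (39): -eps_tilde * eps_hat."""
    eps_tilde = eps_hat - eps
    return -eps_tilde * eps_hat


def rhs_39(eps_hat, eps):
    """Right-hand side of Inequality (39)."""
    eps_tilde = eps_hat - eps
    return -0.5 * eps_tilde ** 2 + 0.5 * eps ** 2


# Scalar grid of estimates and true values in [-10, 10].
grid = [i * 0.5 for i in range(-20, 21)]
max_gap_error = max(
    abs((rhs_39(h, e) - lhs_39(h, e)) - 0.5 * h * h)
    for h in grid for e in grid
)
min_gap = min(rhs_39(h, e) - lhs_39(h, e) for h in grid for e in grid)
```

Because the gap is a perfect square, the inequality holds for every estimate, and equality occurs only when $\hat\varepsilon_v = 0$.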

Considering the $\textrm{Tr}\!\left( {\tilde W_a^T\!\left( {{\Phi _a}\hat J\Omega _{}^T - {\lambda _{{W_a}}}{{\hat W}_a}} \right)} \right)$ term of Equation (40), we obtain

(41) \begin{align}&\textrm{Tr}\!\left( {\tilde W_a^T\!\left( {{\Phi _a}\hat J\Omega _{}^T - {\lambda _{{W_a}}}{{\hat W}_a}} \right)} \right)\nonumber \\[5pt] &= \textrm{Tr}\!\left( {\tilde W_a^T{\Phi _a}\tilde W_c^T{\Phi _c}\Omega _{}^T} \right) + \textrm{Tr}\!\left( {\tilde W_a^T{\Phi _a}W_c^T{\Phi _c}\Omega _{}^T} \right) - \textrm{Tr}\!\left( {\tilde W_a^T{\lambda _{{W_a}}}{{\hat W}_a}} \right)\nonumber \\[5pt] &= \tilde W_c^T{\Phi _c}\Omega _{}^T\tilde W_a^T{\Phi _a} + {W_c}^T{\Phi _c}\Omega _{}^T{{\tilde W}_a}^T{\Phi _a} - {\lambda _{{W_a}}}\textrm{Tr}\!\left( {\tilde W_a^T{{\hat W}_a}} \right)\nonumber \\[5pt] &\le {\rho _1}\tilde W_c^T{{\tilde W}_c} + \frac{{{\lambda _{\max }}\!\left( {{\Phi _c}\Omega _{}^T{\Omega _{}}\Phi _c^T} \right)\bar \Phi _a^2}}{{4{\rho _1}}}\textrm{Tr}\!\left ( {\tilde W_a^T{{\tilde W}_a}}\right )\\[5pt] &+ {\rho _2}W_c^T{W_c} + \frac{{{\lambda _{\max }}\!\left( {{\Phi _c}\Omega _{}^T\Omega \Phi _c^T} \right)\bar \Phi _a^2}}{{4{\rho _2}}}\textrm{Tr}\!\left ( {\tilde W_a^T{{\tilde W}_a}}\right ) - \frac{{{\lambda _{{W_a}}}}}{2}\tilde W_a^T{{\tilde W}_a} + \frac{{{\lambda _{{W_a}}}}}{2}W_a^T{W_a}\nonumber \end{align}

Then, considering the $\textrm{Tr}\!\left( {\tilde W_c^T\!\left( { - {\lambda _{{W_c}}}\!\left( {\lambda _{{W_c}}^T{{\hat W}_c} + {\varepsilon _c}} \right){ - \lambda_{{W_c}}}{{\hat W}_c}} \right)} \right)$ term of Equation (40), the following inequality holds

(42) \begin{align} &\textrm{Tr}\!\left( {\tilde W_c^T\!\left( { - {\lambda _{{W_c}}}(\lambda _{{W_c}}^T{{\hat W}_c} + {\varepsilon _c}) - {\lambda _{{W_c}}}{{\hat W}_c}} \right)} \right)\nonumber \\[5pt] &= \textrm{Tr}\!\left( { - \tilde W_c^T{\lambda _{{W_c}}}\lambda _{{W_c}}^T{{\tilde W}_c} - \tilde W_c^T{\lambda _{{W_c}}}{\varepsilon _c} - \tilde W_c^T{\lambda _{{W_c}}}{{\hat W}_c}} \right)\nonumber \\[5pt] &= - \tilde W_c^T{\lambda _{{W_c}}}\lambda _{{W_c}}^T{{\tilde W}_c} - \tilde W_c^T{\lambda _{{W_c}}}{\varepsilon _c} - \tilde W_c^T{\lambda _{{W_c}}}{{\hat W}_c}\\[5pt] &\le \!\left( {{\rho _{\rm{3}}} + \frac{{{\lambda _{\max }}\!\left( {{{\bar \lambda }_{{W_c}}}} \right)}}{{4{\rho _{\rm{3}}}}}} \right)\tilde W_c^T{{\tilde W}_c} + {\rho _{\rm{4}}}{\lambda _{\max }}\!\left( {{\lambda _{{W_c}}}\lambda _{{W_c}}^T} \right)\tilde W_c^T{{\tilde W}_c} + \frac{1}{{4{\rho _{\rm{4}}}}}\varepsilon _c^2 - \frac{{{\lambda _c}}}{2}\tilde W_c^T{{\tilde W}_c} + \frac{{{\lambda _c}}}{2}W_c^T{W_c}\nonumber \end{align}

Moreover, the $J_x^{{\rm{*}}T}{\dot x_1}$ term of Equation (28) satisfies

(43) \begin{align}J{_x^ {*T}}{\dot x_1} \le - {\lambda _{\min }}\{ Q\} \|{e_1}\|^2 - {\lambda _{\min }}\{ R\} \|u\|^2\end{align}
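The bound in Equation (43) relies on the Rayleigh-quotient inequality $x^TMx \ge \lambda_{\min}(M)\|x\|^2$ for a symmetric matrix $M$, applied to the quadratic terms $e_1^TQe_1$ and $u^TRu$ that also appear in the critic law of Equation (25). The sketch below evaluates both sides for the diagonal $Q$ and $R$ listed in the simulation section; the sample vectors `e1` and `u` and the helper names are hypothetical.

```python
def quad_form_diag(diag, v):
    """v^T M v for a diagonal matrix M given by its diagonal entries."""
    return sum(d * x * x for d, x in zip(diag, v))


def norm_sq(v):
    """Squared Euclidean norm."""
    return sum(x * x for x in v)


def stage_cost(e1, u, q_diag, r_diag):
    """e1^T Q e1 + u^T R u."""
    return quad_form_diag(q_diag, e1) + quad_form_diag(r_diag, u)


def rayleigh_lower_bound(e1, u, q_diag, r_diag):
    """lambda_min(Q) ||e1||^2 + lambda_min(R) ||u||^2."""
    # For a diagonal matrix the smallest eigenvalue is its smallest entry.
    return min(q_diag) * norm_sq(e1) + min(r_diag) * norm_sq(u)


Q_DIAG = (1.0, 1.0)            # Q = diag([1, 1]) from the simulation section
R_DIAG = (2.0, 1.0, 0.0, 1.0)  # R = diag([2, 1, 0, 1]) from the simulation section
e1 = [0.2, -0.4]               # hypothetical tracking error
u = [1.0, -0.5, 0.3, 0.0]      # hypothetical control input

cost = stage_cost(e1, u, Q_DIAG, R_DIAG)
bound = rayleigh_lower_bound(e1, u, Q_DIAG, R_DIAG)
```

With the listed $R$, the smallest diagonal entry is zero, so in this example only the $Q$ term contributes to the lower bound.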

Substituting Equations (41)–(43) into Equation (40), we can get that

(44) \begin{align}\dot V &\le - \bar e_1^T\!\left( t \right)A{{\bar e}_1}\!\left( t \right) - {k_2}e_2^T\!\left( t \right){e_2}\!\left( t \right) - \frac{{{\lambda _\varepsilon }}}{2}{{\tilde \varepsilon }^2}_v - \frac{{{\gamma _0}}}{{{\Gamma _r}}}r\!\left( t \right) + \frac{{\rho \!\left( t \right)}}{{{\Gamma _r}}}\!\left( {1 - 16\textrm{Tanh}^T \!\left({\frac{{e_2}\!\left( t \right)}{{{\varepsilon _\rho }}}}\right)\textrm{Tanh} \!\left({\frac{{e_2}\!\left( t \right)}{{{\varepsilon _\rho }}}}\right)} \right)\nonumber \\[5pt] &- \!\left( {\frac{{{\lambda _{{W_a}}}}}{2} - \frac{{{\lambda _{\max }}\!\left( {{\Phi _c}\Omega _{}^T{\Omega _{}}\Phi _c^T} \right)\bar \Phi _a^2}}{{4{\rho _1}}} - \frac{{{\lambda _{\max }}\!\left( {{\Phi _c}\Omega _{}^T\Omega \Phi _c^T} \right)\bar \Phi _a^2}}{{4{\rho _2}}}} \right)\textrm{Tr} \!\left({\tilde W_a^T{{\tilde W}_a}}\right)\nonumber \\[5pt] &- \!\left( {\frac{{{\lambda _{{W_c}}}}}{2} - {\rho _1} - {\rho _{\rm{3}}} - \frac{{{\lambda _{\max }}\!\left( {{{\bar \lambda }_{{W_c}}}} \right)}}{{4{\rho _{\rm{3}}}}} - {\rho _{\rm{4}}}{\lambda _{\max }}\!\left( {{\lambda _{{W_c}}}\lambda _{{W_c}}^T} \right)} \right)\tilde W_c^T{{\tilde W}_c} - {\lambda _{\min }}\{ Q\} \|{e_1}\|^2 - {\lambda _{\min }}\{ R\} \|u\|^2\\[5pt] &+ \frac{{{\lambda _\varepsilon }}}{2}\varepsilon _v^2 + \frac{1}{2}\varepsilon _d^2 + \sum\limits_{i = 1}^3 {{\varepsilon _i}} + \frac{{{\lambda _{{W_a}}}}}{2}W_a^T{W_a} + \frac{{2{\rho _2} + {\lambda _{{W_c}}}}}{2}W_c^T{W_c} + \frac{1}{{4{\rho _{\rm{4}}}}}\varepsilon _c^2\nonumber \end{align}

Defining

(45) \begin{align} &\gamma = \min \!\left\{ \begin{array}{l}2{\lambda _{\min }}\!\left( A \right),2{k_2},2{\lambda _{\min }}\!\left( Q \right),2{\lambda _{\min }}\!\left( R \right),\;{\lambda _\varepsilon },\nonumber \\[5pt] \dfrac{{{\lambda _{{W_a}}}}}{2} - \dfrac{{{\lambda _{\max }}\!\left( {{\Phi _c}\Omega _{}^T{\Omega _{}}\Phi _c^T} \right)\bar \Phi _a^2}}{{4{\rho _1}}} - \dfrac{{{\lambda _{\max }}\!\left( {{\Phi _c}\Omega _{}^T\Omega \Phi _c^T} \right)\bar \Phi _a^2}}{{4{\rho _2}}},\nonumber \\[5pt] \dfrac{{{\lambda _c}}}{2} - {\rho _1} - {\rho _{\rm{3}}} - \dfrac{{{\lambda _{\max }}\!\left( {{{\bar \lambda }_{{W_c}}}} \right)}}{{4{\rho _{\rm{3}}}}} - {\rho _{\rm{4}}}{\lambda _{\max }}\!\left( {{\lambda _{{W_c}}}\lambda _{{W_c}}^T} \right),\end{array} \right\}\nonumber \\[5pt] &{\varepsilon _f} = - \dfrac{{{\gamma _0}}}{{{\Gamma _r}}}r\!\left( t \right) + \dfrac{{{\lambda _\varepsilon }}}{2}\varepsilon _v^2 + \dfrac{1}{2}\varepsilon _d^2 + \sum\limits_{i = 1}^3 {{\varepsilon _i}} + \dfrac{{{\lambda _{{W_a}}}}}{2}W_a^T{W_a} + \dfrac{{2{\rho _2} + {\lambda _{{W_c}}}}}{2}W_c^T{W_c} + \dfrac{1}{{4{\rho _{\rm{4}}}}}\varepsilon _c^2 \end{align}

Thanks to Equation (45), we can obtain that

(46) \begin{align}\dot V \le - \gamma V + {\varepsilon _f} + \frac{{\rho \!\left( t \right)}}{{{\Gamma _r}}}\!\left( {1 - 16\textrm{Tanh}^T\!\left({\frac{{e_2}\!\left( t \right)}{{{\varepsilon _\rho }}}}\right)\textrm{Tanh}\!\left({\frac{{e_2}\!\left( t \right)}{{{\varepsilon _\rho }}}}\right)} \right)\end{align}

According to Equation (46), the signals $\left[ {{e_0}\!\left( t \right),{e_1}\!\left( t \right),{e_2}\!\left( t \right),{{\tilde \varepsilon }_v}\!\left( t \right),{{\tilde W}_a}\!\left( t \right),{{\tilde W}_c}\!\left( t \right)} \right]$ are all bounded. Therefore, the stability of the closed-loop system and the boundedness of all signals are verified. The proof is complete.
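The way Equation (46) leads to boundedness can be sketched with the comparison bound $V(t) \le V(0)e^{-\gamma t} + \frac{\varepsilon_f}{\gamma}\left(1 - e^{-\gamma t}\right)$. This bound assumes that $\varepsilon_f$ is treated as a constant upper bound and that the $\rho(t)$ term is non-positive, neither of which is stated in this form in the text. The numerical values of $\gamma$, $\varepsilon_f$ and $V(0)$ and the non-negative slack term are hypothetical.

```python
import math


def comparison_bound(v0, gamma, eps_f, t):
    """Closed-form solution of dV/dt = -gamma * V + eps_f."""
    decay = math.exp(-gamma * t)
    return v0 * decay + (eps_f / gamma) * (1.0 - decay)


def simulate_v(v0, gamma, eps_f, slack_fn, dt=1e-3, t_end=5.0):
    """Euler integration of dV/dt = -gamma * V + eps_f - slack(t), slack >= 0."""
    v, t = v0, 0.0
    samples = [(t, v)]
    for _ in range(int(round(t_end / dt))):
        v += dt * (-gamma * v + eps_f - slack_fn(t))
        t += dt
        samples.append((t, v))
    return samples


GAMMA, EPS_F, V0 = 2.0, 1.0, 5.0  # hypothetical values
samples = simulate_v(V0, GAMMA, EPS_F, lambda t: 0.3 * (1.0 + math.sin(t)))
worst_excess = max(v - comparison_bound(V0, GAMMA, EPS_F, t) for t, v in samples)
ultimate_bound = EPS_F / GAMMA
```

Under these assumptions $V$ converges to the residual set $V \le \varepsilon_f/\gamma$, so a larger $\gamma$ or a smaller $\varepsilon_f$ tightens the ultimate bound.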

5.0 Simulation study

In this section, numerical simulations are performed to demonstrate the effectiveness and performance of the proposed STDO-based adaptive reinforcement learning (ARL) control method. To show the advantages of the proposed STDO-ARL method, the proposed method without the STDO and the proposed method without reinforcement learning (RL) are also simulated for comparison, as shown in Figs. 2 and 3. In addition, the robustness of the proposed STDO-ARL method is examined by imposing several external disturbances ${d_1}\!\left( t \right)$ and uncertainties $\chi \!\left( t \right)$ of different degrees on the system, as listed in Table 1; the results are shown in Figs. 7 and 8.

Figure 2. Comparison chart of the tracking performance of $\alpha $ under different methods.

Figure 3. Comparison chart of the tracking performance of $\beta $ under different methods.

Table 1. Model parameter values in different cases

The initial values used in the simulation are as follows: ${x_1} = {\left[ {\begin{array}{c}{0.0675}\;\;\;\; { - 0.5738}\end{array}} \right]^T},{x_2} = {\left[ {\begin{array}{c}0\;\;\;\; 0\end{array}} \right]^T}$; the weights of the actor network and the critic network are initialised as ${\hat W_a} = \textrm{zeros}\!\left( {22,1} \right),{\hat W_c} = \textrm{zeros}\!\left( {11,1} \right)$; the initial estimate of the mismatched disturbance is ${\hat d_1} = {\left[ {\begin{array}{c@{\quad}c} 0&0 \end{array}} \right]^T}$; in addition, ${P_2} = {\left[ {\begin{array}{c@{\quad}c}0 & 0\end{array}} \right]^T}$ and ${\hat x_2} = {\left[ {\begin{array}{c@{\quad}c}0&0 \end{array}} \right]^T}$.

We choose the unmodeled dynamics as $\eta = 1$ and the dynamic auxiliary signal as $r = 2$. The mismatched disturbance ${d_1}\!\left( t \right)$ in the simulation is set as two different trapezoidal waves: ${d_1}\!\left( t \right) = \left[ {\begin{array}{c}{D\!\left( {t,5,3} \right)}\\[2pt] {D\!\left( {t,5,1} \right)}\end{array}} \right]$. The uncertainties that are affected by the unmodeled dynamics are supposed to be $\chi \!\left( t \right) = 0.5{x_1}\!\left( t \right){\rm{sin}}\!\left( t \right) + \eta {x_2}\!\left( t \right)$. The system matrix is selected as $B = \left[ {\begin{array}{c@{\quad}c@{\quad}c@{\quad}c}1 &0 &1& 0\\[2pt] 0& 1 &0 & 1\end{array}} \right]$ and the other control constants are set as ${{\rm{\Gamma }}_r} = 120,{\rm{\;\;}}{\varepsilon _\rho } = 0.1$. The control gains are designed as ${k_0} = 3,{\rm{\;\;}}{k_1} = {k_2} = 6$. The parameters of the STDO are designed as ${K_{{d_1}}} = 5,{\rm{\;\;}}{K_{{P_1}}} = 2,{\rm{\;\;}}{K_{{P_2}}} = 0.1$. The adaptive parameters of the actor network and the critic network are set as ${{\rm{\Gamma }}_{{W_a}}} = 0.2,{\rm{\;\;}}{\lambda _{{W_a}}} = 2.5,{\rm{\;\;\Omega }} = {\left[ {\begin{array}{c@{\quad}c}2 & 1\end{array}} \right]^T},{\rm{\;\;}}{{\rm{\Gamma }}_{{W_c}}} = 0.2,{\rm{\;\;}}{\lambda _{{W_c}}} = 2.5,{\rm{\;\;}}Q = diag\!\left( {\left[ {1,{\rm{\;\;}}1} \right]} \right),{\rm{\;\;}}R = diag\!\left( {\left[ {2,{\rm{\;}}1,{\rm{\;}}0,{\rm{\;}}1} \right]} \right)$. With these parameters, two sets of simulations are carried out: a comparison of different methods and a comparison of different cases.
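For readers who wish to reproduce the setup, the parameter values listed above can be collected in a single configuration object. The following sketch is one possible arrangement; the class name, the field names and the reading of $r = 2$ as an initial value are assumptions, and the trapezoidal wave $D(t,\cdot,\cdot)$ is omitted because its definition is not given in this section.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    """Simulation parameters as listed in the text (field names are hypothetical)."""
    k0: float = 3.0
    k1: float = 6.0
    k2: float = 6.0
    gamma_r: float = 120.0
    eps_rho: float = 0.1
    K_d1: float = 5.0
    K_P1: float = 2.0
    K_P2: float = 0.1
    gamma_Wa: float = 0.2
    lambda_Wa: float = 2.5
    gamma_Wc: float = 0.2
    lambda_Wc: float = 2.5
    omega: tuple = (2.0, 1.0)
    Q_diag: tuple = (1.0, 1.0)
    R_diag: tuple = (2.0, 1.0, 0.0, 1.0)
    B: tuple = ((1.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 1.0))
    eta: float = 1.0
    r0: float = 2.0  # listed as r = 2; read here as an initial value (assumption)
    x1_0: tuple = (0.0675, -0.5738)
    x2_0: tuple = (0.0, 0.0)
    n_actor_weights: int = 22   # W_a initialised as zeros(22, 1)
    n_critic_weights: int = 11  # W_c initialised as zeros(11, 1)


def chi(t, x1, x2, eta):
    """Uncertainty chi(t) = 0.5 * x1 * sin(t) + eta * x2, applied element-wise."""
    return [0.5 * a * math.sin(t) + eta * b for a, b in zip(x1, x2)]


params = SimParams()
chi_at_quarter_period = chi(math.pi / 2, params.x1_0, params.x2_0, params.eta)
```

Keeping the values in one frozen object makes it easy to rerun the comparison cases with a single field changed.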

5.1 Simulation comparison under different methods

This part presents a comparative simulation of the different methods. As Figs. 2 and 3 show, for the time-varying desired signal, the proposed STDO-ARL control scheme achieves satisfactory tracking for the straight air compound missile in the presence of external disturbances and unmodeled dynamics. By contrast, the proposed method without the STDO and the proposed method without RL both produce undesired tracking errors and cannot ensure tracking accuracy.

Moreover, Figs. 4, 5 and 6 show that all signals in the closed-loop control system remain bounded throughout the control process under the proposed STDO-ARL method. In summary, the proposed STDO-ARL control method achieves satisfactory control performance in the presence of unmodeled dynamics and disturbances.

Figure 4. The trajectories of the adaptive parameters of the proposed STDO-ARL scheme.

Figure 5. Disturbance estimation effect of $\hat{d}_1 \left(t\right)$ based on STDO.

Figure 6. Variation diagram of the weight norm of the reinforcement learning actor NN $\left\| {{{\hat W}_a}} \right\|$ and the critic NN $\left\| {{{\hat W}_c}} \right\|$ .

5.2 Simulation comparison under different cases

This part presents a comparative simulation of the proposed STDO-ARL control scheme under different cases, as shown in Figs. 7 and 8. The simulation results indicate that the proposed STDO-ARL method rejects trapezoidal and combined sine-cosine disturbances well, whereas slight fluctuations appear under square-wave disturbances. Moreover, the reinforcement learning structure fits the various complex uncertainty conditions effectively. To sum up, the proposed STDO-ARL method shows strong robustness and anti-disturbance ability under the different cases.

Figure 7. Comparison chart of the tracking performance of $\alpha $ under different cases.

Figure 8. Comparison chart of the tracking performance of $\beta $ under different cases.

6.0 Conclusion

In this paper, an STDO-based adaptive reinforcement learning control scheme is proposed for the straight air compound missile system with unknown aerodynamic uncertainties and unmodeled dynamics. To deal with the tracking problems of the straight air compound system, adaptive control with an actor-critic design has been investigated. To counter the control signal fluctuation caused by the mismatched disturbance of the straight air compound system, the STDO has been used. To improve the control performance, reinforcement learning and neural networks have been adopted in the actor-critic design. The simulation results show that the proposed STDO-ARL controller can guarantee the stability of the straight air compound missile system with unknown aerodynamic uncertainties and unmodeled dynamics, and they illustrate the effectiveness and robustness of the proposed approach. In the future, we will continue to follow up on this problem and consider reinforcement learning-based anti-coupling control for the straight air compound system.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant No. 11772256; the Science and Technology on Electromechanical Dynamic Control Laboratory, China, under Grant No. 6142601190210; the Foundation of National Key Laboratory of Science and Technology on Test Physics & Numerical Mathematics, China; and Research Projects KT-KTYWGL-22-22228.
