
A deep reinforcement learning-based approach to onboard trajectory generation for hypersonic vehicles

Published online by Cambridge University Press:  08 February 2023

C.Y. Bao
Affiliation:
College of Aerospace Science and Engineering, National University of Defense Technology, Changsha, 410073, China
X. Zhou
Affiliation:
College of Aerospace Science and Engineering, National University of Defense Technology, Changsha, 410073, China
P. Wang*
Affiliation:
College of Aerospace Science and Engineering, National University of Defense Technology, Changsha, 410073, China
R.Z. He
Affiliation:
College of Aerospace Science and Engineering, National University of Defense Technology, Changsha, 410073, China
G.J. Tang
Affiliation:
College of Aerospace Science and Engineering, National University of Defense Technology, Changsha, 410073, China
*Corresponding author. Email: [email protected]

Abstract

An onboard three-dimensional (3D) trajectory generation approach based on a reinforcement learning (RL) algorithm and a deep neural network (DNN) is proposed for hypersonic vehicles in the glide phase. Multiple trajectory samples are generated offline by the convex optimisation method. Deep learning (DL) is employed to pre-train the DNN, which initialises the actor network and accelerates the RL process. Based on the offline deep policy deterministic actor-critic algorithm, a flight target-oriented reward function with path constraints is designed. The actor network is optimised by end-to-end RL and the policy gradients of the critic network until the reward function converges to its maximum. The actor network then serves as the onboard trajectory generator, computing optimal control values online from the real-time motion states. The simulation results show that the single-step online planning time meets the real-time requirements of onboard trajectory generation. A significant improvement in the terminal accuracy of the online trajectory and better generalisation under biased initial states are observed for hypersonic vehicles in the glide phase.

Type
Research Article
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of Royal Aeronautical Society

Nomenclature

3D

Three-dimensional

RL

Reinforcement learning

DL

Deep learning

MDP

Markov decision process

DNN

Deep neural network

DPG

Deterministic policy gradient

ODPDAC

Offline deep policy deterministic actor-critic

$r$

geocentric distance (m)

$\lambda $

longitude (deg)

$\phi $

latitude (deg)

$V$

velocity (m/s)

$\theta $

flight-path angle (deg)

$\sigma $

heading angle (deg)

$\upsilon $

bank angle (deg)

$g$

Earth’s gravitational acceleration (m/s$^2$)

$\mu $

Earth’s gravitational constant (m$^3$/s$^2$)

${C_\sigma }$ , ${C_\theta }$

Coriolis accelerations

${\tilde C_\sigma }$ , ${\tilde C_\theta }$ , ${\tilde C_V}$

centripetal accelerations due to Earth’s rotation

${\omega _{\rm{e}}}$

angular velocity of Earth’s rotation (rad/s)

$Y$

lift (N)

$D$

drag (N)

$M$

mass of HV (kg)

${S_r}$

reference area (m$^2$)

$q$

dynamic pressure (Pa)

$\rho $

atmospheric density (kg/m$^3$)

${C_L}$

lift coefficients

${C_D}$

drag coefficients

$\alpha $

attack angle (deg)

$Ma$

Mach number

$\boldsymbol{{x}}$

motion state vector

$\dot Q$

heat rate (W/m$^2$)

$n$

overload

${k_Q}$

heat rate constant

${{\boldsymbol{{x}}}_0}$

initial states

${{\boldsymbol{{x}}}_{\boldsymbol{{f}}}}$

terminal states

$J$

objective function

${C_1}$ , ${C_2}$

weighting factors

$r_f^{}$

actual terminal geocentric distance (m)

$\lambda _f^{}$

actual terminal longitude (deg)

$\phi _f^{}$

actual terminal latitude (deg)

$r_T^{}$

preset terminal geocentric distance (m)

$\lambda _T^{}$

preset terminal longitude (deg)

$\phi _T^{}$

preset terminal latitude (deg)

${\alpha _{\max }}$

maximum attack angle (deg)

${\alpha _{\max L/D}}$

maximum lift-to-drag ratio attack angle (deg)

${V_1}$ , ${V_2}$

velocity constants (m/s)

$a$

control action

${\upsilon _{\max }}$

maximum amplitude of bank angle (deg)

${\dot \upsilon _{\max }}$

maximum rate of bank angle (deg/s)

S

states space

A

actions space

P

dynamic state transition function

R

reward function

$\gamma $

discount factor

${\mathop{\rm H}\nolimits}\!\left( x \right)$

Heaviside function

$\dot \theta $

change rate of flight-path angle (deg/s)

${S_r}$

normalised constants of flight range (m)

${h_r}$

normalised constants of altitude (m)

$\Delta {h_f}$

terminal altitude error (m)

$\Delta {S_f}$

terminal landing error (m)

${p_1},{p_2},{p_3},{p_4},{p_5}$

constants

${V_t}$

terminal velocity (m/s)

${G_t}$

reward-to-go starting at time $t$

$Q\!\left( {s,a\left| {{\theta _Q}} \right.} \right)$

action-value function

$s'$

next step state

$\mu \!\left( {s\left| {{\theta _\mu }} \right.} \right)$

policy function

${\theta _\mu }$

the network parameters of Actor network

${\theta _Q}$

the network parameters of Critic network

$Q'\!\left( {s,a\left| {{\theta ^{Q^{\prime}}}} \right.} \right)$

target Critic network

$\mu^{\prime}\!\left( {s\left| {{\theta ^{\mu^{\prime}}}} \right.} \right)$

target Actor network

${\theta _{\mu^{\prime}}}$

the network parameters of target Actor network

${\theta _{Q^{\prime}}}$

the network parameters of target Critic network

${L_s}$

mean square error

${N_b}$

number of batch samples

$\delta _{}^{TD}$

temporal difference error

$\Delta \lambda $

per deviation of longitude (deg)

$\Delta \phi $

per deviation of latitude (deg)

$\Delta h$

per deviation of altitude (m)

${s_I}$

actual flight states

$\bar s$

arithmetic mean of the input samples

$\bar \upsilon $

arithmetic mean of the output samples

$\Delta s$

maximum difference of the input samples

$\Delta \upsilon $

maximum difference of the output samples

$L$

loss function

${L_2}$

regularised loss

$\Delta h_0^{}$

initial deviations of altitude (m)

$\Delta \lambda _0^{}$

initial deviations of longitude (deg)

$\Delta \phi _0^{}$

initial deviations of latitude (deg)

$N\!\left( {\mu ,{\sigma ^2}} \right)$

normal distribution

$\mu $

mean value

$\sigma $

standard deviation

CEP

radius of a circle centred on the target point (m)

1.0 Introduction

The goal of the reentry glide phase of a hypersonic vehicle is to accurately guide the vehicle from its current position to the predetermined terminal area energy management (TAEM) interface under multiple constraints, strong nonlinearities and uncertainty [Reference Bao, Wang and Tang1]. However, this is difficult to achieve because of flight uncertainty and external disturbances in the glide phase [Reference Bao, Wang and Tang2–Reference He, Liu, Tang and Bao4]. Therefore, it is critical to develop an efficient, stable and computationally light onboard approach that generates optimal trajectories with greater autonomy and better generalisation. The trajectory generation problem is regarded as an optimal control problem, which is traditionally solved by indirect or direct methods. The indirect method transforms the trajectory optimisation problem into a Hamiltonian two-point boundary value problem in the state and costate variables [Reference Wei, Han, Pu, Li and Huang5]. However, its strict requirements on the initial values make the convergence of the numerical iteration uncertain. In contrast, the direct method does not rely on the necessary conditions for the optimal solution but parameterises the optimal control problem and approximates it as a nonlinear programming problem [Reference Dancila and Botez6]. Common direct methods include the pseudospectral method [Reference Chai, Tsourdos, Savvaris, Chai and Xia7, Reference Rizvi, Linshu, Dajun and Shah8] and the convex optimisation method [Reference Kwon, Jung, Cheon and Bang9–Reference Sagliano, Heidecker, Macés Hernández, Farì, Schlotterer, Woicke, Seelbinder and Dumont11].

The development of artificial intelligence algorithms represented by reinforcement learning (RL) and deep learning (DL) has offered new routes to intelligent flight control technologies [Reference Shirobokov, Trofimov and Ovchinnikov12]. Deep neural networks (DNNs) can theoretically approximate any nonlinear system [Reference Schmidhuber13, Reference Basturk and Cetek14], and they have been investigated for accurately mapping inputs to outputs in optimal control and in the fundamentals of optimisation models [Reference Nie, Li and Zhang15]. Yang et al. [Reference Shi and Wang16] reported a DL-based method for onboard optimal 2D trajectory generation of hypersonic vehicles, in which the DNN captured the functional relationship between the flight state and the optimal action of the reentry trajectory. Although a DNN is an attractive option for onboard trajectory generation, its dependence on samples means that the trajectories it generates cannot surpass the original samples or improve on them [Reference Sánchez and Izzo17, Reference Cheng, Wang, Jiang and Li18].

As one of the core techniques of artificial intelligence, RL is considered a more attractive option for onboard trajectory generation, since it gives the agent self-supervised learning capabilities through trial-and-error mechanisms and an exploration-exploitation balance [Reference Tenenbaum, Kemp, Griffiths and Goodman19–Reference Han, Zheng, Liu, Wang, Cheng, Fan and Wang21]. Although RL algorithms have high computational costs during training, they show much lower computational costs than optimal control methods when deployed [Reference Shi, Zhao, Wang and Jin22]. Gaudet et al. applied RL to develop a new guidance system [Reference Gaudet, Linares and Furfaro23, Reference Gaudet, Linares and Furfaro24] and an adaptive integrated guidance, navigation and control system [Reference Gaudet, Linares and Furfaro25] by reinforcement meta-learning for Mars and asteroid landing. Conceptually similar works on the design of autonomous closed-loop guidance have also been carried out in Refs (Reference Zavoli and Federici26–Reference Xu and Chen29). Most prior RL-based trajectory planning research is aimed at spacecraft, while hypersonic vehicles remain poorly discussed [Reference Zhou, Zhou, Chen and Shi30].

RL essentially solves a sequential decision problem, while onboard trajectory generation requires optimal control commands in real time according to the current state. RL is therefore an effective approach to online trajectory generation in principle, but two difficulties must be resolved:

  (1) How can an appropriate RL algorithm be designed for high-dimensional continuous state space problems such as trajectory generation? Recently, investigators have highlighted the effects of deep RL. The DeepMind team implemented the deep Q-learning algorithm by using DNNs to approximate the action and state value functions, which handles RL problems in high-dimensional continuous state spaces [Reference Mnih, Kavukcuoglu, Silver and Rusu31]. Concerning the RL algorithm, Silver et al. [Reference Silver, Lever and Heess32] presented the deterministic policy gradient (DPG) algorithm and showed that it performs more effectively than the stochastic policy gradient algorithm in high-dimensional continuous action spaces. Combining the advantages of the DNN and the policy gradient, an offline deep policy deterministic actor-critic (ODPDAC) algorithm is constructed and discussed in this paper to address trajectory generation for hypersonic vehicles in the glide phase.

  (2) How can the RL actor network be reasonably initialised to accelerate learning and convergence? In theory, RL can reach the optimal policy through interaction with the environment without any prior information. However, if the initial parameters of the network differ greatly from the final optimal parameters, the training difficulty and time increase significantly, and training may even fail. The first-generation AlphaGo in 2016 addressed this problem by obtaining a pre-trained network for the subsequent RL actor network initialisation through supervised learning of human game data [Reference Silver, Huang, Maddison, Guez, Sifre, van den Driessche, Schrittwieser, Antonoglou, Panneershelvam, Lanctot, Dieleman, Grewe, Nham, Kalchbrenner, Sutskever, Lillicrap, Leach, Kavukcuoglu, Graepel and Hassabis33], and this initialisation method is referred to here.

Based on the above analysis, this paper proposes a novel onboard trajectory generation approach for hypersonic vehicles with better generalisation and autonomy using an RL algorithm, as shown in Fig. 1. Initially, a large number of glide trajectory samples are generated by the convex optimisation method. Then, the trajectory samples are used for supervised learning to pre-train the initial RL actor network. Based on the ODPDAC algorithm, the actor network is trained end to end and optimised to directly learn the mapping between the states and the optimal action. Finally, the resulting RL actor network is tested in gliding flight to realise high-precision online trajectory generation.

Figure 1. Configuration of proposed RL-based trajectory generation method.

The originality of this paper is described in the following three aspects:

  (1) It is the first report on the design of a real-time online 3D trajectory generation method for hypersonic vehicles during the gliding phase based upon end-to-end RL.

  (2) Initialisation of the RL actor network by DL improves the stability and convergence of training while also promoting the success of policy optimisation.

  (3) The 3D trajectory generator optimised by RL reveals strong validity and generalisation for onboard real-time trajectory generation.

Section 2 describes the reentry trajectory generation problem and models for the glide phase of the hypersonic vehicle. Section 3 provides the design of the RL algorithm. The simulations are then carried out in Section 4 and some discussions are presented in Section 5. Finally, the main work of this paper is summarised in the last section.

2.0 Problem formulation and generation model

2.1 Reentry dynamics

The motion equations of the three-dimensional unpowered flight for hypersonic vehicles over a spherical, rotating Earth are expressed as

(1) \begin{align} \left\{ \begin{array}{l} \dfrac{{\rm{d}}r}{{\rm{d}}t} = V\sin \theta \\[10pt] \dfrac{{\rm{d}}\lambda }{{\rm{d}}t} = - \dfrac{V\cos \theta \sin \sigma }{r\cos \phi } \\[10pt] \dfrac{{\rm{d}}\phi }{{\rm{d}}t} = \dfrac{V\cos \theta \cos \sigma }{r} \\[10pt] \dfrac{{\rm{d}}V}{{\rm{d}}t} = - D - g\sin \theta + {{\tilde C}_V} \\[10pt] \dfrac{{\rm{d}}\theta }{{\rm{d}}t} = \dfrac{Y\cos \upsilon }{V} - \dfrac{g\cos \theta }{V} + \dfrac{V\cos \theta }{r} + {C_\theta } + {{\tilde C}_\theta } \\[10pt] \dfrac{{\rm{d}}\sigma}{{\rm{d}}t} = - \dfrac{Y\sin \upsilon }{V\cos \theta } + \dfrac{V\tan \phi \cos \theta \sin \sigma}{r} + {C_\sigma } + {{\tilde C}_\sigma } \end{array} \right.\end{align}

In Equation (1), $t$ is the time, $r$ is the geocentric distance, $\lambda $ and $\phi $ are the longitude and latitude, respectively. $V$ is the velocity, $\theta $ is the flight-path angle, and $\sigma $ is the heading angle measured clockwise from the north. $\upsilon $ is the bank angle. $g = \mu /{r^2}$ is the Earth’s gravitational acceleration, where $\mu $ is the Earth’s gravitational constant. The Coriolis accelerations ${C_\sigma }$ and ${C_\theta }$ , and the centripetal accelerations ${\tilde C_\sigma }$ , ${\tilde C_\theta }$ and ${\tilde C_V}$ caused by the Earth’s rotation are given by

(2) \begin{align} \left\{ \begin{array}{l} {C_\sigma } = - 2{\omega _{\rm{e}}}(\!\sin \phi - \cos \sigma \tan \theta \cos \phi ) \\[5pt] {{\tilde C}_\sigma } = \dfrac{\omega _{\rm{e}}^2r\cos \phi \sin \phi \sin \sigma }{V\cos \theta } \\[12pt] {C_\theta } = 2{\omega _{\rm{e}}}\sin \sigma \cos \phi \\[5pt] {{\tilde C}_\theta } = \dfrac{\omega _{\rm{e}}^2r}{V}\cos \phi (\!\sin \phi \cos \sigma \sin \theta + \cos \phi \cos \theta ) \\[10pt] {{\tilde C}_V} = \omega _{\rm{e}}^2r({\cos ^2}\phi \sin \theta - \cos \phi \sin \phi \cos \sigma \cos \theta ) \end{array} \right.\end{align}

where ${\omega _{\rm{e}}}$ is the angular velocity of the Earth’s rotation. $Y$ and $D$ are the lift and drag accelerations, given by

(3) \begin{align} \left\{ \begin{array}{l} Y = {{q{S_r}{C_Y}} / M} \\[5pt] D = {{q{S_r}{C_D}} / M} \end{array} \right.\end{align}

where $M$ is the mass of a hypersonic vehicle, ${S_r}$ is the reference area. ${C_Y}$ and ${C_D}$ are the lift and drag coefficients respectively, and both are functions of the angle-of-attack $\alpha $ and Mach number $Ma$ . $q$ is the dynamic pressure calculated by $q = 0.5\rho {V^2}$ . The atmospheric density $\rho $ is modeled as

(4) \begin{align} \rho \!\left( r \right){\rm{ }} = {\rho _0}{{\mathop{\rm e}\nolimits} ^{ - h/{h_s}}}\end{align}

where ${\rho _0} = 1.225\;{\rm{kg/}}{{\rm{m}}^{\rm{3}}}$ , ${h_s}$ is the scale height constant of 7110 m, and $h$ is the altitude.
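As an illustration of how Equations (1) to (4) fit together, the following minimal sketch evaluates the right-hand side of the equations of motion for one state. It is not the authors' implementation. The constants `MU`, `OMEGA_E` and `R_EARTH` are standard textbook values assumed here (they are not given in the text), the mass and reference area are the CAV-H values quoted in Section 4.1.1, the aerodynamic coefficients are passed in as plain arguments because the aerodynamic tables are not reproduced, and all angles are assumed to be in radians.

```python
import math

# Assumed constants (not stated in the text): standard Earth values.
MU = 3.986004418e14      # Earth's gravitational constant (m^3/s^2)
OMEGA_E = 7.292115e-5    # Earth's rotation rate (rad/s)
R_EARTH = 6.371e6        # mean Earth radius (m), so that h = r - R_EARTH
# Values quoted in the text.
RHO_0, H_S = 1.225, 7110.0   # Equation (4)
MASS, S_REF = 907.0, 0.48    # CAV-H mass (kg) and reference area (m^2)


def density(r):
    """Exponential atmosphere of Equation (4)."""
    return RHO_0 * math.exp(-(r - R_EARTH) / H_S)


def glide_dynamics(x, bank, c_y, c_d):
    """Right-hand side of Equation (1); angles in radians (assumption)."""
    r, lam, phi, v, theta, sigma = x
    g = MU / r**2
    q = 0.5 * density(r) * v**2
    lift = q * S_REF * c_y / MASS          # Equation (3), acceleration
    drag = q * S_REF * c_d / MASS
    sp, cp = math.sin(phi), math.cos(phi)
    ss, cs = math.sin(sigma), math.cos(sigma)
    st, ct = math.sin(theta), math.cos(theta)
    # Earth-rotation terms of Equation (2)
    c_sigma = -2.0 * OMEGA_E * (sp - cs * math.tan(theta) * cp)
    ct_sigma = OMEGA_E**2 * r * cp * sp * ss / (v * ct)
    c_theta = 2.0 * OMEGA_E * ss * cp
    ct_theta = OMEGA_E**2 * r / v * cp * (sp * cs * st + cp * ct)
    ct_v = OMEGA_E**2 * r * (cp**2 * st - cp * sp * cs * ct)
    return [
        v * st,
        -v * ct * ss / (r * cp),
        v * ct * cs / r,
        -drag - g * st + ct_v,
        lift * math.cos(bank) / v - g * ct / v + v * ct / r + c_theta + ct_theta,
        -lift * math.sin(bank) / (v * ct) + v * math.tan(phi) * ct * ss / r
        + c_sigma + ct_sigma,
    ]
```

A state derivative obtained this way could be passed to any fixed-step integrator; the choice of integrator and step size is not specified in this section.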

2.2 Reentry trajectory generation models

Equation (1) is expressed as

(5) \begin{align} \dot{\boldsymbol{{x}}} = f\!\left( {{\boldsymbol{{x}}},t} \right)\end{align}

where ${\boldsymbol{{x}}}$ denotes the motion state vector of $[\begin{matrix} r & \lambda & \phi & V & \theta & \sigma \end{matrix} ]^{\rm{T}}$ . The path constraints of heat rate $\dot Q$ , dynamic pressure $q$ , and overload $n$ are considered to prevent aerodynamic thermal ablation and structural damage to the vehicle as follows

(6) \begin{align} \left\{ \begin{array}{c} {\dot Q = {k_Q}\sqrt \rho {V^3} \le {{\dot Q}_{\max }}} \\[5pt] {q \le {q_{\max }}} \\[5pt] {n = Y\cos \alpha + D\sin \alpha \le {n_{\max }}} \end{array} \right.\end{align}

where ${k_Q}$ is a heat rate constant related to the vehicle shape. The quasi-equilibrium glide condition is imposed as a soft constraint by setting the flight-path angle rate to approximately zero and neglecting the Earth’s rotation:

(7) \begin{align} Y\cos \upsilon + {{{V^2}} \over r} - g = 0\end{align}
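The path constraints of Equation (6) and the quasi-equilibrium glide condition of Equation (7) can be checked numerically as in the sketch below. The function names are hypothetical, the value of `k_q` must be supplied by the user (it is not given here), and the helper `qegc_bank_angle` is an added convenience, not part of the paper, that solves Equation (7) for the bank-angle magnitude.

```python
import math


def path_constraints(rho, v, lift, drag, alpha, k_q):
    """Return (heat rate, dynamic pressure, overload) as in Equation (6).

    lift and drag are accelerations (Equation (3)); alpha is in radians.
    k_q is the heat rate constant; its value is an input, not from the text.
    """
    heat_rate = k_q * math.sqrt(rho) * v**3
    dyn_pressure = 0.5 * rho * v**2
    overload = lift * math.cos(alpha) + drag * math.sin(alpha)
    return heat_rate, dyn_pressure, overload


def qegc_residual(lift, bank, v, r, g):
    """Left-hand side of Equation (7); zero on the quasi-equilibrium glide."""
    return lift * math.cos(bank) + v**2 / r - g


def qegc_bank_angle(lift, v, r, g):
    """Bank-angle magnitude (rad) that satisfies Equation (7), if one exists.

    The cosine is clipped to [-1, 1], so the result is only exact when the
    available lift is large enough to balance gravity minus centrifugal relief.
    """
    cos_bank = (g - v**2 / r) / lift
    return math.acos(max(-1.0, min(1.0, cos_bank)))
```

For example, with `lift = 8.0`, `v = 6000.0`, `r = 6.42e6` and `g = 9.5`, the bank angle returned by `qegc_bank_angle` drives `qegc_residual` to zero within rounding error.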

The boundary constraints include initial and terminal states ${{\boldsymbol{{x}}}_0}$ and ${{\boldsymbol{{x}}}_{\boldsymbol{{f}}}}$ .

(8) \begin{align} \left\{ \begin{array}{l} {{\boldsymbol{{x}}}\!\left( {{t_0}} \right) = {{\boldsymbol{{x}}}_0}} \\[5pt] {{\boldsymbol{{x}}}\!\left( {{t_f}} \right) = {{\boldsymbol{{x}}}_{\boldsymbol{{f}}}}} \end{array} \right.\end{align}

The objective is to minimise the relative distance between the actual terminal point of the vehicle and the preset target point, which is expressed with the objective function $J$ as

(9) \begin{align} J = {C_1}\!\left( {\left| {{\lambda _f} - \lambda _T^{}} \right| + \left| {{\phi _f} - \phi _T^{}} \right|} \right){\rm{ + }}{C_2}\left| {{r_f} - r_T^{}} \right|\end{align}

where $r_f^{}$ , $\lambda _f^{}$ and $\phi _f^{}$ are the actual terminal geocentric distance, longitude and latitude of the vehicle, and $r_T^{}$ , $\lambda _T^{}$ and $\phi _T^{}$ are the preset target values of the corresponding variables, respectively. ${C_1}$ and ${C_2}$ are the weighting factors. A fixed angle-of-attack scheme with a piecewise linear function is given by

(10) \begin{align} \alpha = \left\{ \begin{array}{l@{\quad}l} {\alpha _{\max }} & V \gt {V_1} \\[5pt] {\alpha _{\max }} + {{\left( {{\alpha _{\max Y/D}} - {\alpha _{\max }}} \right) \!\cdot \!\left( {V - {V_1}} \right)} / {\left( {{V_2} - {V_1}} \right)}} & {V_2} \le V \le {V_1}\\[5pt] {\alpha _{\max Y/D}} & V \lt {V_2} \end{array} \right.\end{align}

where ${\alpha _{\max }}$ and ${\alpha _{\max Y/D}}$ are the maximum angle-of-attack and the angle-of-attack at the maximum lift-to-drag ratio, respectively. ${V_1} = 5000 $ m/s and ${V_2} = 3500$ m/s are the segmentation velocities. The bank angle is the only control variable, with magnitude and rate constraints as follows

(11) \begin{align} \left\{ \begin{array}{l} {\upsilon _{\min }} \le |\upsilon | \le {\upsilon _{\max }} \\[5pt] |\dot \upsilon | \le {{\dot \upsilon }_{\max }} \end{array} \right.\end{align}
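The angle-of-attack schedule of Equation (10) can be sketched as a small function. The middle branch is taken here as a linear interpolation in $V$ between ${V_1}$ and ${V_2}$, which is an assumption about the intended "piecewise linear" form. The default `alpha_ld = 10.0` degrees is the maximum lift-to-drag angle-of-attack quoted for CAV-H in Section 4.1.1, whereas `alpha_max = 20.0` degrees is a purely hypothetical placeholder.

```python
def angle_of_attack(v, alpha_max=20.0, alpha_ld=10.0, v1=5000.0, v2=3500.0):
    """Piecewise linear angle-of-attack profile (deg) versus velocity (m/s).

    alpha_max is a hypothetical value; alpha_ld, v1 and v2 follow the text.
    """
    if v > v1:
        return alpha_max
    if v < v2:
        return alpha_ld
    # Linear blend: alpha_max at v = v1, alpha_ld at v = v2 (assumed form).
    return alpha_max + (alpha_ld - alpha_max) * (v - v1) / (v2 - v1)
```

With these defaults the profile is continuous at both segmentation velocities, for example `angle_of_attack(4250.0)` lies midway between the two plateau values.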

In summary, the glide-phase trajectory generation problem for the hypersonic vehicle is stated as:

(12) \begin{align} &{P_0}\;:\min \; J \nonumber \\[5pt] &{s.t.}\quad {\dot{\boldsymbol{{x}}} = f\!\left( {{\boldsymbol{{x}}},t} \right)} \nonumber\\[5pt] & \qquad {{\upsilon _{\min }} \le |\upsilon | \le {\upsilon _{\max }}} \\[5pt] & \qquad {\left| {\dot \upsilon } \right| \le {{\dot \upsilon }_{\max }}} \nonumber\\[5pt] & \qquad {\dot Q \le {{\dot Q}_{\max }},q \le {q_{\max }},n \le {n_{\max }}} \nonumber\\[5pt] & \qquad {{\boldsymbol{{x}}}\!\left( {{t_0}} \right) = {{\boldsymbol{{x}}}_0},{\boldsymbol{{x}}}\!\left( {{t_f}} \right) = {{\boldsymbol{{x}}}_{\boldsymbol{{f}}}}} \nonumber \end{align}

3.0 RL-based trajectory generation method

3.1 Markov decision process (MDP) modeling

The mathematical model of RL is usually described by an MDP composed of five elements (S, A, P, R, $\gamma $ ), where S and A are the state space and action space of the agent, respectively, P is the state transition function of the environment, R is the reward function and $\gamma $ is the discount factor. The agent is the system composed of the hypersonic vehicle, the action policy $\mu $ and the action-value function $Q$ . The environment is the system dynamics. The motion state vector in Equation (1) is selected as the state S

(13) \begin{align} {s_I} = \left[ \begin{array}{c@{\quad}c@{\quad}c@{\quad}c@{\quad}c@{\quad}c} r & \lambda & \phi & V & \theta & \sigma \end{array} \right]^T\end{align}

and bank angle $\upsilon $ is the action A as

(14) \begin{align} a = \upsilon \end{align}

The state transition probability P is 1 because the transition is determined by the physical system. To minimise the terminal position error of trajectory generation under the complex constraints of the glide phase, the reward function is set as

(15) \begin{align} R = \left\{ \begin{array}{l} - {p_1}\left[ {{\mathop{\rm H}\nolimits} (n_y^{} - n_{y\max }^{}) + {\mathop{\rm H}\nolimits} (\dot Q - \dot Q_{\max }^{}) + {\mathop{\rm H}\nolimits} (q - q_{\max }^{})} \right] - {p_2}\left| {\dot \theta } \right|,\quad \textrm{if}\; V \gt {V_t}\, \\[5pt] {p_3} \cdot \left[ {{p_4}{\mathop{\rm e}\nolimits} _{}^{ - \Delta {S_f}/{S_r}} + {p_5}{\mathop{\rm e}\nolimits} _{}^{ - \Delta {h_f}/{h_r}}} \right]\; \textrm{if}\; V = {V_t}, \end{array} \right.\end{align}

where ${p_i}\!\left( {i = 1,2, \cdots 5} \right)$ are constants. ${\mathop{\rm H}\nolimits} \!\left( x \right)$ is the Heaviside step function as follows

(16) \begin{align} \textrm{H} \!\left( x \right) = \left\{ \begin{array}{c@{\quad}c} 0 & x \lt 0 \\[5pt] 0.5 & x = 0 \\[5pt] 1 & x \gt 0 \end{array} \right. \end{align}

In Equation (15), $\dot \theta $ is the flight-path angle rate; a larger $\left| {\dot \theta } \right|$ indicates more intense altitude oscillations and incurs a larger punishment (negative reward). $\Delta {h_f}$ is the terminal altitude error defined by $\Delta {h_f} = \left| {{h_f} - {h_T}} \right|$ . $\Delta {S_f}$ is the terminal position error defined by [Reference Zhou, Zhang, Xie, Tang and Bao34]

(17) \begin{align} \Delta {S_f} = {r_f}{\rm{acos}}\!\left( {{\rm{sin}}{\phi _T}{\rm{sin}}{\phi _f} + {\rm{cos}}{\phi _T}{\rm{cos}}{\phi _f}{\rm{cos}}\!\left( {{\lambda _T} - {\lambda _f}} \right)} \right)\end{align}

where ${r_f}$ is the terminal geocentric distance, ${S_r}$ and ${h_r}$ are the normalisation constants in the reward function, and ${V_t}$ is the terminal velocity. According to Equation (15), if $V \gt {V_t}$ , a punishment $ - {p_1}$ is assigned for each path constraint (heat rate, dynamic pressure or overload) that the motion states exceed, and a path punishment $ - {p_2}\left| {\dot \theta } \right|$ is assigned according to the severity of the oscillation. When $V = {V_t}$ , trajectory generation stops and a terminal reward is determined by the terminal position and altitude errors. Equation (15) shows that smaller terminal position and altitude errors lead to greater rewards. The desired trajectory, guided by the reward function in Equation (15), is the gliding trajectory that satisfies the path constraints with the smallest terminal position and altitude errors. The discount factor $\gamma \in \left[ {0,{\rm{ }}1} \right]$ is a real value that weights the future rewards of the agent relative to the immediate reward. If $\gamma = 0$ , the agent cares about the immediate reward only. If $\gamma = 1$ , the agent weights all future rewards equally. Ideally, the agent should consider the rewards of as many steps as possible, but too large a $\gamma $ may make convergence difficult. The design principle is therefore to make $\gamma $ as large as possible while still allowing convergence. Considering the large number of decision steps in gliding trajectory generation, $\gamma $ should be relatively large so that more decision steps are taken into account. In summary, the MDP model of gliding trajectory generation for hypersonic vehicles is described as

(18) \begin{align} \left\{ \begin{array}{l} {S = {{\left[ \begin{matrix} r\;\;\; & \lambda\;\;\; & \phi\;\;\; & V\;\;\; & \theta\;\;\; & \sigma \end{matrix} \right]}^T}} \\[5pt] {A = \upsilon } \\[5pt] {P = 1} \\ {R = \left\{ \begin{array}{l} - {p_1}\left[ {{\mathop{\rm H}\nolimits} (n_y^{} - n_{y\max }^{}) + {\mathop{\rm H}\nolimits} (\dot Q - \dot Q_{\max }^{}) + {\mathop{\rm H}\nolimits} (q - q_{\max }^{})} \right] - {p_2}\left| {\dot \theta } \right|,\, \, \, \, \, \, \, \, \, \, \, \textrm{if}\, \, \, \, V \gt {V_t}\, \\[5pt] {p_3} \cdot \left[ {{p_4}{\mathop{\rm e}\nolimits} _{}^{ - \Delta {S_f}/{S_r}} + {p_5}{\mathop{\rm e}\nolimits} _{}^{ - \Delta {h_f}/{h_r}}} \right]\, ,\, \, \, \, \, \, \, \, \, \, \, \textrm{if}\, \, \, \, V = {V_t}, \end{array} \right.} \\[5pt] \gamma \end{array} \right.\end{align}
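The reward of Equation (15), together with the Heaviside function of Equation (16) and the terminal position error of Equation (17), can be sketched as follows. This is an illustrative reading rather than the authors' code: the function names are hypothetical, the values of ${p_1}$ to ${p_5}$ and of the normalisation constants are left as arguments because they are not listed in this section, and angles are assumed to be in radians.

```python
import math


def heaviside(x):
    """Heaviside step function of Equation (16)."""
    if x < 0:
        return 0.0
    return 0.5 if x == 0 else 1.0


def terminal_position_error(r_f, lam_f, phi_f, lam_t, phi_t):
    """Great-circle distance of Equation (17); angles in radians."""
    c = (math.sin(phi_t) * math.sin(phi_f)
         + math.cos(phi_t) * math.cos(phi_f) * math.cos(lam_t - lam_f))
    # Clip to the domain of acos to guard against rounding slightly above 1.
    return r_f * math.acos(max(-1.0, min(1.0, c)))


def step_reward(n, q_dot, q, theta_dot, limits, p1, p2):
    """First branch of Equation (15), used while V > V_t."""
    n_max, q_dot_max, q_max = limits
    violations = (heaviside(n - n_max) + heaviside(q_dot - q_dot_max)
                  + heaviside(q - q_max))
    return -p1 * violations - p2 * abs(theta_dot)


def terminal_reward(ds_f, dh_f, s_r, h_r, p3, p4, p5):
    """Second branch of Equation (15), used once V reaches V_t."""
    return p3 * (p4 * math.exp(-ds_f / s_r) + p5 * math.exp(-dh_f / h_r))
```

In a simulation loop the exact condition $V = {V_t}$ is rarely hit on a discrete time grid, so a practical implementation would likely switch to `terminal_reward` on the first step at which $V \le {V_t}$; that switching rule is an assumption and is not stated in the text.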

3.2 RL algorithm for trajectory generation

3.2.1 RL overview

During RL, the agent’s goal is to learn a policy that maximises its total expected reward, which is defined as [Reference Tenenbaum, Kemp, Griffiths and Goodman19]

(19) \begin{align} {G_t} = \sum\limits_{k = 0}^\infty {{\gamma ^k}{R_{t + k + 1}}} \end{align}

where ${G_t}$ is the reward-to-go starting at the time $t$ . The action-value function $Q$ is defined as [Reference Tsitsiklis20]

(20) \begin{align} {Q_\mu }\!\left( {s,a} \right) = {E_\mu }\left[ {{G_t}\left| {s,a} \right.} \right]\end{align}
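For a finite episode, the reward-to-go of Equation (19) can be accumulated backwards through the recursion ${G_t} = {R_{t + 1}} + \gamma {G_{t + 1}}$. The sketch below assumes the episode rewards are stored in a list whose entry `t` holds ${R_{t + 1}}$; the function name is hypothetical.

```python
def rewards_to_go(rewards, gamma):
    """Discounted reward-to-go G_t for every step of a finite episode.

    rewards[t] is taken to be R_{t+1}, the reward received after step t.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Backward recursion G_t = R_{t+1} + gamma * G_{t+1}, with G = 0 after the end.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With `gamma = 0` the result equals the immediate rewards, and with `gamma = 1` it is the plain sum of all remaining rewards, which mirrors the two limiting cases discussed in Section 3.1.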

The DPG algorithm has been developed for high-dimensional continuous space problems such as trajectory generation. DPG maps the state $s$ to a deterministic action $a$ by expressing the policy as a policy function ${\mu _w}\!\left( s \right)$ ( $w$ is the policy parameter). When the policy ${\mu _w}\!\left( s \right)$ is deterministic, the Bellman equation is used to calculate the action-value function $Q$ as [Reference Tsitsiklis20]

(21) \begin{align} {Q_\mu }\!\left( {s,a} \right) = {E_{s' \sim E}}\left[ {R\!\left( {s,a} \right) + \gamma {Q_\mu }\!\left( {s',\mu \!\left( {s'} \right)} \right)} \right]\end{align}

where $s'$ is the next state.

Theorem. DPG theorem [Reference Silver, Lever and Heess32].

In an MDP model, assume that $\varpi \!\left( s \right)$ , $P\!\left( {s'|s,a} \right)$ , ${\nabla _a}P\!\left( {s'|s,a} \right)$ , ${\mu _w}\!\left( s \right)$ , ${\nabla _w}{\mu _w}\!\left( s \right)$ , $R\!\left( {s,a} \right)$ , ${\nabla _a}R\!\left( {s,a} \right)$ and $\rho \!\left( s \right)$ exist and are continuous functions of $s$ , $s'$ , $a$ and $w$ , where $\rho \!\left( s \right)$ denotes the initial state probability distribution and $P\!\left( {s'|s,a} \right)$ denotes the state transition probability. These conditions ensure that ${\nabla _w}{\mu _w}\!\left( s \right)$ and ${\nabla _a}Q\!\left( {s,a} \right)$ exist. Then the DPG exists and is given by

(22) \begin{align} {\nabla _w}J\!\left( {{\mu _w}} \right) &= \int_s {{\varpi _\mu }\!\left( s \right)} {\nabla _w}{\mu _w}\!\left( s \right){\nabla _a}Q\!\left( {s,a} \right)\left| {_{a = {\mu _w}\!\left( s \right)}} \right.{\mathop{\rm ds}\nolimits} \nonumber \\[5pt] & = {E_{s \sim {\varpi _\mu }}}\left[ {{\nabla _w}{\mu _w}\!\left( s \right){\nabla _a}Q\!\left( {s,a} \right)\left| {_{a = {\mu _w}\!\left( s \right)}} \right.} \right] \end{align}

where $J$ is the objective function of RL. To explore other regions of the state space during the RL process, an offline policy gradient method is implemented, in which the action policy is stochastic while the evaluation policy is deterministic. The DPG algorithm follows the actor-critic learning framework, where the actor outputs the policy $\mu $ for the input state $s$ based on the deterministic policy and the critic evaluates $Q$ . The critic uses a differentiable approximation function to estimate $Q$ , and the actor updates the policy parameters $w$ along the gradient of $Q$ .

3.2.2 ODPDAC algorithm

Considering their advantages in high-dimensional, continuous nonlinear spaces, DNNs are used to approximate the action-value function (critic network) and the policy function (actor network) for end-to-end learning. Let ${w^\mu }$ and ${w^Q}$ denote the parameters of the actor network $\mu $ and the critic network $Q$ , respectively. Based on Equation (22), the optimal policy function is iteratively solved by [Reference Silver, Lever and Heess32]

(23) \begin{align} {\nabla _{{w^\mu }}}J & \approx {E_\mu }\left[ {{\nabla _{{w^\mu }}}Q\!\left( {s,a\left| {{w^Q}} \right.} \right)\left| {_{s,a = \mu \!\left( {s\left| {{w^\mu }} \right.} \right)}} \right.} \right] \nonumber \\[5pt] & = {E_\mu }\left[ {{\nabla _a}Q\!\left( {s,a\left| {{w^Q}} \right.} \right)\left| {_{s,a = \mu \!\left( s \right)}} \right.{\nabla _{{w^\mu }}}\mu \!\left( {s\left| {{w^\mu }} \right.} \right)\left| s \right.} \right] \end{align}

However, because consecutive RL samples are correlated through the Markov property, the prerequisite assumption of independent and identically distributed samples does not hold and the training of the networks is unstable. Experience replay is introduced to store the data accumulated by the agent, while it continuously interacts with the environment, in a memory B. The data are stored in units of time steps as tuples $\left( {{s_i},{a_i},{R_i},{{s'}_i}} \right)$ . To update the parameters of the neural networks, the data are drawn from B by uniform random sampling. Because the samples are randomly selected, experience replay breaks the correlation between the data, which improves stability and convergence during training. Combining the above analysis, ODPDAC is shown in Fig. 2.

Figure 2. Flow chart of ODPDAC.
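The experience replay memory B described above can be sketched with a bounded queue and uniform random sampling. The class name, the finite `capacity` and the optional `seed` are assumptions made for illustration; the text only specifies that transitions $\left( {s,a,R,s'} \right)$ are stored per time step and sampled uniformly.

```python
import random
from collections import deque


class ReplayMemory:
    """Minimal experience replay buffer with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A bounded deque discards the oldest transition when full (assumed policy).
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        """Append one transition (s, a, R, s')."""
        self._buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Draw batch_size distinct transitions uniformly at random."""
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)
```

Sampling without replacement from the whole buffer is what decorrelates consecutive time steps; training would normally start only once `len(memory)` is at least the batch size.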

The critic network is updated by minimising the loss function as

(24) \begin{align} L = {1 \over {{N_b}}}\sum\limits_{i = 1}^{{N_b}} {\delta _{TDi}^2} \end{align}

where ${N_b}$ is the batch size and $\delta _{TD}^{}$ denotes the temporal difference error, which follows from Equation (21) as [Reference Zavoli and Federici26]

(25) \begin{align} \delta _{TD}^{} = R + \gamma Q\!\left( {s',\mu \!\left( {s'\left| {{w^\mu }} \right.} \right)\left| {{w^Q}} \right.} \right) - Q\!\left( {s,a\left| {{w^Q}} \right.} \right)\end{align}

The actor network optimises and updates the network parameters according to the policy gradient ascent method as

(26) \begin{align} \Delta {w^\mu } = {\nabla _a}Q\!\left( {s,a\left| {{w^Q}} \right.} \right) \cdot {\nabla _{{w^\mu }}}\mu \!\left( {s\left| {{w^\mu }} \right.} \right)\end{align}

where ${\nabla _a}Q$ is the gradient of actor-value function $Q$ relative to action $a$ in critic network, ${\nabla _{{w_\mu }}}\mu \!\left( {s\left| {{w_\mu }} \right.} \right)$ is the gradient of policy function $\mu $ output by actor network relative to network parameters ${w^\mu }$ . Both are expressed as

(27) \begin{align} {\nabla _a}Q\!\left( {s,a\left| {{w^Q}} \right.} \right) = \frac{\partial Q\!\left( {s,\mu \!\left( {s\left| {{w^\mu }} \right.} \right)\left| {{w^Q}} \right.} \right)}{\partial a}\end{align}
(28) \begin{align} {\nabla _{{w^\mu }}}\mu \!\left( {s\left| {{w^\mu }} \right.} \right) = \frac{\partial \mu \!\left( {s\left| {{w^\mu }} \right.} \right)}{\partial {w^\mu }}\end{align}
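The chain rule in Equations (26) to (28) can be illustrated on a toy problem. The sketch below assumes a linear actor $\mu(s|w) = w \cdot s$ and a hand-written critic $Q(s,a) = -(a - a^*)^2$ so that both gradients are available in closed form; neither is the network used in this paper, and the learning rate and target action $a^*$ are arbitrary.

```python
def actor(w, s):
    """Hypothetical linear actor mu(s | w) = w . s with a scalar output."""
    return sum(wi * si for wi, si in zip(w, s))


def grad_q_wrt_a(a, a_best):
    """dQ/da for the toy critic Q(s, a) = -(a - a_best)^2."""
    return -2.0 * (a - a_best)


def actor_step(w, s, a_best, lr):
    a = actor(w, s)                      # critic gradient is evaluated at a = mu(s | w)
    dq_da = grad_q_wrt_a(a, a_best)      # Equation (27)
    dmu_dw = s                           # Equation (28): for a linear actor this is s
    delta_w = [dq_da * g for g in dmu_dw]                   # Equation (26)
    return [wi + lr * dwi for wi, dwi in zip(w, delta_w)]   # gradient ascent


w = [0.0, 0.0]
s = [1.0, 2.0]
for _ in range(200):
    w = actor_step(w, s, a_best=0.5, lr=0.05)
```

Because the update moves the weights uphill on $Q$, the actor output for this state approaches the action that maximises the toy critic.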

The algorithm is given as follows.

It is noted that the actor network is initialised by the DL pre-trained DNN obtained by supervised learning in Section 4.1.2 to improve the learning efficiency and convergence. Then the continuous interaction training between the hypersonic vehicle and the environment is carried out by ODPDAC to achieve the optimisation of the actor network.

4.0 Tests and analysis

4.1 DL pre-training

4.1.1 Multiple trajectories generation and sample processing

The hypersonic vehicle model for simulation is based on the published high-performance Common Aero Vehicle (CAV-H) model [Reference Phillips35], which has a maximum lift-to-drag ratio of 3.5 at an angle-of-attack of 10 degrees. The reference area of CAV-H is $0.48{{\rm{m}}^2}$ , and the mass is 907 kg. The convex optimisation method referred to in Ref. (Reference Zhou, He, Zhang, Tang and Bao36) is implemented to generate the standard trajectories offline. The initial states, terminal states and path constraints are shown in Table 1.

Table 1. Values of initial states, terminal states and constraints

Using CVX software on an Intel(R) Core(TM) i7-9700 CPU with 16GB RAM and an NVIDIA GeForce RTX 2060 SUPER GPU, multiple trajectories are generated with different initial position deviations as follows

(29) \begin{align} \left\{ \begin{array}{l@{\quad}l@{\quad}l} \lambda _i^{} = \lambda _0^{} + i\Delta \lambda , & i = - 2, - 1,0,1,2, & \Delta \lambda = -0.25^{\circ} \\[5pt] \phi _j^{} = \phi _0^{} + j\Delta \phi , & j = - 2, - 1,0,1,2, & \Delta \phi = - \, {0.25^ \circ } \\[5pt] h_k^{} = h_0^{} + k\Delta h, & k = - 2, - 1,0,1,2, & \Delta h = 500{\rm{m}} \end{array} \right.\end{align}

where $\lambda _i^{},\phi _j^{}$ and $h_k^{}$ are the longitude, latitude and altitude, respectively; $i,j$ and $k$ are the corresponding deviation multipliers; and $\Delta \lambda $ , $\Delta \phi $ and $\Delta h$ are the corresponding deviation step sizes. Combining the different multipliers $\left( {i,j,k} \right)$ yields ${5^3} = 125$ initial positions $\left( {\lambda _i^{},\phi _j^{},h_k^{}} \right)$ . The convex optimisation method is then used to solve Equation (12) to obtain the bank angle control input, with velocity set as the trajectory termination condition. Substituting the bank angle commands into the dynamics model for numerical integration generates 125 reentry trajectories. All of them begin at different spatial locations, enveloping a certain area in the longitudinal and latitudinal planes, and finally arrive at the preset target location. The average time required to compute a complete trajectory is 20.83s. None of these 125 trajectories exceeds the path constraint limits on heat rate, dynamic pressure, and overload. A total of 5,051,254 motion state-control input pairs are sampled on the trajectories at an interval of 0.5s and shuffled for subsequent DL training. To improve the efficiency and accuracy of DL, the samples are normalised to make them dimensionless and of the same magnitude by

(30) \begin{align} \left\{ \begin{array}{l} s = \dfrac{{s_I} - {{\bar{s}}_I}}{\Delta {s_I}}, \quad \Delta {s_I} = \max \!\left( {{s_I}} \right) - \min \!\left( {{s_I}} \right) \\[10pt] a = \dfrac{\upsilon - \bar \upsilon }{\Delta \upsilon }, \quad \Delta \upsilon = \max \!\left( \upsilon \right) - \min \!\left( \upsilon \right) \end{array} \right.\end{align}

where $s$ is the motion state vector used as the DNN input and $a$ is the DNN output. ${s_I}$ and $\upsilon $ are the state vector and bank angle of the samples, respectively. ${\bar s_I}$ and $\bar \upsilon $ represent the means of the input and output samples, while $\Delta {s_I}$ and $\Delta \upsilon $ represent their ranges, i.e. the difference between the maximum and minimum sample values. According to the 90/10 standard involving 10-fold cross-validation, the normalised sample data is divided into a training set with 4,546,128 sequence pairs and a test set with 505,126 sequence pairs.
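The sample preparation of Equations (29) and (30) and the 90/10 split can be sketched as follows. This is a minimal illustration under stated assumptions: the nominal initial position passed to `initial_positions` is a placeholder rather than the Table 1 value, the bank-angle list is made up, and the function names and the shuffle-then-cut split are choices made for this example.

```python
import itertools
import random


def initial_positions(lam0, phi0, h0, d_lam=-0.25, d_phi=-0.25, d_h=500.0):
    """Equation (29): every combination (i, j, k) with i, j, k in {-2, ..., 2}."""
    steps = range(-2, 3)
    return [(lam0 + i * d_lam, phi0 + j * d_phi, h0 + k * d_h)
            for i, j, k in itertools.product(steps, repeat=3)]


def normalise(values):
    """Equation (30): subtract the mean, then divide by (max - min)."""
    mean = sum(values) / len(values)
    span = max(values) - min(values)
    return [(v - mean) / span for v in values]


def split_90_10(samples, seed=0):
    """Shuffle, keep the first 90% for training and the remainder for testing."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10  # integer arithmetic avoids rounding surprises
    return shuffled[:cut], shuffled[cut:]


# Placeholder nominal position (deg, deg, m); not taken from Table 1.
grid = initial_positions(lam0=0.0, phi0=0.0, h0=60000.0)
bank_norm = normalise([-60.0, -20.0, 0.0, 20.0, 60.0])  # made-up bank angles, deg
train, test = split_90_10(list(range(100)))
```

With this normalisation every variable has zero mean and a total spread of one, so states with very different physical units enter the network on a comparable scale.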

4.1.2 DNN structure and hyperparameters

With the samples collected above, DL is employed to train a DNN that learns the nonlinear mapping from the motion state input to the control variable output. To approximate this nonlinearity, hidden layers 2 through 5 use the ReLU function, while hidden layer 1 and the output layer use the tanh function to realise the nonlinear transformation. After comparative tests, the DNN structure and hyperparameters for trajectory generation are briefly summarised in Table 2.

Table 2. Design of DNN structure and hyperparameters
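To make the activation layout concrete, the following sketch runs a forward pass through a network with tanh on hidden layer 1 and the output layer and ReLU on hidden layers 2 through 5. The layer widths (6 inputs, 8 units per hidden layer, 1 output) and the random weights are assumptions for illustration only and are not the values of Table 2.

```python
import math
import random


def dense(x, weights, biases):
    """One fully connected layer: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def forward(x, layers):
    """tanh on hidden layer 1 and the output layer, ReLU on hidden layers 2-5."""
    last = len(layers) - 1
    for idx, (weights, biases) in enumerate(layers):
        z = dense(x, weights, biases)
        if idx == 0 or idx == last:
            x = [math.tanh(v) for v in z]
        else:
            x = [max(0.0, v) for v in z]
    return x


def random_layers(sizes, seed=0):
    """Small random weights and zero biases; widths are illustrative only."""
    rng = random.Random(seed)
    return [([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


# Assumed shape: 6 state inputs, five hidden layers of 8 units, 1 output.
layers = random_layers([6, 8, 8, 8, 8, 8, 1])
bank_normalised = forward([0.1, -0.2, 0.3, 0.0, 0.5, -0.4], layers)
```

The tanh output layer bounds the result to the interval from -1 to 1, which suits a normalised bank-angle command.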

For training and testing the DNN on the samples while avoiding overfitting, the loss function is defined as

(31) \begin{align} L = {1 \over m}\sum\limits_{i = 1}^m {{{\left( {{\upsilon _i} - {{\hat \upsilon }_i}} \right)}^2}} + {L_2}\end{align}
(32) \begin{align} {L_2} = {\lambda \over {2n}}\sum\limits_w {{w^2}} \end{align}

In Equation (31), $m$ is the number of training samples, and $\hat \upsilon $ and $\upsilon $ denote the label and the network output, respectively. The first term on the right side of Equation (31) is the mean square error and ${L_2}$ is the regularisation loss defined in Equation (32), where $n$ is the number of weights, $w$ represents the weight coefficients, and $\lambda $ is the regularisation coefficient. The regularisation term ${L_2}$ constrains the weight coefficients of the network to reduce the complexity of the model and improve its generalisation ability. The learning rate and the regularisation coefficient $\lambda $ are set to 0.005 and 0.0015, respectively, in DL training using Python 3.7.
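A short numerical sketch of Equations (31) and (32) follows. The labels, outputs and weights are made-up values; only the regularisation coefficient 0.0015 is taken from the text, and the function name `regularised_loss` is an assumption of this example.

```python
def regularised_loss(labels, outputs, weights, lam):
    """Equation (31): mean squared error plus the L2 penalty of Equation (32)."""
    m = len(labels)
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(outputs, labels)) / m
    n = len(weights)
    l2 = lam / (2 * n) * sum(w * w for w in weights)
    return mse + l2


loss = regularised_loss(
    labels=[0.0, 1.0],               # made-up normalised bank-angle labels
    outputs=[0.5, 1.0],              # made-up network outputs
    weights=[1.0, -1.0, 2.0, 0.0],   # made-up weight coefficients
    lam=0.0015,                      # regularisation coefficient quoted in the text
)
```

Here the mean squared error is 0.125 and the penalty is $0.0015/(2 \times 4) \times 6 = 0.001125$, so large weights raise the loss even when the fit is unchanged.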

4.1.3 Results of DL

After 1,000 epochs, the variation of the loss function is displayed in Fig. 3 as follows.

Figure 3. Variation of the loss function during the training process.

It is inferred from Fig. 3 that the loss function converges to a small value in the DL process and fulfills the error precision requirement at the end of training. To further confirm the DL training results, the trained DNN is embedded in the simulation of the trajectory generation as shown in Fig. 4.

Figure 4. Configuration of real-time trajectory generation by DNN.

As can be seen from Fig. 4, the state variables are input to the DL pre-trained DNN, which outputs the bank angle to generate the trajectory in real time. The simulation conditions are the same as those in Section 4.1.1.

From Table 3, it can be inferred that the trajectory generated by the DL pre-trained DNN satisfies the path and terminal constraints. Because the optimal learning parameters for DL are difficult to determine, some overfitting or underfitting occurs and results in a terminal position error of 106.27 km. Nevertheless, since the trajectory can be reliably generated, the pre-trained DNN obtained by DL is appropriate for initialising the subsequent RL actor network.

Table 3. Terminal error and maximum constraints

4.2 Tests of RL

4.2.1 Network structures and hyperparameters

As shown in Fig. 2, ODPDAC contains two DNNs: the actor and the critic. The structure and hyperparameters of the actor network are consistent with those of the DL pre-trained DNN in Section 4.1.2 to facilitate the initialisation of the network parameters. The settings of the critic network are chosen with reference to the actor network. After repeated attempts and comparisons, the structure and hyperparameters of the actor and critic networks are presented in Table 4. The values of ${p_i}\!\left( {i = 1,2, \cdots ,5} \right)$ are set as 100, 100, 30, 1.5, and 0.5, respectively. The values of $\Delta {S_f}$ and ${h_r}$ are ${1^ \circ }$ and $100{\rm{m}}$ , respectively. The learning rate and the total number of training episodes are $2.5 \times {10^{ - 5}}$ and 3,500, respectively.

Table 4. Structure parameter of A-C networks

4.2.2 Training results

To verify the benefit of DL pre-training for RL, two cases, RL with pre-training and RL without pre-training, are set up for comparative analysis. After 3,000 episodes of RL training, the variations of the terminal reward and of the terminal position and altitude errors of the trajectories with training episodes are shown in Fig. 5.

Figure 5. Variation of terminal variables in RL.

In Fig. 5, during the training of RL with DL pre-training, the reward grows gradually with the number of training episodes. In the initial phase, the exploration value is large and the policy is not yet sufficiently optimised. Through continuous learning, the reward rises and the actor network is optimised rapidly within the first thousand training episodes. After 2,500 episodes, the reward converges stably to a larger value and finally reaches the maximum of 46.63. The terminal position and altitude errors converge to smaller values of 19.14km and 44.2m, respectively. In contrast, RL without DL pre-training fails, with unacceptable terminal position and altitude errors. It is extremely difficult to learn a feasible solution in a short time for large-scale continuous-space problems such as glide trajectory generation. Therefore, the feasible but not optimal initialisation from pre-training effectively improves the stability and accelerates the convergence of RL.

4.2.3 Effectiveness of the actor network

To verify the effectiveness of the actor network optimised by RL training, it is substituted into the dynamics model of Equation (1) as shown in Fig. 4 to implement the numerical simulation of real-time trajectory generation. The integral time step of the 6-DOF dynamic simulation is set to 0.1s. The simulation conditions are the same as those in Section 4.1.2 and the test results are presented in Fig. 6.

Figure 6. Resulting curves of trajectories by RL policy network.

From Fig. 6(a) and (b), the RL actor network completes real-time trajectory generation in the glide phase. The average computing time per step is only 0.4329ms, well within the integral time step of 0.1s. The network output is obtained quickly by simple vector/matrix multiplication of the input state with the parameters of each hidden layer. Therefore, the online planning time of the RL policy network satisfies the onboard planning time requirement of a hypersonic vehicle. Moreover, the total time required to compute a complete trajectory is 9.28s, which is less than the 20.83s required by the convex optimisation method. This indicates the very low online computational burden of policy networks based on the DNN model. The hypersonic vehicle reaches the preset target position and altitude within the terminal boundary constraints. Figure 6(c) shows that the velocity of the vehicle decreases gently to the terminal velocity, without sharp changes or oscillation. The flight-path angle is kept near zero except in the initial descent phase, where the values change drastically owing to the lack of lift. The heading angle changes slowly throughout the flight without oscillation, and the direction does not vary frequently. Figure 6(d) shows that the heat rate, dynamic pressure, and overload are kept below the maximum constraints during the whole flight. As shown in Fig. 6(e), the bank angle profile output by the RL actor network is adjusted to enhance the effectiveness and reward of trajectory generation. The terminal errors and maximum constraint values in the glide phase with the RL actor network are reported in Table 5.

Table 5. Terminal errors and maximum constraints

It is apparent from Table 5 that the trajectory meets the path constraints throughout the flight and that the terminal position error is only 19.57km, which is significantly smaller than the 106.27km shown in Table 3. The results demonstrate the effectiveness of the trajectory generator based on ODPDAC. The actor network is optimised under the guidance of the reward, which considerably improves the performance of trajectory generation. Moreover, the time required by the actor network during each simulation step is so short that the generation of the control variable is near real-time.

4.2.4 Generalisation of actor networks

(1) Monte Carlo experiment

The generalisation of the RL actor network for online trajectory generation is tested under 500 initial position deviations randomly generated based on the Monte Carlo sampling principle. The initial deviations of altitude $\Delta h_0^{}\!\left( {\rm{m}} \right)$ , longitude $\Delta \lambda _0^{}\!\left( {{\rm{deg}}} \right)$ , and latitude $\Delta \phi _0^{}\!\left( {{\rm{deg}}} \right)$ are given as follows.

(33) \begin{align} \left\{ \begin{array}{l} \Delta h_0 = N\!\left(0,{{{1000}^2}} \right) \\[5pt] \Delta \lambda _0 = N\!\left({0,}\,{{{1.0}^2}}\right) \\[5pt] \Delta \phi _0^{} = N\!\left({0,}\,{{{1.0}^2}}\right) \end{array} \right.\end{align}

In Equation (33), $N\!\left( {\mu ,{\sigma ^2}} \right)$ denotes the normal distribution, where $\mu $ is the mean value and ${\sigma ^2}$ is the variance. To assess the effect of RL, the DL pre-trained DNN obtained in Section 4.1 is also tested using the same Monte Carlo experiments, and the final results of the two networks are compared in Fig. 7.
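The sampling in Equation (33) can be sketched with the Python standard library. The function name, the fixed seed and the assumption that the three deviations are drawn independently are choices of this example; the standard deviations (1000 m and 1.0 deg) follow Equation (33).

```python
import random


def sample_initial_deviations(n, seed=0):
    """Equation (33): zero-mean normal deviations, drawn independently here."""
    rng = random.Random(seed)
    return [(rng.gauss(0.0, 1000.0),   # altitude deviation, m (sigma = 1000 m)
             rng.gauss(0.0, 1.0),      # longitude deviation, deg (sigma = 1 deg)
             rng.gauss(0.0, 1.0))      # latitude deviation, deg (sigma = 1 deg)
            for _ in range(n)]


deviations = sample_initial_deviations(500)
mean_dh = sum(d[0] for d in deviations) / len(deviations)
```

Note that `random.gauss` takes the standard deviation, not the variance, so $N(0, 1000^2)$ corresponds to `gauss(0.0, 1000.0)`.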

Figure 7. Comparison of Monte Carlo experiments results.

As shown in Fig. 7(a), the RL actor network generates online trajectories in the glide phase and reaches the preset position and altitude range despite the initial positional uncertainty. In Fig. 7(b), the velocity of all trajectories decreases slowly without sharp changes or oscillation. Figure 7(c) shows that the heat rate, dynamic pressure, and overload of all trajectories are kept within the maximum constraints. Figure 7(d) shows that the bank angle profiles differ considerably in order to adapt online trajectory generation to the different initial positions. According to Fig. 7(e) to (h), under the same initial position uncertainty, the trajectories of the DL pre-trained DNN also complete the online trajectory generation during the flight, but the distributions of the process states and terminal positions vary widely.

(2) Comparative analysis

To further compare the generalisation of the RL actor network and the DL pre-trained DNN, the terminal position errors, terminal altitude errors, and rewards of the two groups of trajectories are illustrated in Fig. 8 and Table 6. In Fig. 8(a), the circular error probable (CEP) is the radius of a circle centred on the target point such that, over multiple Monte Carlo experiments, the vehicle has a 50% probability of falling within the circle [Reference Bao, Wang and Tang2].
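One simple empirical estimate of such a 50% radius is the median of the radial miss distances, sketched below. The estimator is an assumption of this example, since the exact procedure used to compute the CEP is not stated here, and the terminal offsets are made-up numbers.

```python
import math
import statistics


def circular_error_probable(miss_points):
    """Radius of the target-centred circle containing half of the end points.

    miss_points holds (east, north) offsets from the target in a common unit.
    """
    radii = [math.hypot(dx, dy) for dx, dy in miss_points]
    return statistics.median(radii)


# Made-up terminal offsets in km, for illustration only.
offsets = [(3.0, 4.0), (0.0, 1.0), (6.0, 8.0), (0.0, 2.0), (-5.0, 12.0)]
cep = circular_error_probable(offsets)
```

For these five points the radial distances are 5, 1, 10, 2 and 13 km, so the estimated radius is the middle value, 5 km.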

Figure 8. Statistical charts of Monte Carlo experiments results.

Table 6. Statistical results of RL and DL experiments

Figure 8(a) shows that the CEP of RL is less than half that of DL, which demonstrates the high landing accuracy of the RL actor network under large initial state uncertainties. Meanwhile, the mean terminal reward of RL presented in Fig. 8(b) is larger than that of DL, whereas the opposite holds for its standard deviation. Accordingly, Fig. 8(c) shows that the absolute mean and the standard deviation of the terminal position error of RL are both smaller than the corresponding values of DL. As can be seen in Fig. 8(d), although the mean altitude errors of RL and DL do not differ greatly, the standard deviation of the altitude error of RL is significantly smaller than that of DL, which implies that the altitude control of DL is more sensitive to initial state uncertainties than that of RL.

In summary, the statistical results for RL and DL compared in Table 6 lead to the conclusion that the actor network obtained by RL greatly improves the generalisation of online trajectory planning, while the DL pre-trained DNN has difficulty adapting to situations involving large initial state uncertainties.

5.0 Discussions

Based on the above design and simulation, there are three interesting points worthy of further discussion as follows.

(1) The flight target preset in the simulation is used for the position estimation of the TAEM. Since the terminal position error requirement for the TAEM is not strict, a terminal position error of 19.57km is reasonable and acceptable. The subsequent flight in the dive phase will accurately guide the hypersonic vehicle to the predetermined ground target. If high terminal position accuracy is required in the reentry gliding phase, the position error can be eliminated by terminal guidance methods, including range and azimuth error guidance [Reference Bao, Wang and Tang2] and relative line-of-sight angle guidance [Reference Bao, Wang and Tang1].

(2) The simulation results indicate that the landing accuracy and generalisation of the RL actor network are notably better than those of the DL pre-trained DNN. A possible explanation is that DL only learns the input-output mapping contained in the samples and aims to minimise the error between its outputs and the sample labels. In contrast, RL aims to obtain the maximum reward in the environment rather than to minimise the output error, so RL does not overfit to the samples as DL does, which greatly improves the stability and reliability of the actor network. Through the exploration vs. exploitation mechanism of RL, the actor network explores, via random actions, unknown states that DL never visits, which promotes generalisation and autonomy under unknown uncertainties.

(3) Neural networks will remain the model basis for achieving intelligent properties such as intelligent planning and decision-making for hypersonic vehicles, because end-to-end learning can capture the intrinsic mapping between observed states and decisive actions to the greatest extent possible and thus cope autonomously with various unknown and uncertain situations. When the neural network is deployed online after end-to-end learning, its computation time is much smaller than that of complex optimisation and control algorithms. In this sense, it is particularly apt for online real-time planning and control.

6.0 Conclusions

In this paper, an onboard 3D trajectory generation method is designed based on the RL algorithm. To accelerate convergence and raise the success rate of RL, the pre-trained DNN is utilised to initialise the RL actor network. Based on the ODPDAC algorithm and a reward function guided by the highest terminal accuracy, an onboard trajectory generator is established through end-to-end learning. Simulation results show that the actor network can directly output trajectory control commands in 0.429ms according to the motion state observed online. The RL-based planning method significantly improves the terminal accuracy of the trajectory. In the case of a biased initial position state, the RL actor network shows better generalisation.

Future work will focus on the challenges of small-sample learning, additional flight constraints, and variable target points in trajectory generation by RL. Additionally, considering environmental uncertainties will make the problem more complex. In conclusion, this work offers a new perspective on, and a deeper understanding of, the application of artificial intelligence algorithms in flight control, and its initial results are encouraging. These results will contribute to the development of intelligent control of the hypersonic vehicle.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant No. 62003355).

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest concerning the research, authorship, and/or publication of this paper.

References

Bao, C., Wang, P. and Tang, G. Integrated guidance and control for hypersonic morphing missile based on variable span auxiliary control, Int J Aerosp Eng, 2019, p 6413410. https://doi.org/10.1155/2019/6413410
Bao, C., Wang, P. and Tang, G. Integrated method of guidance, control and morphing for hypersonic morphing vehicle in glide phase, Chin J Aeronaut, 2021, 34, (5), pp 535–553. https://doi.org/10.1016/j.cja.2020.11.009
Zhang, W.J. and Wang, B.M. Predictor-corrector algorithms considering multiple constraints for entry vehicles, Aeronaut J, 2022, pp 1–23. https://doi.org/10.1017/aer.2022.19
He, R., Liu, L., Tang, G. and Bao, W.M. Rapid generation of entry trajectory with multiple no-fly zone constraints, Adv Space Res, 2017, 60, (7), pp 1430–1442. https://doi.org/10.1016/j.asr.2017.06.046
Wei, C., Han, Y., Pu, J., Li, Y. and Huang, P. Rapid multilayer method on solving optimal endoatmospheric trajectory of launch vehicles, Aeronaut J, 2019, 123, (1267), pp 1396–1414. https://doi.org/10.1017/aer.2019.17
Dancila, R.I. and Botez, R.M. New flight trajectory optimisation method using genetic algorithms, Aeronaut J, 2021, 125, (1286), pp 618–671. https://doi.org/10.1017/aer.2020.138
Chai, R., Tsourdos, A., Savvaris, A., Chai, S. and Xia, Y. High-fidelity trajectory optimization for aeroassisted vehicles using variable order pseudospectral method, Chin J Aeronaut, 2021, 34, (1), pp 237–251. https://doi.org/10.1016/j.cja.2020.07.032
Rizvi, S.T.I., Linshu, H., Dajun, X. and Shah, S.I.A. Trajectory optimisation for a rocket-assisted hypersonic boost-glide vehicle, Aeronaut J, 2017, 121, (1238), pp 469–487. https://doi.org/10.1017/aer.2017.11
Kwon, D., Jung, Y., Cheon, Y.J. and Bang, H. Sequential convex programming approach for real-time guidance during the powered descent phase of Mars landing missions, Adv Space Res, 2021, 68, (11), pp 4398–4417. https://doi.org/10.1016/j.asr.2021.08.033
Sagliano, M., Mooij, E. and Theil, S. Onboard trajectory generation for entry vehicles via adaptive multivariate pseudospectral interpolation, AIAA Guidance, Navigation, and Control Conference, San Diego, California, USA, 2016. https://doi.org/10.2514/6.2016-2115
Sagliano, M., Heidecker, A., Macés Hernández, J., Farì, S., Schlotterer, M., Woicke, S., Seelbinder, D. and Dumont, E. Onboard guidance for reusable rockets: aerodynamic descent and powered landing, AIAA Scitech 2021 Forum, 2021, Virtual Event. https://doi.org/10.2514/6.2021-0862
Shirobokov, M., Trofimov, S. and Ovchinnikov, M. Survey of machine learning techniques in spacecraft control design, Acta Astronaut, 2021, 186, pp 87–97. https://doi.org/10.1016/j.actaastro.2021.05.018
Schmidhuber, J. Deep learning in neural networks: An overview, Neural Netw, 2015, 61, pp 85–117. https://doi.org/10.1016/j.neunet.2014.09.003
Basturk, O. and Cetek, C. Prediction of aircraft estimated time of arrival using machine learning methods, Aeronaut J, 2021, 125, (1289), pp 1245–1259. https://doi.org/10.1017/aer.2021.13
Nie, W., Li, H. and Zhang, R. Model-free adaptive optimal design for trajectory tracking control of rocket-powered vehicle, Chin J Aeronaut, 2020, 33, (6), pp 1703–1716. https://doi.org/10.1016/j.cja.2020.02.022
Shi, Y. and Wang, Z. Onboard generation of optimal trajectories for hypersonic vehicles using deep learning, J Spacecr Rockets, 2021, 58, (2), pp 400–414. https://doi.org/10.2514/1.A34670
Sánchez, C. and Izzo, D. Real-time optimal control via deep neural networks: study on landing problems, J Guid Control Dyn, 2018, 41, (5), pp 1122–1135. https://doi.org/10.2514/1.G002357
Cheng, L., Wang, Z., Jiang, F. and Li, J. Fast generation of optimal asteroid landing trajectories using deep neural network, IEEE Trans Aerosp Electron Syst, 2020, 56, (4), pp 2642–2655. https://doi.org/10.1109/TAES.2019.2952700
Tenenbaum, J.B., Kemp, C., Griffiths, T.L. and Goodman, N.D. How to grow a mind: statistics, structure, and abstraction, Science, 2011, 331, (6022), pp 1279–1285. https://doi.org/10.1126/science.1192788
Tsitsiklis, J.N. Asynchronous stochastic approximation and Q-learning, Mach Learn, 1994, 16, (3), pp 185–202. https://doi.org/10.1007/BF00993306
Han, X., Zheng, Z., Liu, L., Wang, B., Cheng, Z., Fan, H. and Wang, Y. Online policy iteration ADP-based attitude tracking control for hypersonic vehicles, Aerosp Sci Technol, 2020, 106, p 106233. https://doi.org/10.1016/j.ast.2020.106233
Shi, Z., Zhao, F., Wang, X. and Jin, Z. Satellite attitude tracking control of moving targets combining deep reinforcement learning and predefined-time stability considering energy optimization, Adv Space Res, 2022, 69, (5), pp 2182–2196. https://doi.org/10.1016/j.asr.2021.12.014
Gaudet, B., Linares, R. and Furfaro, R. Adaptive guidance and integrated navigation with reinforcement meta-learning, Acta Astronaut, 2020, 169, pp 180–190. https://doi.org/10.1016/j.actaastro.2020.01.007
Gaudet, B., Linares, R. and Furfaro, R. Deep reinforcement learning for six degree-of-freedom planetary landing, Adv Space Res, 2020, 65, (7), pp 1723–1741. https://doi.org/10.1016/j.asr.2019.12.030
Gaudet, B., Linares, R. and Furfaro, R. Terminal adaptive guidance via reinforcement meta-learning: Applications to autonomous asteroid close-proximity operations, Acta Astronaut, 2020, 171, pp 1–13. https://doi.org/10.1016/j.actaastro.2020.02.036
Zavoli, A. and Federici, L. Reinforcement learning for robust trajectory design of interplanetary missions, J Guid Control Dyn, 2021, 44, (8), pp 1440–1453. https://doi.org/10.2514/1.G005794
Zhao, Y., Yang, H. and Li, S. Real-time trajectory optimization for collision-free asteroid landing based on deep neural networks, Adv Space Res, 2022, 70, (1), pp 112–124. https://doi.org/10.1016/j.asr.2022.04.006
LaFarge, N.B., Miller, D., Howell, K.C. and Linares, R. Autonomous closed-loop guidance using reinforcement learning in a low-thrust, multibody dynamical environment, Acta Astronaut, 2021, 186, pp 1–23. https://doi.org/10.1016/j.actaastro.2021.05.014
Xu, D. and Chen, G. Autonomous and cooperative control of UAV cluster with multi-agent reinforcement learning, Aeronaut J, 2022, 126, (1300), pp 932–951. https://doi.org/10.1017/aer.2021.112
Zhou, Z.G., Zhou, D., Chen, X. and Shi, X.N. Adaptive actor-critic learning-based robust appointed-time attitude tracking control for uncertain rigid spacecrafts with performance and input constraints, Adv Space Res, 2022, p S0273117722003386. https://doi.org/10.1016/j.asr.2022.04.061
Mnih, V., Kavukcuoglu, K., Silver, D. and Rusu, A.A. Human-level control through deep reinforcement learning, Nature, 2015, 518, (7540), pp 529–533. https://doi.org/10.1038/nature14236
Silver, D., Lever, G. and Heess, N. Deterministic policy gradient algorithms, Proceedings of the 31st International Conference on Machine Learning, 21–26 June 2014, 32, pp 387–395, Beijing, China.
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T. and Hassabis, D. Mastering the game of Go with deep neural networks and tree search, Nature, 2016, 529, (7587), pp 484–489. https://doi.org/10.1038/nature16961
Zhou, X., Zhang, H.B., Xie, L., Tang, G.J. and Bao, W.M. An improved solution method via the pole-transformation process for the maximum-cross range problem, Proc ImechE G: J Aerosp Eng, 2020, 234, (9), pp 1491–1506. https://doi.org/10.1177/0954410020914809
Phillips, T.H. A common aero vehicle model, description, and employment guide, www.dtic.Mil/matris/sbir041/srch/af031a.doc, 2013.
Zhou, X., He, R.Z., Zhang, H.B., Tang, G.J. and Bao, W.M. Sequential convex programming method using adaptive mesh refinement for entry trajectory planning problem, Aerosp Sci Technol, 2021, 109, p 106374. https://doi.org/10.1016/j.ast.2020.106374