1. Introduction
Over the last few decades, the application scenarios of mobile robots have gradually expanded from isolated static space to social space shared with human beings, such as hospitals, shopping malls, and canteens. In these crowd scenarios, humans frequently change their states, including moving direction, speed, acceleration, etc., to avoid collisions with nearby humans and obstacles. While these behaviors may appear random, they are in fact dynamic adjustments in response to the ever-changing environment [Reference Kruse, Pandey, Alami and Kirsch1–Reference Yin and Yin4]. This dynamic nature of human behaviors makes robot navigation in a crowded environment more complex than in an environment without humans and leads to the Freezing Robot Problem [Reference Fan, Cheng, Pan, Long, Liu, Yang and Manocha5–Reference Sathyamoorthy, Patel, Guan and Manocha7]. The robot cannot find a feasible path and falls into an oscillating or stopping situation. In order to navigate in dense crowds safely and socially compliantly, robots need to understand human behaviors and obey their cooperative rules [Reference Bachiller, Rodriguez-Criado, Jorvekar, Bustos, Faria and Manso8–Reference Truong and Ngo11].
Previous work attempts to jointly plan feasible paths for all the agents [Reference Trautman and Krause6,Reference Pfeiffer, Schwesinger, Sommer, Galceran and Siegwart12] or predict the future trajectories of the humans before planning [Reference Bennewitz, Burgard, Cielniak and Thrun13–Reference Aoude, Luders, Joseph, Roy and How15]. However, these methods suffer from the stochasticity of human behaviors and high computational cost in dense crowd scenarios. In recent years, deep reinforcement learning technologies have made significant progress in solving social navigation problems [Reference Zhou, Zhu, Zeng, Xiao, Lu and Zhou16–Reference Hu, Zhao, Zhang, Zhou and Liu18]. These methods train computationally efficient policies that implicitly encode the interactions and cooperations among agents. Despite the recent progress, navigation in the crowd with fast-moving pedestrians is not well investigated. In real-world applications, pedestrians in sparse environments or with emergency issues usually have random moving patterns in the crowd. As a result, the safety of existing models remains a concern.
We present a safety measure, the Risk-Area, for safe robot navigation in densely populated environments, as briefly shown in Fig. 1. The main contributions of this article are summarized as follows.
1. We introduce a dynamic model and analysis of the in-plane collision process between robots and pedestrians, providing a basis for assessing collision risk.
2. We incorporate human behavior and collision theory into the design of the Risk-Area, enhancing the understanding and implementation of socially aware navigation strategies.
3. We conduct extensive experiments and evaluations of our approach and the state-of-the-art method in challenging simulation and real-world environments.
The rest of this paper is organized as follows. Section 2 provides the background and the problem formulation of the robot navigation in the crowd scenarios. Section 3 provides details of the Risk-Area design for the navigation problem. Section 4 presents the simulation and real-world experiment results. We conclude this paper in Section 5.
2. Background
2.1. Related work
Traditional methods attempt to tackle the robot navigation problem in crowded environments through well-engineered rules. A pioneering work develops the social force model, which defines attractive and repulsive forces to describe human interactions and has been successfully applied in both simulation and real-world environments [Reference Helbing and Molnar20–Reference Yang, Zhang, Chen and Fu22]. Reciprocal velocity obstacles (RVO) [Reference van den Berg, Lin and Manocha23] and optimal reciprocal collision avoidance (ORCA) [Reference van den Berg, Guy, Lin and Manocha24] are velocity-based breakthrough algorithms in multi-agent collision avoidance. They consider joint collision avoidance under reciprocal assumptions to achieve collision-free navigation. The interacting Gaussian process [Reference Trautman, Ma, Murray and Krause25] uses an individual Gaussian process to model the trajectory of each agent and adds an interaction potential term to couple the agents. Nevertheless, the above methods rely heavily on handcrafted interaction models and therefore have difficulty generalizing to complex human-robot interaction scenarios.
Learning-based methods focus on learning strategies from human interactions to obtain appropriate behaviors. For example, imitation learning approaches learn policies directly from expert demonstrations and map various inputs such as depth images, raw lidar data, and occupancy maps to robot actions [Reference Tai, Zhang, Liu and Burgard26–Reference Qin, Huang, Zhang, Guo, Ang and Rus28]. Inverse reinforcement learning captures human cooperation features from human interactions via the maximum entropy method [Reference Kretzschmar, Spies, Sprunk and Burgard29–Reference Konar, Baghi and Dudek31]. However, the outcomes of these methods depend heavily on the quality of the demonstrations.
Over the last few years, reinforcement learning has achieved great progress in solving real-time navigation problems in crowded environments. Long et al. [Reference Long, Fan, Liao, Liu, Zhang and Pan32] present a decentralized multi-robot collision avoidance algorithm, selecting actions directly from raw sensor data to achieve collision-free navigation. Chen et al. [Reference Chen, Liu, Everett and How33] develop a value network that encodes the estimated time to the goal based on the joint state of the agents. Subsequently, this work is extended by integrating social norms into reward functions in ref. [Reference Chen, Everett, Liu and How34]. In order to overcome the limitation of fixed observation size, Everett et al. [Reference Everett, Chen and How35] adopt the long short-term memory to encode agent states into a fixed-length vector. In ref. [Reference Chen, Liu, Kreiss and Alahi36], an attention mechanism is proposed to perceive the importance of humans in the crowd by modeling human-robot and human-human interactions, significantly improving the robots’ decision-making ability. Thereafter, Chen et al. [Reference Chen, Liu, Shi and Liu37] utilize graph convolutional networks to extract the potential interaction features. The preceding studies succeed in capturing the cooperation and interactions of human behaviors. Nevertheless, the performance of the current models degrades as the crowd complexity increases. In particular, most existing models are overly optimistic about the speed of the robot and pedestrians. In order to achieve safe and comfortable navigation in social environments, the speed of mobile robots has to be significantly slower than that of pedestrians [Reference Kruse, Pandey, Alami and Kirsch1,Reference Bohannon and Andrews38–Reference Hoyt and Taylor40]. To address the issue, Samsani et al. [Reference Samsani and Muhammad19] predict the constrained action space around the human based on the current human velocity and formulate Danger-Zones (DZ) for the robot. However, this method does not fully consider the relative motion among the agents. In this study, we design Risk-Areas around each human as safety measures according to collision theory and the real-time relative movements.
2.2. Problem formulation
We consider a situation where a robot moves in a crowd of $n$ humans and reaches its goal without any collision. This task can be formulated as a sequential decision-making problem in a reinforcement learning framework [Reference Chen, Liu, Everett and How33]. It is assumed that humans do not avoid or intentionally hinder the robot during navigation. For each agent (human or robot), the position $[p_x,p_y]$ , velocity $[v_x, v_y]$ , and radius $r$ are observable to the others. The agent’s intended goal $[g_x, g_y]$ , preferred speed ${\textbf{v}}_{pref}$ , and heading angle $\theta$ are known only to the agent itself. At every time step, the robot can observe its full state $\textbf{s}_r$ and the observable state $\textbf{s}_{h_o}$ of the humans. The state of the robot is represented as $\textbf{s}_r=[p_x,p_y,v_x,v_y,r,g_x,g_y,v_{pref},\theta ]$ . The observable state of the $i$ -th human is defined as $\textbf{s}_{h_o}^i=[p^i_x,p^i_y,v^i_x,v^i_y,r^i]$ . A robot-centric frame is employed to make the state representation more universal, in which the origin is set at the current position of the robot $\textbf{p}_r = [p_x, p_y]$ , and the $x$ -axis points to the goal position $\textbf{g}_r = [g_x,g_y]$ . After transformation, the states $\textbf{s}_r$ and $\textbf{s}_{h_o}^i$ can be rewritten as follows:
where $d_g = \left \| \textbf{p}_r - \textbf{g}_r \right \|$ indicates the distance from the robot $\textbf{p}_r$ to the goal $\textbf{g}_r$ , and $d^i = \left \| \textbf{p}_r - \textbf{p}^i_h \right \|$ denotes the distance from the robot $\textbf{p}_r$ to the $i$ -th human $\textbf{p}_h^i$ . The joint state of the system at time $t$ for robot navigation is defined as ${\textbf{s}}_t^{jn}=[{\textbf{s}}_{rt},\textbf{s}^1_{h_o t},\textbf{s}^2_{h_o t},\ldots,\textbf{s}^n_{h_o t}]$ .
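The robot-centric transformation described above can be sketched in a few lines. This is only an illustration: the function name `to_robot_centric`, the dictionary keys, and the ordering of the entries in the returned vectors are assumptions, since the transformed state layout is not reproduced in this text. Only the quantities $d_g$ and $d^i$ and the choice of frame come from the description.

```python
import math

def to_robot_centric(robot, humans):
    """Rotate world-frame states into a frame centred on the robot
    whose x-axis points at the goal (hypothetical state layout)."""
    dx, dy = robot["gx"] - robot["px"], robot["gy"] - robot["py"]
    d_g = math.hypot(dx, dy)          # distance from robot to goal
    rot = math.atan2(dy, dx)          # angle of the goal direction in the world frame
    c, s = math.cos(rot), math.sin(rot)

    def rotate(x, y):
        # Express a world-frame vector in the goal-aligned frame.
        return x * c + y * s, -x * s + y * c

    vx, vy = rotate(robot["vx"], robot["vy"])
    s_r = [d_g, vx, vy, robot["r"], robot["v_pref"], robot["theta"] - rot]

    s_h = []
    for h in humans:
        px, py = rotate(h["px"] - robot["px"], h["py"] - robot["py"])
        hvx, hvy = rotate(h["vx"], h["vy"])
        d_i = math.hypot(px, py)      # distance from robot to this human
        s_h.append([px, py, hvx, hvy, h["r"], d_i])
    return s_r, s_h
```

A human standing directly between the robot and its goal ends up on the positive $x$ -axis of the new frame, whatever the world-frame orientation of the goal is.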
The robot is assumed to adjust its velocity ${\textbf{v}}_{rt}$ immediately according to the action command ${\textbf{a}}_t$ . The objective is to find an optimal policy $\pi ^*$ that maximizes the expected reward:
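The displayed equation is not reproduced in this text. A reconstruction consistent with the symbol definitions that follow, assuming the standard value-based formulation of the cited work [Reference Chen, Liu, Everett and How33], would read:

$$\pi^*({\textbf{s}}^{jn}_t) = \mathop{\mathrm{argmax}}_{{\textbf{a}}_t \in \textbf{A}} \; R({\textbf{s}}^{jn}_t,{\textbf{a}}_t) + \gamma^{\Delta t \cdot v_{pref}} \int_{{\textbf{s}}^{jn}_{t+\Delta t}} P({\textbf{s}}^{jn}_t,{\textbf{a}}_t,{\textbf{s}}^{jn}_{t+\Delta t})\, V^*({\textbf{s}}^{jn}_{t+\Delta t})\, d{\textbf{s}}^{jn}_{t+\Delta t}, \qquad V^*({\textbf{s}}^{jn}_t) = \sum_{t'=t}^{T} \gamma^{t' \cdot v_{pref}}\, R\big({\textbf{s}}^{jn}_{t'}, \pi^*({\textbf{s}}^{jn}_{t'})\big) \qquad (2)$$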
where $R({\textbf{s}}^{jn}_t,{\textbf{a}}_t)$ is the reward received at time $t$ , $\textbf{A}$ is the action space, $\gamma \in (0,1)$ is a discount factor, $V^*$ is the optimal value function, and $P({\textbf{s}}^{jn}_t,{\textbf{a}}_t,{\textbf{s}}^{jn}_{t+\Delta{t}})$ is the transition probability from time $t$ to $t+\Delta{t}$ . The preferred velocity $v_{pref}$ is used as a normalization term in the discount factor for numerical reasons.
2.3. Training strategy of the value network
The training procedure of the value network is outlined in Algorithm 1. First, the value network is trained with imitation learning using demonstrations generated by ORCA (lines 3-5). It is subsequently trained with the temporal-difference (TD) learning method, using standard experience replay and fixed target network techniques (lines 6-16) [Reference Mnih, Kavukcuoglu, Silver, Rusu, Veness, Bellemare, Graves, Riedmiller, Fidjeland, Ostrovski, Petersen, Beattie, Sadik, Antonoglou, King, Kumaran, Wierstra, Legg and Hassabis41]. In line 9, the $\epsilon$ -greedy strategy is applied to select actions. During training, the next state $\textbf{s}^{jn}_{t+\Delta t}$ is obtained by querying the true value from the environment, which avoids having to model the system dynamics. During deployment, we use a simple linear model to predict the motion of the agents over $\Delta t$ , which has shown good accuracy on small time scales [Reference Bera and Manocha42].
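The action-selection step can be illustrated with a minimal sketch that combines the linear motion model with an $\epsilon$ -greedy one-step lookahead. The names `propagate` and `select_action`, the callables `value_fn` and `reward_fn` (stand-ins for the trained value network and the reward function), and the step length `dt=0.25` are assumptions made for illustration, not the authors' implementation.

```python
import random

def propagate(state, dt):
    """Constant-velocity (linear) prediction of one agent (px, py, vx, vy)."""
    px, py, vx, vy = state
    return (px + vx * dt, py + vy * dt, vx, vy)

def select_action(robot_pos, humans, actions, value_fn, reward_fn,
                  dt=0.25, gamma=0.9, v_pref=1.0, epsilon=0.0, rng=random):
    """Epsilon-greedy one-step lookahead over a discrete set of velocity actions."""
    if rng.random() < epsilon:
        return rng.choice(actions)            # exploration branch
    next_humans = [propagate(h, dt) for h in humans]
    best_action, best_value = None, float("-inf")
    for vx, vy in actions:
        next_robot = (robot_pos[0] + vx * dt, robot_pos[1] + vy * dt)
        # Immediate reward plus discounted value of the predicted next state;
        # the exponent dt * v_pref is the normalisation mentioned in the text.
        value = (reward_fn(next_robot, next_humans)
                 + gamma ** (dt * v_pref) * value_fn(next_robot, next_humans))
        if value > best_value:
            best_action, best_value = (vx, vy), value
    return best_action
```

With `epsilon=0.0` the function is purely greedy, which corresponds to deployment; a positive `epsilon` gives the exploratory behavior used during training.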
To solve problem (2), the value network needs to approximate the optimal value function $V^*$ that maximizes the expected reward. The reward design is critical for achieving safe and socially compliant robot navigation [Reference Sutton and Barto43]. Previous work in this direction has not fully considered the risk in densely populated environments. In this paper, based on collision theory and real-time relative movements, we present Risk-Areas around each human as a safety measure.
3. Methodology
We outline the central idea of the Risk-Area in this section. First, we examine the human behavior and propose an impulse model of the human-robot collision. Then, we give the precise formulation and geometry of Risk-Area and discuss how the Risk-Area guides the robot to navigate in a safe and comfortable way.
3.1. Characteristic of the human and robot motion
Understanding the motions of both humans and robots is one of the bases of the Risk-Area design. Humans’ navigation manners are closely related to a series of parameters, such as age, mood, and weight. According to previous work, the average walking speed of pedestrians is about 1.4 m/s [Reference Bohannon and Andrews38], and in some situations might go up to 2.5 m/s [Reference Hoyt and Taylor40]. In comparison, robots navigate with human-provided strategies, and the kinematic patterns of robots can be artificially determined. From the perspective of psychological effect, an approaching speed of mobile robots exceeding 1 m/s causes obvious discomfort to humans [Reference Kruse, Pandey, Alami and Kirsch1,Reference Butler and Agah39]. Moreover, the maximum speed of common indoor mobile robots such as TurtleBot 2 [Reference Bai, Fan, Liu, Zhang and Zheng44] and PR2 [Reference Watts, Lancaster, Pedross-Engel, Smith and Reynolds45] is usually within 1 m/s. Due to inertia, both robots and humans tend to maintain their original motion and therefore pose a greater threat in that direction. It is not enough to model the risk only with the individual state. For example, in Fig. 2, the human state in (a) and (b) is the same while the collision risk is quite different. Therefore, the design of the collision risk should give more consideration to the relative movement of humans and robots.
3.2. Impulse model of the human-robot collision
The reward of the navigation system should reflect the aim of achieving safe and socially compliant robot navigation [Reference Sutton and Barto43]. Apart from the reward for reaching the target, collisions and other risky behaviors of the robot should be penalized. Previous methods in refs. [Reference Samsani and Muhammad19] and [Reference Everett, Chen and How35] give every collision the same penalty without considering the relative movement of the agents, which does not reflect the actual situation and hampers the robot’s judgment of the importance of different pedestrians. To address the issue, we take collision theory [Reference Brach46] and the relative motion among agents into account and design the Risk-Area. Considering the mass and speed of common mobile robots and humans, this paper mainly discusses the case of the robot colliding with the human in-plane. Some typical collision assumptions [Reference Brach47] are made as follows:
Assumption 1. The duration of contact is sufficiently short and the interaction forces are high.
Assumption 2. The effects of friction and other forces can be omitted.
Assumption 3. The masses of the human and the robot during the collision are constant.
The collision response contains the compression and expansion phases, as shown in Fig. 3. At the beginning of the compression phase, the bodies of the robot and the human come into contact and have a positive approaching speed $v_a$ that compresses them against each other. At the end of compression, the approaching speed $v_a$ becomes zero and the velocities of the human and the robot along the line of impact are the same. A period of expansion follows, during which the bodies of the human and the robot try to regain their original shape. At the end of the expansion phase, the bodies of the human and the robot separate with a negative approaching speed $v_a$ . The collision response satisfies the following relationship:
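The displayed relations are not reproduced in this text. Under Assumptions 1-3, a plausible reconstruction is the standard impulse-momentum balance for the two phases, written with the symbols defined next. Here $v_{rn}$ , $v_{hn}$ , $v_n$ , $v_{rn\tau }$ , and $v_{hn\tau }$ stand for the signed components of the corresponding normal velocities along the line of impact (robot toward human). The authors' exact notation may differ:

$$I_1 = m_r\,(v_{rn} - v_n) = m_h\,(v_n - v_{hn}), \qquad I_2 = m_r\,(v_n - v_{rn\tau }) = m_h\,(v_{hn\tau } - v_n), \qquad e = \frac{I_2}{I_1} \qquad (3)$$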
where $m_r$ and $m_h$ are the masses of the robot and the human, $\textbf{v}_{rn}$ and $\textbf{v}_{hn}$ are the normal velocities of the robot and the human at the beginning of the compression phase, $\textbf{v}_n$ is the common normal velocity of the human and the robot at the end of the compression phase, $\textbf{v}_{rn\tau }$ and $\textbf{v}_{hn\tau }$ are the normal velocities of the robot and the human at the end of the expansion phase, $I_1$ and $I_2$ are the impulses in the compression phase and expansion phase, and $e$ is the coefficient of restitution with the range $e \in [0, 1]$ , which is related to the materials. The impulse $I$ in the collision can be represented as follows:
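The displayed expression is not reproduced in this text. For a central impact, conservation of momentum during compression gives the common velocity $\textbf{v}_n = (m_r\textbf{v}_{rn} + m_h\textbf{v}_{hn})/(m_r + m_h)$ , and the restitution relation $I_2 = e\,I_1$ then yields the standard result, stated here as a reconstruction:

$$I = I_1 + I_2 = (1+e)\,\frac{m_r m_h}{m_r + m_h}\,\left \|\textbf{v}_{rn}-\textbf{v}_{hn}\right \| \qquad (4)$$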
Formula (4) reveals that the collision impulse is proportional to the relative approaching speed $v_a =\left \|\textbf{v}_{rn}-\textbf{v}_{hn}\right \|$ . From the perspective of reducing collision loss, the robot should avoid approaching the humans fast.
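As a numeric illustration of this proportionality, the sketch below evaluates the standard central-impact impulse $(1+e)\,\frac{m_r m_h}{m_r+m_h}\,v_a$ for given agent states. The function name `collision_impulse`, the default restitution coefficient, and the masses in the usage note are hypothetical values chosen for illustration and do not come from the paper.

```python
import math

def collision_impulse(m_r, m_h, p_r, v_r, p_h, v_h, e=0.5):
    """Impulse of a central impact between robot and human.

    The line of impact is taken along the line of centres; e is the
    coefficient of restitution in [0, 1] (default is an assumed value).
    """
    nx, ny = p_h[0] - p_r[0], p_h[1] - p_r[1]
    dist = math.hypot(nx, ny)
    nx, ny = nx / dist, ny / dist
    # Relative approaching speed: relative velocity projected on the line of centres.
    v_a = (v_r[0] - v_h[0]) * nx + (v_r[1] - v_h[1]) * ny
    if v_a <= 0:
        return 0.0                      # agents are not closing in on each other
    reduced_mass = m_r * m_h / (m_r + m_h)
    return (1 + e) * reduced_mass * v_a
```

For a 20 kg robot and a 60 kg human, halving the approaching speed halves the impulse, which is the property the velocity penalty builds on.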
3.3. Penalty formulation of the risk-area
Due to the influence of sensor accuracy, control error, and other factors, robots incur a high collision risk when they get too close to humans. Thus, we incorporate the position penalty $P_t^p$ into the design of the Risk-Area with the following formulation:
where $d_m$ is the minimum separation distance between the robot and the human within a duration $[t-\Delta t, t]$ , and $D_p$ is a threshold distance, which is set to 0.2 m.
According to the analysis in Section 3.2, the impulse of the collision is a linear function of the human-robot relative approaching speed. Besides, a fast approaching speed also causes discomfort to the human [Reference Butler and Agah39]. Therefore, the relative approaching speed of the robot and the pedestrian is applied as the velocity penalty of the Risk-Area to reflect the risk in the environment. Let ${\textbf{v}}_{ht}$ denote the velocity of the human at time $t$ . In practice, the relative approaching speed $v_{at}$ at time $t$ can be calculated as follows:
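The displayed equation is not reproduced in this text. Projecting the relative velocity onto the line connecting the two agents, which is the definition of an approaching speed and agrees with $v_a =\left \|\textbf{v}_{rn}-\textbf{v}_{hn}\right \|$ in Section 3.2, gives the following reconstruction:

$$v_{at} = \frac{({\textbf{v}}_{rt} - {\textbf{v}}_{ht}) \cdot (\textbf{p}_{ht} - \textbf{p}_{rt})}{\left \| \textbf{p}_{ht} - \textbf{p}_{rt} \right \|} \qquad (6)$$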
where $\textbf{p}_{rt}$ and $\textbf{p}_{ht}$ denote the position of the robot and human at time $t$ . When the time scale $\Delta t$ is small enough, we can assume the velocity of the robot and the human is constant. The form of the velocity penalty $P_{t}^v$ during the time period $[t-\Delta t, t]$ is defined as follows:
where $v_{rmax}$ and $v_{hmax}$ are the maximum speed of the robot and the human, respectively, $d_t$ is the separation distance between the robot and the human at time $t$ , and $D_v$ is a threshold distance with the form as follows:
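The displayed equation is not reproduced in this text. The bound $d_{t_k} \gt v_{a{t_k}} m_v + D_p$ used in the proof of Proposition 3.3 and the boundary distance in the proof of Proposition 3.2 both suggest a threshold that grows linearly with the approaching speed. A reconstruction under that reading, for $v_{at} \geq 0$ , is:

$$D_v = m_v\, v_{at} + D_p \qquad (8)$$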
where $m_v$ is an adjustable parameter related to the agents’ motion, which is set to 0.35 in this paper. The Risk-Area penalty function is the sum of the velocity penalty and the position penalty in the following form:
When navigating in the crowd, the robot calculates the Risk-Areas of its neighbors and tries to avoid them. The penalty term of the reward function is given in (9). In addition, the robot is rewarded when it reaches the goal within a limited time. The final reward function has the following form:
3.4. Geometry of the risk-area
Since the Risk-Area penalty function is composed of position penalty and velocity penalty, the geometry of the Risk-Area is also divided into two parts: position Risk-Area and velocity Risk-Area. With the center of the human as the pole and the moving direction of the pedestrian relative to the robot as the polar coordinate axis, the polar coordinate system is established, as shown in Fig. 4 (a). The geometry of the position Risk-Area is a regular circle of radius $r_p=r_h+D_p$ . The boundary point $P=[\rho,\theta ]$ of the velocity Risk-Area needs to meet the conditions in (7) and has the formula as follows:
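The displayed equation is not reproduced in this text. The boundary distance $l_B = m_v\left \|\textbf{v}_{hrt}\right \|\cos \gamma + D_p + r_h$ used in the proof of Proposition 3.2 indicates the following form, given here as a reconstruction for $\theta \in [-\frac{\pi }{2}, \frac{\pi }{2}]$ :

$$\rho = m_v \left \|\textbf{v}_{hrt}\right \| \cos \theta + D_p + r_h \qquad (11)$$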
where $\textbf{v}_{hrt}=\textbf{v}_{ht}-\textbf{v}_{rt}$ is the velocity of the human relative to the robot at time $t$ , $\theta$ is the angle between $OP$ and the reference axis, and $\rho$ is the boundary distance of velocity Risk-Area in the direction of $\theta$ . Based on the penalty functions in (5) and (7), the contours of the Risk-Area are shown in Fig. 4 (b).
Considering the flexible human motion ability and uncertainties in the environment, we do not narrow the scope of the Risk-Area as the relative speed increases. Furthermore, the velocity Risk-Area leaves a large penalty space in the direction of the pedestrian relative to the robot so that it is more likely to receive a penalty when approaching the human fast. When the robot moves into a Risk-Area, it will get punished. For example, in Fig. 4 (c), Robot 1 is punished by position Risk-Area, Robot 2 and Robot 3 are not punished, and Robot 4 is punished by velocity Risk-Area. By introducing relative approaching velocity, the Risk-Area is able to perceive risk in the environment and navigate the robot safely and efficiently.
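The membership test described above can be sketched as follows. The sketch only classifies which part of the Risk-Area, if any, the robot is in; it does not compute penalty magnitudes. The values $D_p = 0.2$ m and $m_v = 0.35$ come from the text, while the function name `risk_area_status`, the 0.3 m radii, the use of surface-to-surface separation, and the velocity threshold $D_v = m_v v_a + D_p$ (inferred from the bound used in the proof of Proposition 3.3) are assumptions.

```python
import math

D_P = 0.2    # position threshold distance in metres (value given in the text)
M_V = 0.35   # velocity parameter m_v (value given in the text)

def risk_area_status(p_r, v_r, p_h, v_h, r_r=0.3, r_h=0.3):
    """Classify the robot as 'collision', 'position', 'velocity' or 'none'."""
    cx, cy = p_h[0] - p_r[0], p_h[1] - p_r[1]
    centre_dist = math.hypot(cx, cy)
    d = centre_dist - r_r - r_h                 # assumed: surface-to-surface separation
    # Relative approaching speed along the line of centres.
    v_a = ((v_r[0] - v_h[0]) * cx + (v_r[1] - v_h[1]) * cy) / centre_dist
    if d < 0:
        return "collision"
    if d < D_P:
        return "position"                       # inside the position Risk-Area
    if v_a > 0 and d < M_V * v_a + D_P:
        return "velocity"                       # inside the velocity Risk-Area
    return "none"
```

Two robots at the same distance from a pedestrian can therefore be treated differently: the one closing in quickly is flagged, while the one moving in parallel is not, mirroring Robot 4 and Robot 3 in Fig. 4 (c).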
Proposition 3.1. Defining the direction of the Risk-Area as the unit vector $\textbf{n} = \frac{\textbf{v}_{hrt}}{\left \| \textbf{v}_{hrt}\right \|}$ , then the impulse of the collision is always maximized in the direction of the Risk-Area.
Proof. Consider a robot position $\textbf{p}_r=[l,\varphi ]\in \mathcal{S}_{RA}, \varphi \ne 0$ at which a collision happens. According to (4), the impulse $I_{pr}$ has the following form:
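The displayed equation is not reproduced in this text. By analogy with the expression $I_B = k\left \|\textbf{v}_{hrt}\right \|\cos \gamma$ used in the proof of Proposition 3.2, it presumably reads:

$$I_{pr} = k \left \|\textbf{v}_{hrt}\right \| \cos \varphi \qquad (12)$$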
where $k$ is a positive constant. If the collision happens in the Risk-Area direction, the impulse is $I_n = k\left \|\textbf{v}_{hrt}\right \|\gt I_{pr}$ . This result implies that the direction of the Risk-Area indicates the maximum collision loss.
Proposition 3.2. In any direction of the velocity Risk-Area, the maximum boundary distance is proportional to the impulse of the collision.
Proof. Consider a point $B = [l_B, \gamma ] \in \mathcal{S}_{vRA}$ on the boundary of the velocity Risk-Area. According to (4), the impulse of the collision is $I_B = k\left \|\textbf{v}_{hrt}\right \|\cos{\gamma }$ . The boundary distance $l_B = m_v\left \|\textbf{v}_{hrt}\right \|\cos \gamma + D_p + r_h$ satisfies the condition in (11). Then, $l_B$ can be represented by $I_B$ as $l_B = \frac{m_v }{k}I_B +(D_p + r_h)$ , where $\frac{m_v }{k}$ is constant. This result reveals that robot motion with a risk of high collision loss is preferentially penalized in the Risk-Area.
Proposition 3.3. Assuming the motion of the agent within a time interval $\Delta t$ can be approximated as constant velocity and $\Delta t \lt m_v$ , it is a sufficient condition for safe robot navigation that the robot does not enter the Risk-Area at any control moment $t_k \in \{t_0, t_1, \ldots, t_{end}\}$ .
Proof. For any given control moment $t_k$ and any pedestrian $i$ , the position of the robot satisfies $\textbf{p}_{rt_k} \notin \mathcal{S}_{RA}$ . Therefore, the distance between the robot and pedestrian $i$ fulfills the condition $d_{t_k} \gt D_v \geq D_p \gt 0$ , indicating that a collision does not occur at time $t_k$ .
For any time within the control interval $\forall T \in (t_k, t_{k+1})$ , the following holds. When $v_{a{t_k}} \leq 0$ , as shown by Robot 3 in Fig. 4 (c), we have $d_T \geq d_{t_k} \gt 0$ , indicating the absence of a collision at time $T$ . On the other hand, when $v_{a{t_k}} \gt 0$ , as shown by Robot 2 in Fig. 4 (c), with the progression of the approach motion, the angle $\theta$ becomes larger or remains 0, and the approaching speed $v_{a{T}} = v_{hrT} \cos{\theta }$ reaches its maximum at $t_k$ within the interval $(t_k, t_{k+1})$ . Then, the distance between the robot and pedestrian $i$ at time $T$ satisfies $d_T \geq d_{t_k} - v_{a{t_k}}(T - t_k) \gt v_{a{t_k}} m_v + D_p - v_{a{t_k}} \Delta t \gt 0$ , indicating the absence of a collision at time $T$ . This conclusion demonstrates that the Risk-Area ensures the safety of navigation.
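The bound in this proof can be checked numerically with a small Monte Carlo sketch. It samples random constant-velocity configurations in which the robot starts outside the Risk-Area, and records the smallest separation reached within one control interval. The helper names, the sampling ranges, the 0.3 m radii, the step `dt=0.25` (any value below $m_v$ would do), and the threshold $m_v v_a + D_p$ taken from the inequality chain above are assumptions for illustration.

```python
import math
import random

D_P = 0.2      # threshold distance D_p in metres (from the text)
M_V = 0.35     # parameter m_v (from the text)
RADIUS = 0.3   # agent radius in metres (assumed equal for robot and human)

def separation(p_r, p_h):
    """Surface-to-surface distance between robot and human."""
    return math.hypot(p_h[0] - p_r[0], p_h[1] - p_r[1]) - 2 * RADIUS

def approaching_speed(p_r, v_r, p_h, v_h):
    """Relative velocity projected on the line of centres."""
    cx, cy = p_h[0] - p_r[0], p_h[1] - p_r[1]
    return ((v_r[0] - v_h[0]) * cx + (v_r[1] - v_h[1]) * cy) / math.hypot(cx, cy)

def advance(p, v, s):
    """Constant-velocity motion for a duration s."""
    return (p[0] + v[0] * s, p[1] + v[1] * s)

def worst_separation(trials=2000, dt=0.25, samples=100, seed=0):
    """Smallest separation seen within one control interval over all trials."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        p_r, p_h = (0.0, 0.0), (rng.uniform(-3, 3), rng.uniform(-3, 3))
        v_r = (rng.uniform(-0.7, 0.7), rng.uniform(-0.7, 0.7))
        v_h = (rng.uniform(-2, 2), rng.uniform(-2, 2))
        d0 = separation(p_r, p_h)
        if d0 <= D_P:
            continue                            # starts inside the position Risk-Area
        v_a = approaching_speed(p_r, v_r, p_h, v_h)
        if v_a > 0 and d0 <= M_V * v_a + D_P:
            continue                            # starts inside the velocity Risk-Area
        for k in range(samples + 1):
            s = dt * k / samples
            worst = min(worst, separation(advance(p_r, v_r, s), advance(p_h, v_h, s)))
    return worst
```

If the argument of the proof holds, `worst_separation()` stays positive whenever `dt` is smaller than `M_V`; choosing `dt` larger than `M_V` removes that guarantee.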
Proposition 3.4. In the Risk-Area, the gradient of the velocity penalty function with respect to the position of the robot is always perpendicular to that of the position penalty function.
Proof. When the robot enters the Risk-Area without collision at time $t$ , the position penalty $P_{t}^p$ and velocity penalty function $P_t^v$ can be expressed by $\textbf{p}_{rt}$ and $\textbf{p}_{ht}$ as follows:
where $c_p$ and $c_v$ are positive constants. For each human, we assume that the position is static relative to the world coordinate system so as to analyze the influence of the robot’s position change on the risk penalty. Then, the partial derivative of $P_{t}^p$ and $P_{t}^v$ with respect to $\textbf{p}_{rt}$ are as follows:
where $\textbf{v}_{at\perp }$ denotes the component of velocity $\textbf{v}_{rht}$ perpendicular to $\textbf{v}_{at}$ . It is obvious that $\nabla P^{p}_{t}$ has the same direction as $\textbf{v}_{at}$ . Hence, the gradient $\nabla P^{p}_{t}$ is perpendicular to $\nabla P^v_{t}$ .
According to (2), the collision avoidance policy $\pi ^*({\textbf{s}}^{jn}_t)$ of the robot is inevitably affected by the current reward $R_{t}(\textbf{s}^{jn}_t,\textbf{a}_t)$ . The robot tends to move in the direction in which the penalty decays fastest in order to maximize the final reward. Formula (16) shows that the position penalty decays most quickly in the direction of $-\nabla P^p_{t}$ , which encourages the robot to move away from the human. However, the direction of $-\nabla P^p_{t}$ deviates from the target in some cases, as shown in Fig. 5. In comparison, the gradient $-\nabla P^v_{t}$ is perpendicular to that of the position penalty, which encourages the robot to bypass the human effectively.
4. Experiments and results
4.1. Simulation environment setup
The simulation environment is built in Python with the PyTorch library. Two existing methods, the position-based reward function [Reference Chen, Liu, Everett and How33] and the DZ reward function [Reference Samsani and Muhammad19], are implemented as baselines. First, we use 3 k episodes of ORCA demonstrations to pretrain the model with imitation learning. Then, the model is trained for 10 k episodes with the RL method. The RL parameters include a learning rate of 0.001 and a discount factor $\gamma$ of 0.9. All the agents in the simulation are modeled as circles with a diameter of 0.6 m. The minimum and maximum speeds of pedestrians during crossovers are 0.5 m/s and 2.5 m/s [Reference Hoyt and Taylor40, Reference Minetti48]. Nevertheless, to be conservative about safety, we set the maximum speed of humans to 3 m/s in our experiments. To fully evaluate the effectiveness of the proposed method, we look into two robot settings, holonomic and non-holonomic, along with three distinct pedestrian crossing scenarios: circle crossing, square crossing, and group crossing. In the non-holonomic robot setting, the robot’s action space consists of 16 directions evenly spaced between 0 and $\frac{\pi }{2}$ combined with five exponentially spaced speeds between 0 and 1.0 m/s, resulting in 80 actions. The humans in the simulation navigate following the ORCA policy. In the holonomic robot setting, the robot can move in 16 directions evenly spaced between 0 and $2\pi$ with five exponentially spaced speeds between 0 and 1.0 m/s. In the circle crossing scenario, the initial positions of the humans are on a circle of radius 6 m, and the goals of the humans are set exactly opposite to the initial positions. In the square crossing scenario, the initial positions of humans are set along any pair of opposite sides of a square with a side length of 6 m, while their destinations are generated randomly on the opposite side of the initial positions.
In the group crossing scenario, several static groups of 2-5 individuals are generated within the environment, while the remaining pedestrians are generated to engage in circle crossing motion. In the first two crowd scenarios, the number of humans is set to 5, while in the group crossing scenario, the number of humans ranges from 5 to 10. In the square crossing scenario, the radius of humans is set between 0.3 m and 0.5 m, while in the other crossing scenarios, the radius of humans is uniformly set to 0.3 m. The robot in all settings is set invisible to the human.
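The discrete action space described above can be sketched as follows. The counts (16 directions, five speeds, 80 actions) and the angular ranges follow the text; the function name `build_action_space` and the particular exponential spacing formula are assumptions, since the text does not state how the speeds are spaced beyond "exponentially".

```python
import math

def build_action_space(v_pref=1.0, n_speeds=5, n_rotations=16, holonomic=True):
    """Return a list of (speed, direction) pairs for the discrete action space."""
    # Assumed spacing: speeds grow exponentially and the largest equals v_pref.
    speeds = [(math.exp((i + 1) / n_speeds) - 1) / (math.e - 1) * v_pref
              for i in range(n_speeds)]
    if holonomic:
        # 16 directions evenly spaced over the full circle [0, 2*pi).
        rotations = [2 * math.pi * k / n_rotations for k in range(n_rotations)]
    else:
        # 16 directions evenly spaced between 0 and pi/2, as stated in the text.
        rotations = [(math.pi / 2) * k / (n_rotations - 1) for k in range(n_rotations)]
    return [(speed, rotation) for rotation in rotations for speed in speeds]
```

Either setting yields 16 x 5 = 80 actions; only the range of admissible directions differs.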
We employ 10 k sets of pedestrian trajectories generated by ORCA as training data, similar to refs. [Reference Samsani and Muhammad19] and [Reference Chen, Liu, Kreiss and Alahi36]. These datasets contain detailed information, such as pedestrian positions, velocities, and target destinations, which supports effective learning and inference of our models. Additionally, we test the robot in 500 different scenarios. Performance measures include Success Rate, Collision Rate, Navigation Time, Danger Frequency, and Minimum Separation Distance. Success Rate (SR) is the ratio of the number of experiments in which the robot reaches its goal to the total number of experiments. Collision Rate (CR) is the ratio of the number of experiments in which the robot collides with humans to the total number of experiments. Navigation Time (NT) is the average time for the robot to reach its goal. Danger Frequency (DF) is the ratio of the number of times that the robot gets too close to the human ( $d_m\lt 0.2$ m) to the total time of navigation. Minimum Separation Distance (MSD) is the average minimum separation distance when the robot gets too close to the human.
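These measures can be computed from per-episode logs as in the sketch below. The log format (an `outcome` label, a navigation `time`, and a list `min_dists` of per-step minimum separation distances) and the function name `summarize_episodes` are hypothetical. Danger Frequency is interpreted here as the fraction of time steps with $d_m \lt 0.2$ m, and Navigation Time as the mean over successful episodes; both are assumptions about details the text leaves open.

```python
def summarize_episodes(episodes, danger_threshold=0.2):
    """Compute SR, CR, NT, DF and MSD from a list of episode logs."""
    n = len(episodes)
    success_times = [e["time"] for e in episodes if e["outcome"] == "success"]
    collisions = sum(1 for e in episodes if e["outcome"] == "collision")
    all_steps = [d for e in episodes for d in e["min_dists"]]
    too_close = [d for d in all_steps if d < danger_threshold]
    return {
        "SR": len(success_times) / n,
        "CR": collisions / n,
        # Average time to goal over successful episodes only (assumed).
        "NT": sum(success_times) / len(success_times) if success_times else float("nan"),
        # Fraction of time steps in which the robot is too close to a human (assumed).
        "DF": len(too_close) / len(all_steps) if all_steps else 0.0,
        # Mean separation over the too-close steps.
        "MSD": sum(too_close) / len(too_close) if too_close else float("nan"),
    }
```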
4.2. Performance comparison with different RL methods
The safety of the navigation process is the most important in evaluating the navigation performance of the robot in crowded environments. To test the performance of different reward functions, we select three existing well-known deep reinforcement learning methods, including collision avoidance with deep reinforcement learning (CADRL) [Reference Chen, Liu, Everett and How33], long short-term memory with reinforcement learning (LSTM_RL) [Reference Everett, Chen and How35], and social attention with reinforcement learning (SARL) [Reference Chen, Liu, Kreiss and Alahi36] as the base training models. All the reinforcement learning models are trained with four different reward functions: position-based reward function, DZ reward function, RVO reward function, and RA reward function proposed in this paper. To facilitate a comprehensive comparison of the model performance, we also test the ORCA algorithm in the experiment. However, it should be noted that the original ORCA is designed for holonomic robots. In this paper, we conduct the ORCA experiment within the holonomic robot setting to maintain consistency with its original scope.
The experiment results of the non-holonomic robot are listed in Table I. The bold data indicate the best results. Among the three reinforcement learning models, SARL performs best in both safety and task accomplishment by introducing the attention mechanism and aggregating multi-agent information. From the perspective of reward functions, the models trained with the position-based reward function show poor safety and task accomplishment because the relative velocity between the robot and the human is neglected. As a result of taking pedestrian actions into account, the DZ method achieves better safety performance than the position-based reward function in terms of danger frequency and collision rate. Compared with the previous methods, the Risk-Area incorporates the relative motion of agents into its design, and the models combined with the proposed reward function achieve superior performance in terms of safety and task accomplishment. In particular, SARL-RA achieves the best performance of all methods. Its collision rate and danger frequency are the lowest, and its success rate and minimum separation distance are significantly improved compared with those of the previous methods. Moreover, the navigation time listed in the table shows that the model trained with the proposed reward function navigates the robot efficiently. The experimental results obtained from the three different crowd settings are similar, and our method consistently achieves better results in both task accomplishment and safety metrics.
The navigation performance in the holonomic robot setting is shown in Table II. Because the robot is set invisible to pedestrians, the reciprocal assumption of the ORCA algorithm is violated, resulting in poor navigation performance. Its success rate and safety metrics in the various scenarios are worse than those of the reinforcement learning methods. The reinforcement learning model SARL still outperforms CADRL and LSTM_RL in both safety and task accomplishment. Furthermore, all three deep reinforcement learning models perform better when trained with the proposed reward function. In the three different crowd scenarios, the average success rate of CADRL-RA is 12.3% and 9.0% higher than that of CADRL and CADRL-DZ, respectively. Moreover, the average minimum separation distance and collision rate of CADRL-RA are improved compared to those of the previous reward functions. For LSTM_RL, the model trained with the proposed reward function also achieves higher safety performance while requiring less navigation time than the previous methods. As for SARL, SARL-RA reaches the best performance among all the algorithms by achieving the highest safety and task accomplishment. The success rate of SARL-RA is higher than that of SARL and SARL-DZ, and the danger frequency is significantly decreased. In the group crossing scenarios, the danger frequency of SARL-RA reaches 0.01, which is the lowest among all models. In general, our approach demonstrates superior performance compared to the state-of-the-art method, Danger Zone, across various metrics. It achieves an average increase of 7.2% in success rate, while reducing the collision rate by 4.7% and the danger frequency by 52.4% in all the experiments. These results confirm the effectiveness of our approach in mitigating collision risks and enhancing overall navigation performance.
Fig. 6 shows the non-holonomic robot navigation results of the three reward functions based on the SARL algorithm in three different crossing scenarios. For fairness, the test cases in the same crossing scenario use the same human trajectories. In Fig. 6 (a), SARL adopts an aggressive avoidance behavior and attempts to move in front of Human 2, leading to a dangerous situation at 8.75 s. Similar situations also arise in Fig. 6 (d) and Fig. 6 (g), where SARL exhibits risky navigation behavior during interactions with moving humans. The DZ method pays too much attention to the forward direction of fast-moving pedestrians while ignoring the possible risk in other directions. In Fig. 6 (b), the robot following SARL-DZ tries to move behind Human 2 but gets too close to the side of the human. Moreover, the DZ method reduces to the original position-based penalty when interacting with stationary pedestrians, leading to situations where the robot gets too close to static pedestrians, as illustrated in Fig. 6 (h). In comparison, SARL-RA selects an appropriate collision avoidance strategy and keeps a safe distance from all humans. This is because our approach establishes the Risk-Area based on collision theory and the relative movement of the robot and pedestrians while taking into account all directions of human-robot interaction. According to (16), the introduction of relative speed in the Risk-Area also helps the robot learn effective collision avoidance actions, thus ensuring navigation efficiency. The experimental results show that the proposed Risk-Area reward function, combined with the SARL algorithm, yields safe and socially compliant navigation.
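To illustrate why relative motion matters when assessing risk, the following Python sketch computes the time and distance of closest approach between a robot and a pedestrian under a constant-velocity assumption. This is a generic kinematic calculation and not the Risk-Area formulation of (16); the function names, the default radii, and the constant-velocity assumption are hypothetical choices made for this sketch.

```python
import math
from typing import Tuple

Vec = Tuple[float, float]


def closest_approach(p_robot: Vec, v_robot: Vec,
                     p_human: Vec, v_human: Vec) -> Tuple[float, float]:
    """Return (time, distance) of closest approach for constant velocities."""
    # Position and velocity of the human relative to the robot.
    rx, ry = p_human[0] - p_robot[0], p_human[1] - p_robot[1]
    vx, vy = v_human[0] - v_robot[0], v_human[1] - v_robot[1]
    speed_sq = vx * vx + vy * vy
    if speed_sq == 0.0:
        t = 0.0  # no relative motion: the distance never changes
    else:
        # Minimise |r + v t|; clamp to t >= 0 because only the future matters.
        t = max(0.0, -(rx * vx + ry * vy) / speed_sq)
    return t, math.hypot(rx + vx * t, ry + vy * t)


def will_collide(p_robot: Vec, v_robot: Vec, p_human: Vec, v_human: Vec,
                 r_robot: float = 0.3, r_human: float = 0.3) -> bool:
    """True if two discs with the given (assumed) radii would overlap."""
    _, distance = closest_approach(p_robot, v_robot, p_human, v_human)
    return distance < r_robot + r_human
```

For the same current separation, a pedestrian approaching head-on is flagged while one moving away is not, which is the kind of distinction a purely position-based penalty cannot make.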
4.3. Performance comparison in different dense crowds
In practice, the number of humans in the environment varies. To test the safety of the model, we further compare the navigation performance of our reward function with the baselines in crowds of different densities. The experiments are conducted with the best-performing models from Section 4.2, namely SARL, SARL-DZ, and SARL-RA. The number of humans in the crowd is set to $N\in \{7,9,11,13,15\}$. In the non-holonomic robot setting, the task accomplishment and safety of all three models decrease as the crowd density increases, as shown in Fig. 7 (a)-(c). SARL trained with the position-based reward function shows the fastest performance degradation. SARL-DZ achieves better task accomplishment than SARL across the different crowd densities. In comparison, SARL-RA achieves the best performance, maintaining the highest success rate and minimum separation distance. The danger frequency of SARL-RA is also the lowest in all the scenarios. The experimental results show that the proposed reward function improves the safety of the model across different crowd densities.
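The density experiment amounts to evaluating each trained policy at every crowd size and comparing how quickly the metrics degrade. A minimal Python sketch of that loop is given below. The `evaluate` callback (which would run test episodes for a given policy and crowd size and return a metric dictionary), the function names, and the metric key `"success_rate"` are hypothetical; only the policy names and crowd sizes come from the text.

```python
from typing import Callable, Dict, Sequence

CROWD_SIZES = (7, 9, 11, 13, 15)            # values of N used in the text
POLICIES = ("SARL", "SARL-DZ", "SARL-RA")   # models compared in this section

Metrics = Dict[str, float]
Results = Dict[str, Dict[int, Metrics]]


def sweep(evaluate: Callable[[str, int], Metrics],
          policies: Sequence[str] = POLICIES,
          crowd_sizes: Sequence[int] = CROWD_SIZES) -> Results:
    """Evaluate every policy at every crowd size.

    `evaluate(policy, n)` is a hypothetical callback supplied by the caller.
    """
    return {p: {n: evaluate(p, n) for n in crowd_sizes} for p in policies}


def degradation(results: Results, policy: str,
                metric: str = "success_rate") -> float:
    """Drop in `metric` between the sparsest and the densest crowd."""
    per_size = results[policy]
    return per_size[min(per_size)][metric] - per_size[max(per_size)][metric]
```

A smaller `degradation` value for a policy means that its performance is less sensitive to the increase in crowd density.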
The experimental results in the holonomic robot setting are shown in Fig. 7 (d)-(f). As in the non-holonomic robot setting, the performance of SARL degrades quickly as the number of humans in the environment increases. The success rate of SARL is only 62% when the number of humans reaches 15. Benefiting from the consideration of human behaviors, SARL-DZ achieves a lower danger frequency than SARL. However, its minimum separation distance is not ideal because the relative movement among agents is neglected. The experiment shows that SARL-RA is the least affected by the increase in crowd density. The model trained with our approach has the best performance in both success rate and safety. The results indicate that the reward function has a notable effect on the navigation performance of the model. Moreover, by introducing relative movement and collision theory to represent the risk in the environment correctly, the model generalizes well to navigation in crowds of different densities.
Fig. 8 visualizes some typical results in the holonomic robot setting for all three models: SARL, SARL-DZ, and SARL-RA. The methods with risk perception mechanisms exhibit more efficient avoidance behaviors and reduce the freezing robot problem. The avoidance behaviors of SARL are aggressive, with dangerous situations occurring frequently during travel, as shown in Fig. 8 (a) and (b). Moreover, in Fig. 8 (c)-(d), SARL suffers severely from the freezing robot problem, and its path contains multiple detours, which waste time and cause more dangerous situations during navigation. Benefiting from the prediction of human behaviors, SARL-DZ alleviates the freezing robot problem, but unnatural and risky behaviors still occur during collision avoidance, as shown in Fig. 8 (e). In contrast, SARL-RA makes the robot take clear paths to ensure safety and comfort instead of entering the human cluster, since it considers the risk posed by all surrounding humans based on their relative movement. In general, the proposed method, combined with SARL's modeling of the collective importance of the crowd, is able to avoid the freezing robot problem and generalizes better to crowds of different densities.
4.4. Real-world experiment
In addition to the simulation experiments above, we conduct real-world experiments to verify the effectiveness of the proposed method. We deploy the SARL-RA policy, which has the best performance in the non-holonomic robot simulations, on a Turtlebot2 platform in an indoor environment shared by humans and the robot. The robot is equipped with a SLAMTEC RPLIDAR A3 for pedestrian detection, and the action commands are generated by an IRU mini PC running Ubuntu 16.04 and ROS Kinetic. The positions and velocities of the robot and pedestrians are obtained using the ROS packages amcl and leg_detector, respectively. The crossing scenarios in the real environment are similar to those in the simulated environment, including circle, square, and group settings. However, in the real environment, after completing the same navigation tasks as in the simulated environment, pedestrians continue to move randomly according to their own intentions. Additionally, due to sensor delays, pedestrian movements exhibit higher uncertainty than in the simulated environment.
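Noisy and delayed position measurements translate directly into noisy velocity estimates, which is one reason pedestrian motion is less certain on the real robot. As a minimal sketch of one common way to estimate velocity from successive position measurements, the following Python class applies a finite difference followed by exponential smoothing. It does not describe how leg_detector works internally, nor the exact pipeline used in these experiments; the class name, the smoothing factor `alpha`, and the handling of stale measurements are all assumptions of this sketch.

```python
from typing import Optional, Tuple

Vec = Tuple[float, float]


class VelocityEstimator:
    """Finite-difference velocity estimate with exponential smoothing."""

    def __init__(self, alpha: float = 0.5) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha              # weight given to the newest measurement
        self.velocity: Vec = (0.0, 0.0)
        self._last_pos: Optional[Vec] = None
        self._last_t: Optional[float] = None

    def update(self, pos: Vec, t: float) -> Vec:
        """Feed one timestamped position (metres, seconds); return velocity."""
        if self._last_pos is not None and self._last_t is not None:
            dt = t - self._last_t
            if dt <= 0.0:
                # Stale or duplicated measurement: keep the previous estimate.
                return self.velocity
            raw = ((pos[0] - self._last_pos[0]) / dt,
                   (pos[1] - self._last_pos[1]) / dt)
            a = self.alpha
            self.velocity = (a * raw[0] + (1.0 - a) * self.velocity[0],
                             a * raw[1] + (1.0 - a) * self.velocity[1])
        self._last_pos, self._last_t = pos, t
        return self.velocity
```

A smaller `alpha` suppresses measurement noise at the cost of a slower response to sudden changes in pedestrian motion, which is the relevant trade-off when pedestrians move quickly.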
The navigation results in real-world environments are presented in Table III. Across all the experimental cases, only SARL has a collision, which occurs once. Both SARL-RA and SARL-DZ complete the navigation tasks safely. Our method has the fewest dangerous situations among all the algorithms while maintaining a favorable minimum separation distance. Because of the maximum speed limit of the Turtlebot2 robot, the navigation time in the real environment is inevitably longer than in the simulated environment. Among the compared methods, SARL-RA achieves the shortest average navigation time.
Fig. 9 shows the robot navigating through the crowd. The numbers in red are the participant numbers. The blue dashed lines with arrows indicate the trajectories of the participants. The purple dots indicate the trajectory of the robot. The green arrow represents the action command generated by the SARL-RA algorithm. The red arrow denotes the current pose of the robot. The white arrows represent the velocities of people detected by the 2D lidar. The red dots are the local targets generated by the robot navigation framework. These navigation cases demonstrate that the robot effectively avoids fast-moving pedestrians in the environment. As observed in Fig. 9 (a) and Fig. 9 (b), the robot proactively executes avoidance maneuvers for pedestrians approaching from the opposite direction and makes way in time for pedestrians moving in the same direction. Additionally, Fig. 9 (c) shows that the robot maintains a safe distance from stationary pedestrians and selects appropriate routes to avoid them. Throughout the navigation scenarios, the robot neither collides with pedestrians nor obstructs their movement. The real-world experiments demonstrate that our Risk-Area method, combined with SARL's modeling of the collective importance of the crowd, yields safe and socially compliant robot navigation.
5. Conclusions and future work
This paper proposes a safety mechanism named Risk-Area within a reinforcement learning framework for safe, socially aware robot navigation in fast-moving and densely populated environments. First, we establish a planar collision model of the human and the robot and formulate Risk-Areas around each human based on the relative movement of the agents. Then, the reward function and the geometry of the Risk-Area are established to ensure the safety of crowd-robot navigation. After that, within the deep reinforcement learning framework, the robot is trained to understand real-time human-robot interactions and to avoid entering the Risk-Area. The experimental results confirm that our approach outperforms the state-of-the-art method in terms of safety and task accomplishment. In addition, introducing the Risk-Area also improves the safety of the model in crowds of different densities. Finally, we deploy our method combined with SARL on a Turtlebot2 robot, which successfully navigates in real-world environments among fast-moving pedestrians.
While our approach has proven effective in enhancing safety in various crowded scenarios, the current risk representation based on collision theory cannot express irregularly shaped moving objects, which requires further research. Additionally, our sparse reward function contains no reward related to the robot's proximity to the goal, which limits improvements in navigation efficiency. In future work, we plan to 1) integrate camera data with depth, semantic labels, and optical flow into our system to overcome the limitation of using only 2D LiDAR information; 2) improve our reward function to incentivize more efficient navigation; and 3) extend our risk representation based on collision theory to more general environments involving pedestrians and other static obstacles.
Author contributions
ZF, BX, and CW conceived and designed the study. ZF and BX conducted the experiments and performed statistical analyses. ZF, BX, and CW wrote the article. CW and FZ reviewed and edited the manuscript.
Financial support
This work was supported in part by the Key R & D Project of Shandong Province under Grant (2023TZXD018), the Jinan “20 New Colleges and Universities” Funded Scientific Research Leader Studio under Grant (2021GXRC079), the Central Government Guiding Local Science and Technology Development Foundation of Shandong Province under Grant (YDZX2023122), and the Jinan City and University Cooperation Development Strategy Project under Grant (JNSX2023012).
Competing interests
The authors declare that they have no conflict of interest.
Ethical approval
None.