1. Introduction
Assembly is a common and fundamental task in the robot manufacturing industry, covering axle assembly, printed-circuit board assembly, and similar operations, and it faces high-density, high-complexity, large-scale, and flexible production demands [Reference Gao, Li, Yu and Qiu1]. Applying reinforcement learning (RL) to robot assembly tasks can help to meet these demands. In recent years, deep reinforcement learning (DRL) algorithms have fueled a wealth of publicized achievements of artificial intelligence in robotics [Reference Fujimoto, Hoof and Meger2–Reference Haarnoja, Zhou, Hartikainen, Tucker, Ha, Tan, Kumar, Zhu, Gupta, Abbeel and Levine5]. However, there are still limitations in terms of reliability and robustness. Indeed, recent research shows that existing DRL methods are brittle, difficult to transfer, unreliable in reality, and prone to overfitting. It is still difficult to generate an ideal skill model for a robot manipulator [Reference Kwan, Wang, Wang and Wong6]. Therefore, it is necessary to find a way to realize the practical application of RL on robots.
Path planning is a traditional method for solving robot manipulation problems, but it sometimes struggles to cope with unstructured environments or to incorporate expert experience [Reference Li and Qiao7]. RL is a novel and effective method for path planning, which can help robots automatically learn about environmental changes to optimize models in an unknown environment. However, the exploration-exploitation dilemma remains a central challenge for the performance of RL algorithms [Reference Khlif, Nahla and Safya8]. We introduce enlightenment learning to balance exploitation and exploration, since excessive exploration leads to a decrease in cumulative returns, while excessive exploitation locks agents in local optima. Learning from demonstration (LfD) is a natural solution for incorporating expert experience [Reference Dong, Si and Yang9]. References [Reference Qiao, Wu, Zhong, Yin and Chen10, Reference Qiao, Zhong, Chen and Wang11] adequately illustrate that human-inspired approaches can effectively improve the performance of robots. Imitation learning can use demonstrations of successful behavior to train policies that imitate the expert [Reference Schaal, Ijspeert and Billard12]. However, obtaining motions that come closer to human experience is still a challenge. Reference [Reference Wen, Lian, Bekris and Schaal13] proposed a visual method to generate demonstration data that is more human-like and can quickly train the robot for different manipulation strategies without otherwise complicated manual programming. However, vision-based demonstration introduces a new problem: the human hand motion must be converted into robot hand motion to command the robot, a process called motion retargeting. In this paper, we adopt accurate camera calibration and coordinate transformation to solve it.
The pretraining process is pivotal in RL, offering advantages such as accelerated convergence, stability, and improved sample efficiency [Reference Rajeswaran, Kumar, Gupta, Vezzani, Schulman, Todorov and Levine14, Reference Vecerik, Hester, Scholz, Wang, Pietquin, Piot, Heess, Rothörl, Lampe and Riedmiller15]. It provides a beneficial starting point by initializing the model with task-specific knowledge, helping mitigate issues like poor convergence, and facilitating the transfer of learning between related tasks. In the realm of DRL, pretraining is particularly valuable for initializing neural network weights, overcoming challenges associated with deep network training, and guiding more informed exploration strategies. Overall, pretraining enhances the efficiency, robustness, and adaptability of RL algorithms, especially in complex environments with high-dimensional state spaces [Reference Vecerik, Hester, Scholz, Wang, Pietquin, Piot, Heess, Rothörl, Lampe and Riedmiller15]. The success of pretraining in RL relies on the quality and size of the dataset used. A high-quality dataset should be representative and diverse, capturing relevant aspects of the target environment and providing the model with a comprehensive understanding of the task. The dataset’s size is a critical factor, as larger datasets offer more diverse experiences, enabling better generalization; however, the size should be balanced against computational considerations [Reference Zhang, Feng, Wang, Xu, Xu, Liu and Du16]. A dataset’s relevance to the target task is paramount, ensuring that task-specific features contribute to the model’s adaptability. Striking the right balance between dataset quality and size is essential for optimizing the pretraining process and enhancing the model’s overall performance. In this paper, leveraging human visual demonstrations in the pretraining dataset for RL enhances the model’s adaptability, decision-making processes, and overall performance, while providing a rich, diverse, and contextually relevant set of experiences.
Owing to the limitations of gathering real-world data, namely sample inefficiency and difficulty in data collection, simulation environments are more convenient for training RL agents. Nonetheless, when we transfer an agent trained in simulation to the real world, the gap between reality and simulation reduces the performance of the policy. Multiple research efforts are directed toward closing the sim-to-real gap and solving the problem of sim-to-real transfer. As surveyed in [Reference Zhao, Queralta and Westerlund17], the main methods of sim-to-real transfer currently used in DRL include domain randomization, domain adaptation, imitation learning, meta-learning, and knowledge distillation. In order to complete the learning of robot assembly skills for general use in reality, in this paper, we propose a novel framework that can transfer policies trained in simulation to the real environment with a sim-to-real controller, as shown in Fig. 1. The problems addressed in this paper are as follows:
How to generate more human-like demonstrations in reality: We introduce a two-step method to derive a manipulation trajectory from a single third-person-view visual demonstration. Our approach uses a binocular camera mounted outside the robot. We then transfer the trajectory from the camera coordinate system to the robot coordinate system through the calibration parameters of the camera.
How to train the RL model more efficiently: There are three main ways to incorporate demonstration experience into RL: pretraining the RL model, adding an imitation learning loss while training, and putting the demonstration data into a replay buffer. Since proximal policy optimization (PPO) is an on-policy RL method, we pretrain the RL network to learn from the expert experience.
How to fill the sim-to-real gap: By collecting visual information with the binocular camera in reality, we can generate the demonstration trajectory and achieve automatic error correction to fill the reality gap. In addition, we adopt the domain randomization method during training to improve the robustness of the policy.
We carry out our experiments in the MuJoCo simulator. The primary contributions of this paper can be summarized as follows:
1. We propose a vision-based method to extract demonstration trajectories in realistic industrial scenarios, using SoLo and the iterative closest point (ICP) algorithm to extract realistic teaching trajectories with an accuracy within 2 mm. We improve the ICP step to make it more accurate.
2. We develop an assembly RL training environment and successfully transfer real-world data to simulation training. We assume the agent is in an unknown, unstructured environment and learns by exploration. We use enlightenment learning to improve PPO so that it learns from the demonstration experience. The success rate increases from 80% to 96%, and the generated policy reaches the target faster.
3. In order to apply the policies in realistic industrial application scenarios, we reduce the sim-to-real gap by visual error estimation and develop a sim-to-real controller. We evaluate the proposed approach on the recently established NIST gear assembly benchmark [Reference Kimble, Van Wyk, Falco, Messina, Sun, Shibata, Uemura and Yokokohji18] to enable a standardized comparison. We compare the proposed method with the demonstration and with similar approaches, such as the original PPO and DDPGfD.
The rest of the paper is organized as follows. The next section reviews recent related work. Section 3 introduces the preliminary knowledge required for this paper. Section 4 presents the overall framework and introduces each part of our system in detail; this section contains the main contributions of the proposed approach. Section 5 presents the experimental results in detail. Finally, we draw our conclusions in Section 6. Our studies show that the proposed methods achieve strong performance in terms of efficiency.
2. Related work
Path planning is a key research issue in the field of robotic applications, including local and global planning. For robot assembly, planning includes strategy planning and trajectory planning. Reference [Reference Wu, Liu and Wang19] analyzed the contact states and force conditions in the insertion stage to plan a strategy combined with impedance control. Reference [Reference Lee and Ro20] developed a planning strategy for path finding and grasp planning. Trajectory planning methods include A* [Reference Hart, Nilsson and Raphael21], D* [Reference Stentz22], RRT [Reference Kuffner and LaValle23], etc. Map-based motion planning methods have been demonstrated numerically, but their effectiveness degrades in unknown or unstructured environments. Reference [Reference Dong, He, Song and Sun24] summarizes representative and state-of-the-art works on the classical motion planning architecture and RL-based approaches. As the complexity and randomness of the environment increase, the planning capability of classical hierarchical motion planners is challenged. Reference [Reference Li and Qiao7] illustrated that most planning methods for robotic assembly are based on sensing information alone rather than on the integration of sensing information and environmental constraints. Human-inspired methods have not yet been applied to assembly tasks either. Reference [Reference Inoue, De Magistris, Munawar, Yokoya and Tachibana25] illustrated that DRL achieves better performance with tighter clearances and robustness against positional and angular errors for the peg-in-hole task. In this paper, we formulate the assembly problem in an unknown, unstructured environment and aim to obtain an effective and robust strategy.
Recent research shows that it is possible to learn manipulation skills with RL. The existing RL algorithms applied in the field of robotics include TD3 [Reference Fujimoto, Hoof and Meger2], DDPG [Reference Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver and Wierstra3], PPO [Reference Schulman, Wolski, Dhariwal, Radford and Klimov4], SAC [Reference Haarnoja, Zhou, Hartikainen, Tucker, Ha, Tan, Kumar, Zhu, Gupta, Abbeel and Levine5], and so on. These can be divided into model-free and model-based methods, or into on-policy and off-policy methods. In this paper, we apply PPO, a model-free, on-policy RL algorithm, to realize the robot manipulation task. Model-free RL methods are capable of training an agent to act directly on the environment, and on-policy RL also helps the training converge faster. However, reference [Reference Dong, He, Song and Sun24] describes the challenges of RL-based motion planning methods, such as the reality gap, the reward sparsity problem, low sample efficiency, and the generalization problem. To address these problems, we have investigated and adopted some effective methods.
LfD can be used to combine RL with demonstrations, for example through dynamic movement primitives (DMPs) [Reference Theodorou, Buchli and Schaal26–Reference Kong, He, Chen, Zhang and Wang29]. These methods use trajectory-centric policy representations that are well suited for imitation but do not enable feedback from rich sensory inputs. In some recent work, demonstrations have been used to pretrain the policy, as in DQfD [Reference Hester, Vecerik, Pietquin, Lanctot, Schaul, Piot, Sendonaris, Dulac-Arnold, Osband and Agapiou30]; pretraining RL policies using trajectories from an expert can help to accelerate the training process. Behavior cloning (BC), on the other hand, treats imitation learning from expert demonstrations as a supervised learning problem [Reference Raffin, Hill, Gleave, Kanervisto, Ernestus and Dormann31] and is also an effective imitation learning method. DDPGfD [Reference Vecerik, Hester, Scholz, Wang, Pietquin, Piot, Heess, Rothörl, Lampe and Riedmiller15] successfully combines RL with demonstrations by incorporating them into the RL algorithm; it suits off-policy RL methods because the demonstrations are added to the replay buffer. Besides, DAPG [Reference Rajeswaran, Kumar, Gupta, Vezzani, Schulman, Todorov and Levine14] bootstraps the policy using behavior cloning and combines demonstrations with an on-policy policy gradient method. On-policy RL methods, in turn, are more stable and scale well to high-dimensional spaces.
The traditional demonstration method requires dragging the end of the robot to perform the manipulation task or acting through teleoperation [Reference Xu, Yang, Liu and Li32, Reference Xu, Yang, Zhong, Wang and Zhao33]. Instead, by acquiring the demonstration data visually, the expert experience obtained is closer to human actions. For example, in [Reference Wen, Lian, Bekris and Schaal13], the authors propose a closed-loop, category-level manipulation framework based exclusively on visual feedback, with a robot teaching method from a single visual demonstration. For visual demonstration methods, pose estimation is an important component. There are also some recent end-to-end pose estimation methods based on deep learning [Reference He, Sun, Huang, Liu, Fan and Sun34, Reference Lin, Wang, Ling, Tao and Yang35], but their precision and accuracy struggle to reach the desired level. With the development of computer vision, instance segmentation can now recognize and segment targets effectively, for example, Mask-RCNN [Reference He, Gkioxari, Dollár and Girshick36] and SoLo [Reference Wang, Zhang, Kong, Li and Shen37]. For pose estimation, geometric estimators such as RANSAC/PnP [Reference Zakharov, Shugurov and Ilic38] and feature-matching methods such as PFH [Reference Rusu, Blodow, Marton and Beetz39] and FPFH [Reference Rusu, Blodow and Beetz40] can be used to solve object poses, but the processing time is relatively long. In our work, we propose a two-step object pose estimation method using instance segmentation [Reference He, Gkioxari, Dollár and Girshick36, Reference Wang, Zhang, Kong, Li and Shen37] and ICP [Reference Besl and McKay41] to extract the pose from the visual data and produce a demonstration trajectory. Benefiting from the geometric information contained in the point cloud, this pose estimation method, which combines the image and the point cloud, is more accurate.
In recent years, several transfer methods have been developed for RL agents to act in the real world. Reference [Reference Chen, Zeng, Wang, Lu and Yang42] uses a method similar to digital twins with force information for zero-shot sim-to-real transfer; it corrects errors in real time through a force sensor and achieves a lower reality gap. Reference [Reference Tobin, Fong, Ray, Schneider, Zaremba and Abbeel43] uses domain randomization for transfer, which may raise in-domain error but substantially reduces out-of-domain error. Domain randomization also plays an important role in error correction in our work. Reference [Reference Arndt, Hazara, Ghadirzadeh and Kyrki44] uses domain adaptation, which crosses the sim-to-real gap by mapping simulation and reality into a latent space. On the other hand, using real-world data also benefits model transfer. In this paper, we use a visual demonstration approach that allows direct access to human expert experience.
3. Preliminary
Figure 2 shows the visual demonstration trajectory extraction method proposed in this article. To generalize a trajectory from the visual demonstration, we first need to identify the object and estimate its pose. Thus, we propose a two-step object pose estimation method using instance segmentation and ICP. In this paper, we achieve fast identification and segmentation of the target by SoLo. Then we generate the point cloud of the object from the mask. We use an improved ICP [Reference Besl and McKay41] algorithm on the point cloud to estimate the target pose. In the experiments of this paper, about 20 iterations are performed for each point cloud to obtain an accurate target pose.
The RL problem can be defined as a policy search in a Markov decision process (MDP). RL can be divided into two types: on-policy and off-policy RL. The PPO algorithm uses fixed-length trajectory segments. In each iteration, we collect T timesteps of data and store each transition $\left (s, a, r, s^{\prime }\right )$ in memory to calculate the advantages. $(s, a, r, s^{\prime })$ represents the state, action, reward, and the next state. $V\left (s\right )$ is the value of $s$ calculated by the critic. The advantage-function estimator is defined as follows:
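$$\hat{A}_t=\delta _t+(\gamma \lambda )\delta _{t+1}+\cdots +(\gamma \lambda )^{T-t+1}\delta _{T-1},\qquad \delta _t=r_t+\gamma V\left (s_{t+1}\right )-V\left (s_t\right )$$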
where $\hat{A}_t$ specifies the advantage at timestep $t$ in $[0, T]$, within a given length-$T$ trajectory segment. $\gamma$ is the discount factor of the reward, and $\lambda$ is the discount factor of the steps. After calculating the advantages, we set a hyperparameter $\epsilon$ for the clipped surrogate objective. The clipped surrogate objective is calculated as follows:
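$$L_t^{CLIP}(\theta )=\hat{\mathbb{E}}_t\left [\min \left (r_t(\theta )\hat{A}_t,\ \operatorname{clip}\left (r_t(\theta ), 1-\epsilon, 1+\epsilon \right )\hat{A}_t\right )\right ],\qquad r_t(\theta )=\frac{\pi _\theta \left (a_t\mid s_t\right )}{\pi _{\theta _{\mathrm{old}}}\left (a_t\mid s_t\right )}$$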
To update the neural network architecture that shares parameters between the policy and value function, the loss function consists of three terms: the policy surrogate, a value function error, and an entropy bonus.
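$$L_t^{CLIP+VF+S}(\theta )=\hat{\mathbb{E}}_t\left [L_t^{CLIP}(\theta )-c_1 L_t^{VF}(\theta )+c_2 S\left [\pi _\theta \right ](s_t)\right ]$$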
where $c_1$, $c_2$ are coefficients, $S$ denotes an entropy bonus, and $L_t^{VF}$ is a squared-error loss $\left (V_\theta \left (s_t\right )-V_t^{\operatorname{targ}}\right )^2$.
In reference [Reference Schulman, Wolski, Dhariwal, Radford and Klimov4], there are two approaches to implementing the PPO objective: the clipped surrogate objective and the adaptive KL penalty coefficient. In this paper, we only use the former, because the latter has little effect on the results.
4. Methods
In this paper, we propose a training pipeline, as shown in Fig. 3, for robots to learn assembly skills. The system framework consists of three major parts: demonstration pretraining, PPO training, and sim-to-real transfer. To work in the real world, we estimate the sim-to-real error through vision and design a sim-to-real controller to fill the sim-to-real gap. For transferring the trained model from simulation to reality, we also adopt domain randomization during training, which plays an important role in error correction. In our framework, all the demonstration steps are completed in reality.
4.1. Visual demonstration
For the contact-rich, high-precision manipulation task considered in this article, the agent needs to perform ordered, fine-grained motions to succeed. Adding demonstrations is an intuitive way to show the agent possible solutions to the task and to guide it toward reasonable initial strategies. Because the underlying latent factors are not explicitly modeled, expert demonstrations provided by humans typically exhibit significant variability [Reference Li, Song and Ermon45]. We propose a novel method for extracting high-precision visual teaching trajectories. For each demonstration video frame, the object state is extracted via a two-step pose estimation method, which allows the task demonstration to be represented as an extracted trajectory. Some end-to-end pose estimation methods [Reference He, Sun, Huang, Liu, Fan and Sun34, Reference Lin, Wang, Ling, Tao and Yang35] can do this, but with unacceptable accuracy. We improve the accuracy of the visual demonstration trajectory through point cloud processing and optimization. First, we collect a set of human teaching data consisting of images and depth information using a binocular camera with structured light. We train a SoLo model for object recognition and segmentation, which achieves fast instance segmentation, and obtain the depth information corresponding to the mask of the target object. By combining the depth information with the internal parameters of the camera, we can obtain the real point cloud. We preprocess the point cloud, including downsampling, radius filtering, and Euclidean clustering, to improve the ICP step, as shown in Fig. 4. Compared to end-to-end methods, the 6D pose obtained in this way is closer to the actual state and easier to analyze for error. The error of the pose estimation method we use is about 2 mm.
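As a minimal sketch of this two-step pipeline (written here with Open3D in place of the PCL-based implementation described below, and assuming the SoLo mask and camera intrinsics are already available), the masked depth pixels are back-projected into a point cloud, preprocessed, and registered to the object model with ICP:

```python
import numpy as np
import open3d as o3d

def mask_to_pointcloud(depth, mask, fx, fy, cx, cy, depth_scale=1000.0):
    """Back-project the depth pixels inside the SoLo mask into a 3D point cloud."""
    v, u = np.nonzero(mask)                                  # pixel coordinates inside the mask
    z = depth[v, u].astype(np.float64) / depth_scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(np.stack([x, y, z], axis=1))
    return pcd

def estimate_pose(scene_pcd, model_pcd, init_T=np.eye(4)):
    """Second step: preprocess the segmented cloud and refine the pose with ICP."""
    # Preprocessing: voxel downsampling and radius outlier removal
    # (Euclidean clustering is omitted here for brevity).
    scene = scene_pcd.voxel_down_sample(voxel_size=0.002)
    scene, _ = scene.remove_radius_outlier(nb_points=16, radius=0.01)
    # Point-to-point ICP, limited to ~20 iterations as in our experiments.
    result = o3d.pipelines.registration.registration_icp(
        model_pcd, scene, max_correspondence_distance=0.01, init=init_T,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(),
        criteria=o3d.pipelines.registration.ICPConvergenceCriteria(max_iteration=20))
    return result.transformation                             # 4x4 pose of the model in the camera frame
```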
In this work, we have used the OpenCV and PCL libraries to process the image and point cloud data. After generating the teaching trajectory, we need to convert the teaching trajectory from the image coordinate system to the world coordinate system by hand-eye calibration. We convert the pixel coordinates of the target to the world coordinate system through the internal and external parameters of the camera. The equation for the transformation is as follows:
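$$z_c\begin{bmatrix}u\\ v\\ 1\end{bmatrix}=\begin{bmatrix}\frac{1}{dx} & 0 & u_0\\ 0 & \frac{1}{dy} & v_0\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix}f & 0 & 0 & 0\\ 0 & f & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix}\boldsymbol{R} & \boldsymbol{T}\\ \boldsymbol{0}^{\mathsf{T}} & 1\end{bmatrix}\begin{bmatrix}x_w\\ y_w\\ z_w\\ 1\end{bmatrix}$$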
where $u$ and $v$ are the pixel coordinates in the image and $(u_{0}, v_{0})$ is the pixel position of the image center. $dx$ and $dy$ denote the physical size represented by each column and row of pixels, and $f$ is the focal length of the camera; these are often referred to as the internal parameters. $(x_{c}, y_{c}, z_{c})$ is the position in the camera frame, and $(x_{w}, y_{w}, z_{w})$ is the real position in the world. $\boldsymbol{R}$ is the rotation matrix and $\boldsymbol{T}$ is the translation vector between the camera coordinate system and the world coordinate system, which are often referred to as the external parameters. In the real world, we extract the internal and external parameters of the camera by hand-eye calibration.
4.2. Reinforcement learning
A number of simulators have been developed in response to the rapid development of RL in robotics, such as Pybullet [Reference Coumans and Bai46], MuJoCo [Reference Todorov, Erez and Tassa47], and Isaac. Considering the demands of our experiments and the comparison of different simulators in [Reference Collins, Chand, Vanderkop and Howard48], we choose MuJoCo. We also develop the RL algorithms using OpenAI Gym, which is a convenient toolkit for RL algorithms.
PPO has been shown to work for continuous-space robot control and is one of the commonly used on-policy RL methods for robot skill learning. Compared with off-policy methods such as DDPG, PPO has better stability and adaptability. We improve the PPO algorithm with three modifications that significantly increase the performance of the original algorithm: 1) the experience data retain both demonstration and exploration data, and actions from the human are treated as the agent’s own; 2) we design a simple and efficient training environment and a new shaped reward function for the robot gear assembly task; and 3) we add noise to the parameters and to the actor’s output, which yields a robust policy.
4.2.1. Architecture
We use an Actor-Critic-style PPO algorithm with a policy network and a value network. Both neural networks have three linear hidden layers with 256, 128, and 64 units, respectively. The ReLU activation function is used in all hidden layers. The input state is three-dimensional, the output action is three-dimensional, and the output value is one-dimensional. With the constraint of the clipped surrogate objective in the PPO algorithm, there will not be an excessively large policy update during training, which helps to reduce policy forgetting. Due to this advantage of the PPO algorithm, the retrained network can better retain the experience from the pretraining model and generate a more general policy.
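A minimal sketch of this architecture, assuming a PyTorch implementation and a diagonal Gaussian policy head (neither of which is prescribed above), is as follows:

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    # Three linear hidden layers with 256, 128, and 64 units, ReLU activations.
    return nn.Sequential(
        nn.Linear(in_dim, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, out_dim))

class ActorCritic(nn.Module):
    def __init__(self, state_dim=3, action_dim=3):
        super().__init__()
        self.actor = mlp(state_dim, action_dim)               # policy network: mean of the action
        self.critic = mlp(state_dim, 1)                       # value network: scalar state value
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # assumed Gaussian exploration noise

    def forward(self, state):
        mean = self.actor(state)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        value = self.critic(state)
        return dist, value
```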
However, the shortcomings of traditional robot RL methods make it difficult to obtain a good assembly policy through RL directly, which affects its effectiveness and practicality in real-world applications. We introduce enlightenment learning via pretraining to overcome these shortcomings, combining algorithmic advancements, improved data efficiency, a balance of exploration and exploitation, accelerated training, and reward engineering. Because RL is a random exploration process, the trajectories and policies generated without pretraining are unpredictable. The proposed training strategy enables RL to obtain basic policy units in advance.
After transforming the demonstration data into the robot base coordinate system in the simulator, we can pretrain the policy network and the value network. First, all the demonstration data are converted into $(s, a, r, s')$ transitions and put into the experience pool. During training, 64 groups of data are randomly sampled from the experience pool in each training epoch. After training for 2000 epochs, we obtain a pretrained policy network and value network. The pretrained networks have the same structure as the networks used for RL, including an actor and a critic. In this way, we can initialize the RL model for further training. After the RL model is pretrained, PPO takes over the rest of the training.
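A sketch of this pretraining stage is given below, reusing the ActorCritic module sketched above; treating the actor update as behavior cloning on the demonstrated actions and the critic update as regression onto demonstration returns is an assumption, as is the demo_buffer interface:

```python
import torch

def pretrain(model, demo_buffer, epochs=2000, batch_size=64, lr=1e-3):
    """Pretrain the actor and critic on (s, a, r, s') demonstration transitions."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        # demo_buffer.sample is a hypothetical helper returning states, expert
        # actions, and discounted returns computed from the demonstration rewards.
        s, a, ret = demo_buffer.sample(batch_size)
        dist, value = model(s)
        bc_loss = -dist.log_prob(a).sum(dim=-1).mean()        # behavior cloning of demonstrated actions
        vf_loss = (value.squeeze(-1) - ret).pow(2).mean()     # fit the critic to demonstration returns
        loss = bc_loss + vf_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```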
4.2.2. Environment
There is a gap in applying RL to real-world robot tasks. For example, in real-world scenarios, unexpected actions may lead to safety issues for the robot, and low sampling efficiency in the real world may lead to convergence difficulties in training. It is therefore necessary to create a training environment that can support transfer between simulation and reality. We build a new simulation environment that maintains transferability between domains and trains the agent in Cartesian space. Simulation results show that this speeds up the learning process and greatly reduces training time. We train the agent in Cartesian space using proprioceptive information. Training Cartesian-space skills greatly simplifies the model and reduces the amount of data required for training, which is beneficial for transfer to the real world. The state space and the action space are defined as follows:
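$$s_t=\left (x_t, y_t, z_t\right ),\qquad a_t=\left (\delta x, \delta y, \delta z\right )$$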
where $(x_t, y_t, z_t)$ is the position of the end effector at time $t$ and $(\delta x, \delta y, \delta z)$ is the displacement of the end effector. We also normalize the state space and the action space. In this paper, we train an assembly policy with PPO to obtain the action of the end effector based on the current state of the environment. The actions generated by the policy are in Cartesian space; therefore, we need to transform them from Cartesian space into joint space. We design a solver based on the PyKDL library to handle the inverse kinematics of the robot and control it in joint space.
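A minimal sketch of such an inverse kinematics solver with PyKDL is shown below; the URDF path and the UR link names are placeholders:

```python
import PyKDL
from kdl_parser_py import urdf as kdl_urdf

# Build the kinematic chain from the robot's URDF (file path and link names are placeholders).
ok, tree = kdl_urdf.treeFromFile("ur3.urdf")
chain = tree.getChain("base_link", "tool0")

fk = PyKDL.ChainFkSolverPos_recursive(chain)
ik_vel = PyKDL.ChainIkSolverVel_pinv(chain)
ik_pos = PyKDL.ChainIkSolverPos_NR(chain, fk, ik_vel)

def cartesian_to_joint(q_current, target_xyz, target_rpy=(0.0, 0.0, 0.0)):
    """Solve IK for the Cartesian target produced by the policy action."""
    n = chain.getNrOfJoints()
    q_init, q_out = PyKDL.JntArray(n), PyKDL.JntArray(n)
    for i in range(n):
        q_init[i] = q_current[i]
    target = PyKDL.Frame(PyKDL.Rotation.RPY(*target_rpy), PyKDL.Vector(*target_xyz))
    ik_pos.CartToJnt(q_init, target, q_out)     # writes the joint solution into q_out
    return [q_out[i] for i in range(n)]
```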
Considering safety during robot assembly, and also for more convenient training, we define a mechanism for resetting the environment. The environment is reset when: 1) a robot joint angle exceeds its limit; 2) the end effector leaves the safe workspace; or 3) the number of steps exceeds the set threshold. The thresholds are shown in Table I; a compact sketch of this reset check is given after this paragraph.
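The sketch below encodes the three reset conditions; the concrete limits are left to Table I:

```python
def should_reset(q, q_limits, ee_pos, workspace, step, max_steps):
    """Reset conditions of Section 4.2.2 (joint limits, safe workspace, step threshold)."""
    joints_ok = all(lo <= qi <= hi for qi, (lo, hi) in zip(q, q_limits))
    in_workspace = all(lo <= p <= hi for p, (lo, hi) in zip(ee_pos, workspace))
    return (not joints_ok) or (not in_workspace) or step >= max_steps
```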
4.2.3. Shaped reward
The RL method is guided by reward signals. Instead of using only a sparse reward, we carefully design a shaped reward function, which improves sample validity and speeds up convergence. We use a sparse reward to guide exploration while using a dense reward to provide more timely feedback. The sparse reward is defined as follows:
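$$r_{\mathrm{done}}=\begin{cases}1, & \text{if the assembly task is completed}\\ 0, & \text{otherwise}\end{cases}$$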
Sparse rewards make gradient propagation difficult, so it is necessary to design a dense reward function. Using a shaped reward function requires considerable engineering effort to extract the necessary state information. We compared different reward functions, including linear, exponential, and Gaussian functions. These functions are simple but difficult to adapt to assembly tasks. We find that using a logarithmic function allows the environment to generate larger rewards when the agent is closer to the target. The distance-based reward function is defined as follows:
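Taking $d_t$ as the Euclidean distance between the gear and the shaft, one logarithmic shaping of this kind (with the bias $\upsilon$ subtracted so that the per-step reward stays near zero at the target) is

$$d_t=\sqrt{\left (x_g-x_s\right )^2+\left (y_g-y_s\right )^2+\left (z_g-z_s\right )^2},\qquad r_t=-\log \left (d_t+\epsilon \right )-\upsilon$$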
where $(x_{g}, y_{g}, z_{g})$ is the position of the gear and $(x_{s}, y_{s}, z_{s})$ is the position of the shaft. $d_t$ represents the distance between the current position and the target position at time $t$, and $r_t$ is the reward at time $t$. $\epsilon$ is a constant that prevents the logarithm from being undefined when the distance is zero and is set to 0.005. $\upsilon$ is a bias constant of the reward and is set to 5.0. The environment is reset when $d_t \gt 0.5$ or when a joint angle of the robot exceeds its limit.
Besides, we define a successful attempt as one in which the robot inserts the gear at least 0.02 m onto the shaft. When the task is completed, the environment generates a reward $r_{done} = 1$. In conclusion, the total reward is as follows:
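Combining the dense term and the sparse success bonus additively,

$$r_t^{\mathrm{total}}=r_t+r_{\mathrm{done}}$$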
We collect a set of demonstration data as described in Section 4.1 and store each transition $\left (s, a, r, s^{\prime }\right )$ to pretrain the RL model. In our framework, all the demonstration steps are completed in reality. After pretraining the RL model, PPO controls the rest of the training. Finally, we obtain a usable model to complete the assembly task.
4.3. Sim-to-real transfer
In order not to overfit to any specific absolute coordinate position, we introduce a mechanism that allows the agent to generalize to new positions and fill the sim-to-real gap. Directly using the policy in the real world consistently hurts performance. In our regime, we have found that domain randomization and sim-to-real error correction improve transfer performance. We performed experiments in reality to demonstrate the ability of our trained model to transfer from simulation to reality. We use two methods to do this: training in a randomized environment and filling the sim-to-real gap with a sim-to-real controller. The results show that the trained model is robust enough to work in different domains.
4.3.1. Domain randomization
During training, we add noise to the parameters and to the ground-truth positions. In our framework, domain randomization works well for error correction. We use two types of randomization: observation randomization and dynamics randomization. Observation randomization represents the uncertainty of the target and the initial state. Dynamics randomization represents the model’s inaccuracy while interacting with the environment. We use two types of noise distribution: a Gaussian distribution $N_g\,(\mu, \rho )$ (where $\mu$ is the mean and $\rho$ is the covariance) and a uniform distribution $N_u\,(\mu, \rho )$ (where $\mu$ is the mean and $\rho$ is the absolute value of the upper and lower limits). The domain randomization configuration of the experiment is given in Table II.
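A minimal sketch of how these two noise types can be sampled is given below; the concrete $\mu$ and $\rho$ values are placeholders for the entries of Table II:

```python
import numpy as np

rng = np.random.default_rng()

def gaussian_noise(mu, rho, size):
    # N_g(mu, rho): Gaussian noise with mean mu and covariance rho.
    return rng.normal(loc=mu, scale=np.sqrt(rho), size=size)

def uniform_noise(mu, rho, size):
    # N_u(mu, rho): uniform noise within +/- rho around the mean mu.
    return rng.uniform(low=mu - rho, high=mu + rho, size=size)

def randomize_observation(obs, rho_obs=0.002):
    """Observation randomization: perturb the observed target/initial state (rho_obs is a placeholder)."""
    return obs + gaussian_noise(0.0, rho_obs ** 2, obs.shape)
```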
4.3.2. Sim-to-real controller
It is necessary to create a control method that can achieve the transfer between simulation and reality. In this article, we address this issue by designing an error estimator. The grasp planning algorithm used in this paper is the one proposed in reference [Reference Zhang, Li, Feng and Yang49]. Due to errors in the grasping position and differences between objects, there are differences between reality and simulation. In this paper, we assume that in the assembly task, the reality gap is caused by the different positions of the gear and shaft in the real world and in simulation. We therefore design a sim-to-real controller to fill the sim-to-real gap. We start by moving the robot in reality and in simulation to the same initial state and then identify and compute the state of the target in reality to calculate the error between reality and simulation. Before the policy starts, we obtain an initial estimate of the positions of the gear and shaft both in the real world and in the simulator. When the policy is executed in reality, we superimpose the acquired error on the policy output to control the real robot. The error is calculated as follows:
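$$\boldsymbol{\Delta}\boldsymbol{p}=\boldsymbol{p}_{\boldsymbol{real}}-\boldsymbol{p}_{\boldsymbol{sim}}$$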
where $\boldsymbol{p}_{\boldsymbol{real}}$ and $\boldsymbol{p}_{\boldsymbol{sim}}$ are the positions of the assembly target in the initial state, generated by the 6D pose estimation approach described in Section 4.1, and $\boldsymbol{\Delta}\boldsymbol{p}$ represents the sim-to-real error. While performing tasks, the policy runs in the simulation, and the real robot is controlled by the state of the robot in the simulation combined with the error:
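One such correction, mapping the Cartesian error into joint space through the (pseudo-)inverse of the Jacobian, is

$$\boldsymbol{q}^{\boldsymbol\prime}=\boldsymbol{q}+\boldsymbol{J}(\boldsymbol{q})^{-1}\,\boldsymbol{\Delta}\boldsymbol{p}$$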
where $\boldsymbol{q}$ is the vector of joint angles and $\boldsymbol{J(q)}$ is the Jacobian matrix of the robot at the current moment in the simulator. $\boldsymbol{q}^{\boldsymbol\prime}$ is the vector of real joint angles used to control the real robot.
5. Experiments
Two gears of different sizes from the NIST gear assembly benchmark are used in the experiments. The proposed method is used to manipulate the different gear objects with no additional programming. In the experiments presented in this paper, the following questions are explored: 1) What is the effect of the expert experience added through the visual demonstration? 2) How robust are the skills to different scenes and to uncertainty due to dynamics? 3) Does our framework work well on real-world tasks and robots?
5.1. Simulator environment
We set up the environment in the MuJoCo simulator, which can effectively reduce the reality gap, and demonstrate the proposed method on the gear-shaft assembly task. The experimental environment consists of three main parts: the simulator, the robot, and the assembly objects. Figure 5 shows the entire system in the real world.
5.1.1. Simulator
MuJoCo is a general-purpose physics engine that has achieved good results and wide application in robotics. It can quickly and accurately simulate the interaction between articulated structures and the environment. Multiple cameras are set up in the simulation environment to obtain the grasping position and teaching trajectory, including a camera on the robot hand and a camera fixed behind the robot.
5.1.2. Robot
We use a UR3, a 6-DOF robot with a two-fingered gripper. In the simulator, we implement joint control of the robot with a designed PD controller. In addition, controllers transform the high-level actions into low-level virtual motor commands that actuate the robot through its inverse kinematics. The URDF file of the robot is used to import its joint information, and the required end-effector position is used to compute the IK solution of the robot.
5.1.3. Assembly objects
As shown in Fig. 6, the gear assembly task consists of two main parts: the gear and the shaft. In this paper, two different gears are used, each mounted on a different shaft. The outer diameters of the gears are 40 and 30 mm, respectively, and both inner diameters are 10 mm. The diameters of the shafts are 8 mm.
5.2. Visual human demonstration
In this experiment, a COMATRIX camera was used to generate grayscale images, depth images, and point clouds of the assembly target at the camera’s acquisition rate (1–2 Hz). We collected 120 grayscale images and depth images. The data we collect from the camera during the human demonstration are as follows:
1) Grayscale image: 2448 × 2048 pixels (as shown in Fig. 7, first row)
2) Depth image: 2448 × 2048 pixels (as shown in Fig. 7, second row)
3) Point cloud: 1,628,584 points (as shown in Fig. 7, third row)
After processing the raw teaching data, we obtain the trajectory of the teaching actions in the camera coordinate system. With hand-eye calibration, the transformation matrix and translation vector from the camera to the robot base coordinate frame can be obtained, so that we can map the trajectory to the robot base coordinate system. The trajectory extraction method proposed in this paper has high accuracy and can obtain trajectories with human experience; its error is about 2 mm, which meets the requirements of the gear assembly task. We then command the robot to execute the extracted trajectory and compare it with the trajectory in the video. As shown in Fig. 8, the robot can accurately reproduce the action trajectory and successfully complete the assembly task taught by the human, which proves that our method can accurately extract trajectories with human experience. The demonstration trajectory also achieves a good average reward.
5.3. Performance of reinforcement learning
Trajectory planning for robots using RL involves training a policy that guides the robot’s actions over time. In this context, the trajectory represents a sequence of states and actions that the robot takes to accomplish a task. The policy maps states to actions, effectively determining the robot’s behavior. During the execution phase, the learned policy is applied to generate trajectories in real-world scenarios. The robot leverages its acquired knowledge to navigate and adapt to the environment, making decisions at each step based on the learned policy. This approach allows the robot to plan and execute trajectories that align with its learned understanding of optimal actions in different states, showcasing the adaptability and intelligence achieved through RL.
5.3.1. Training results
We test the generalization ability of our framework under different conditions and report the smoothed episodic reward, with the smoothing function $r_{t+1}'=0.4r_t' +0.6 r_{t+1}$. The training mean-reward curves of the PPO agent are shown in Fig. 9. We trained the policy for 700 episodes, each lasting 64 steps, under different conditions. Without demonstration and domain randomization, the agent converges to a policy that successfully completes the task after nearly 600 episodes. The mean reward per episode of the pretrained policy is about −1.3 at the start, and after retraining, the mean reward reaches its peak after nearly 200 episodes. The reward and average reward in both modes start at a low value and rise to a stable value. This indicates that the robot has acquired environmental knowledge through training and is able to adapt to the environment.
To test the generalization capability of the learned policies, we ask the agent to perform gear assembly tasks at new locations using the previously trained policies without any fine-tuning. The results show that the retrained model retains the experience obtained by pretraining and further optimizes the pretrained policy. As shown in Fig. 10, the policy after pretraining outperforms the policy trained without demonstration and the policy trained with demonstration only. The policy with demonstration allows stable execution of assembly actions along the shaft. Without the demonstration, the trained policy may approach the target position from the side of the shaft, which leads to failure of the assembly task.
5.3.2. Comparison experiment
We conducted contrast experiments comparing our method with the original methods. As shown in Table III, the proposed method has the shortest training time, the highest success rate, and the fewest execution steps. We selected three models trained by different methods: the model trained by the original PPO algorithm, the pretrained skill model, and the model obtained by retraining after pretraining. In simulation, one hundred assembly tasks are carried out for each model in the same environment. The initial state of each assembly task is randomized. If the assembly can be completed within 200 steps, it is judged to be successful. The success rate and the average number of steps are calculated, and the procedure is repeated ten times. The results are shown in Fig. 11. The average success rates and average numbers of steps for the three models tested are, respectively, 80% and 158 steps, 96% and 134 steps, and 97% and 90 steps. The policy trained with the proposed method has a significantly higher success rate and completes the assembly task in fewer steps.
We tested the policies obtained from the different PPO training strategies and extracted 10 trajectories for each strategy. We randomized the initial position of the assembly and set the target position to be the same (0.25 m, 0.0 m, and 0.06 m). The assembly trajectories are shown in Fig. 10, and the position changes in the three directions are shown in Fig. 12. The assembly policy trained with the method proposed in this paper generates shorter and smoother motion trajectories. As shown in Fig. 12, the original PPO training method is unable to complete the assembly task from some random initial positions: two out of the 10 samples failed to reach the target position and complete the assembly action. After adding pretraining with demonstration data, the success rate and assembly speed of the trained policy are significantly improved.
In addition to the original PPO algorithm, we also compared with another similar RL method, DDPGfD, which is an off-policy method. The trajectory generated by PPO is smoother than that generated by DDPGfD due to the design of the PPO loss function; the trajectory generated by DDPGfD may exhibit some jitter. As shown in Fig. 13, we compare the trajectory generated by the proposed method with the demonstration and with DDPGfD in three directions: the x-axis, y-axis, and z-axis. All generated policies perform the same task. The starting position of the assembly is approximately (0.3, 0.12, 0.3), and the target position is (0.25 m, 0.0 m, 0.06 m). The results show that the policy obtained through PPO with pretraining reaches the target position the fastest.
We also compare the changes in reward during the assembly process in Fig. 14. Clearly, our method enables faster assembly. It outperforms vendor solutions by large margins in terms of perturbation ranges while keeping high success rates. In total, we performed 100 trials of each method with random initial positions, which strongly suggests that our method is robust and reliable. Most failures occur because hard contact can cause changing dynamics. In the future, we plan to investigate ways to encourage a gentler insertion policy.
5.4. Sim-to-real transfer
In gear assembly tasks, the main sim-to-real gap comes from the position error of the assembly target and the motion error of the robot. In this paper, we develop an error estimator to reduce the sim-to-real gap. Five groups of 20 repeated experiments with and without error estimation were conducted; the results are shown in Table IV. With the error estimator, the success rate of sim-to-real transfer increases from 75% to 90%, as shown in Fig. 15. As shown in Fig. 16, we compare the generated trajectories with and without the error estimator. There is a clear gap between reality and simulation. The real target position is (0.25 m, 0.0 m, 0.06 m). Through error estimation, we can effectively correct the assembly trajectory in reality and improve the success rate of real assembly.
A snapshot of our experiment is shown in Fig. 8. We can see that our control method transfers the policy well from simulation to reality. The robot moves more slowly as it gets closer to the target. This is due to the setting of our reward function: the closer the robot is to the target, the smaller the variation of the reward, which helps to improve the safety of the assembly process.
6. Conclusion
In this paper, we have shown that our framework is a novel approach to solving the assembly task in both simulated and real-world environments. We improved the PPO algorithm and compared it with the original method and other similar RL methods. The comparison results show that the proposed method reaches the assembly target faster with a higher success rate. With the visual demonstration, the PPO algorithm can generalize a more reliable and human-like policy and reach an optimal policy faster. Since there is a difference between reality and simulation, we randomize the domain parameters in the MuJoCo simulator during PPO training. We also developed a general error monitoring method based on visual information to fill the sim-to-real gap and increase the generalization of our framework. In this paper, PPO with visual demonstration and the sim-to-real controller performs the gear assembly tasks well on the UR3 robot, with a higher success rate and fewer steps.
The approach contributes to overcoming the common limitations of data scarcity and high costs associated with real-world robot training. By combining the strengths of simulation-based learning, one-shot transfer policies, and the informative nature of visual demonstrations, the paper presents a pioneering solution to enhance the adaptability, efficiency, and performance of robotic assembly tasks. This research significantly advances the field by providing a comprehensive methodology that bridges the sim-to-real gap in robotic assembly through the incorporation of visual demonstrations and RL.
For future work, we will try to develop a more reliable and robust framework by incorporating force information, given its significance in assembly tasks. We will consider more assembly tasks and develop a more generalized framework for different assembly targets.
Author contributions
Ruihong Xiao helped in conceiving the study concept and design, acquiring the data, analyzing, interpreting the data, and drafting the manuscript. Yiming Jiang and Hui Zhang helped in analyzing and interpreting the data and drafting the manuscript. Chenguang Yang helped in conceiving the study concept and design, drafting and revising the manuscript, obtaining funding, and supervising the study. All authors have read and agreed to the published version of the manuscript.
Financial support
This work was supported in part by National Nature Science Foundation of China (NSFC) under Grant U20A20200 and Major Research Grant No. 92148204, in part by Guangdong Basic and Applied Basic Research Foundation under Grants 2019B1515120076 and 2020B1515120054, in part by Industrial Key Technologies R&D Program of Foshan under Grant 2020001006308 and Grant 2020001006496.
Competing interests
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential competing interest.
Ethical approval
None.
Supplementary material
The supplementary material for this article can be found at http://dx.doi.org/10.1017/S0263574724000092.