A multi-modal learning method for pick-and-place task based on human demonstration

Diqing Yu; Xinggang Fan; Yaonan Li; Heping Chen; Han Li; Yuao Jin

doi:10.1017/S0263574724001322

A multi-modal learning method for pick-and-place task based on human demonstration

Published online by Cambridge University Press: 12 December 2024

Yaonan Li ,

and

Diqing Yu: Affiliation:
Zhejiang University of Technology, Hangzhou, China
Xinggang Fan: Affiliation:
Zhejiang University of Technology, Hangzhou, China Shenzhen Academy of Robotics, Shenzhen, China
Yaonan Li*: Affiliation:
Shenzhen Academy of Robotics, Shenzhen, China
Heping Chen: Affiliation:
Shenzhen Academy of Robotics, Shenzhen, China
Han Li: Affiliation:
Zhejiang University of Technology, Hangzhou, China
Yuao Jin: Affiliation:
Shenzhen Academy of Robotics, Shenzhen, China
*: Corresponding author: Yaonan Li; Email: [email protected]

Article contents

Abstract
References

Get access

Rights & Permissions

Abstract

Robot pick-and-place for unknown objects is still a very challenging research topic. This paper proposes a multi-modal learning method for robot one-shot imitation of pick-and-place tasks. This method aims to enhance the generality of industrial robots while reducing the amount of data and training costs the one-shot imitation method relies on. The method first categorizes human demonstration videos into different tasks, and these tasks are classified into six types to symbolize as many types of pick-and-place tasks as possible. Second, the method generates multi-modal prompts and finally predicts the action of the robot and completes the symbolic pick-and-place task in industrial production. A carefully curated dataset is created to complement the method. The dataset consists of human demonstration videos and instance images focused on real-world scenes and industrial tasks, which fosters adaptable and efficient learning. Experimental results demonstrate favorable success rates and loss results both in simulation environments and real-world experiments, confirming its effectiveness and practicality.

Keywords

Multi-modal robot agent one-shot imitation task learning action prediction real-world demonstration dataset

Type: Research Article
Information: Robotica , Volume 42 , Issue 10 , October 2024 , pp. 3302 - 3323

DOI: https://doi.org/10.1017/S0263574724001322 [Opens in a new window]
Copyright: © The Author(s), 2024. Published by Cambridge University Press

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Article purchase

Temporarily unavailable

References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. and Polosukhin, I., “Attention is all you need,” Adv. Neur. Inf. Process. Syst. 30, 5998–6008 (2017).Google Scholar

Katz, D. M., Bommarito, M. J., Gao, S. and Arredondo, P., “GPT-4 passes the bar exam,” Philos. Trans. R. Soc. A: Math. Phys. Eng. Sci. 382(2270), 20230254 (2024).CrossRef Google Scholar PubMed

Waisberg, E., Ong, J., Masalkhi, M., Kamran, S. A., Zaman, N., Sarker, P., Lee, A. G. and Tavakkoli, A., “GPT-4: A new era of artificial intelligence in medicine,” Irish J. Med. Sci. (1971-) 192(6), 3197–3200 (2023).CrossRef Google Scholar PubMed

Wang, S., Zhou, Z., Li, B., Li, Z. and Kan, Z., “Multi-modal interaction with transformers: Bridging robots and human with natural language,” Robotica 42(2), 415–434 (2024).CrossRef Google Scholar

Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., A. Phanishayee and M. Zaharia, “Efficient Large-scale Language Model Training on GPU Clusters Using Megatron-LM,” In: Proceedings of the International Conference for High-Performance Computing, Networking, Storage and Analysis (2021) pp. 1–15.Google Scholar

Yuan, S., Zhao, H., Du, Z., Ding, M., Liu, X., Cen, Y., Zou, X., Yang, Z. and Tang, J., “Wudaocorpora: A super large-scale Chinese corpora for pre-training language models,” AI Open 2, 65–68 (2021).CrossRef Google Scholar

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E, Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov and N. Fiedel, “Palm: Scaling language modeling with pathways,” J. Mach. Learn. Res. 24(240), 1–113 (2023).Google Scholar

Kodagoda, S., Sehestedt, S. and Dissanayake, G., “Socially aware path planning for mobile robots,” Robotica 34(3), 513–526 (2016).CrossRef Google Scholar

Feng, Z., Xue, B., Wang, C. and Zhou, F., “Safe and socially compliant robot navigation in crowds with fast-moving pedestrians via deep reinforcement learning,” Robotica 42(4), 1212–1230 (2024).CrossRef Google Scholar

Park, E. and Lee, J., “I am a warm robot: The effects of temperature in physical human–robot interaction,” Robotica 32(1), 133–142 (2014).CrossRef Google Scholar

Kansal, S. and Mukherjee, S., “Vision-based kinematic analysis of the Delta robot for object catching,” Robotica 40(6), 2010–2030 (2022).CrossRef Google Scholar

Zubair, M., Kansal, S. and Mukherjee, S., “Vision-based pose estimation of craniocervical region: Experimental setup and saw bone-based study,” Robotica 40(6), 2031–2046 (2022).CrossRef Google Scholar

Jana, S., Tony, L. A., Bhise, A. A., V., V. P. and Ghose, D., “Interception of an aerial manoeuvring target using monocular vision,” Robotica 40(12), 4535–4554 (2022).CrossRef Google Scholar

Marwan, Q. M., Chua, S. C. and Kwek, L. C., “Comprehensive review on reaching and grasping of objects in Robotics,” Robotica 39(10), 1849–1882 (2021).CrossRef Google Scholar

Gujran, S. S. and Jung, M. M., “multi-modal prompts effectively elicit robot-initiated social touch interactions,” In: Companion Publication of the 25th International Conference on multi-modal Interaction (2023) pp. 159–163.Google Scholar

Lin, V., Yeh, H., Huang, H. and Chen, N., “Enhancing EFL vocabulary learning with multi-modal cues supported by an educational robot and an IoT-based 3D book,” System 104, 102691 (2022).CrossRef Google Scholar

Lee, Y., Tsai, Y., Chiu, W. and Lee, C., “Multi-Modal Prompting With Missing Modalities for Visual Recognition,” In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023) pp. 14943–14952.Google Scholar

Kaindl, H., Falb, J. and Bogdan, C., “multi-modal Communication Involving Movements of a Robot,” In: CHI’08 Extended Abstracts on Human Factors in Computing Systems. Association for Computing Machinery (2008) pp. 3213–3218.Google Scholar

Danny, D., F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch and P. Florence, “PaLM-E: An embodied multimodal language model,” Int. Conf. Mach. Learn. PMLR, 8469–8488 (2023).Google Scholar

Lynch, C., Wahid, A., Tompson, J., Ding, T., Betker, J., Baruch, R., Armstrong, T. and Florence, P., “Interactive language: Talking to robots in real time,” IEEE Robot. Autom. Lett. 8, 1659–1666 (2023).Google Scholar

Duan, Y., Andrychowicz, M., Stadie, B., Ho, J., Schneider, J., Sutskever, I., Abbeel, P. and Zaremba, W., “One-shot imitation learning,” Adv. Neur. Inf. process. syst. 30, 1087–1098 (2017).Google Scholar

Bonardi, A., James, S. and Davison, A. J., “Learning one-shot imitation from humans without humans,” IEEE Robot. Autom. Lett. 5(2), 3533–3539 (2020).CrossRef Google Scholar

Huang, D. A., Xu, D., Zhu, Y., Garg, A., Savarese, S., Fei-Fei, L. and Niebles, J. C., “Continuous Relaxation of Symbolic Planner for One-Shot Imitation Learning,” In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2019) pp. 2635–2642.Google Scholar

Mandi, Z., Liu, F., Lee, K. and Abbeel, P., “Towards More Generalizable One-Shot Visual Imitation Learning,” In: International Conference on Robotics and Automation (ICRA) (2022) pp. 2434–2444.Google Scholar

Rahmatizadeh, R., Abolghasemi, P., Bölöni, L. and Levine, S.. ”Vision-Based Multi-Task Manipulation for Inexpensive Robots Using End-to-End Learning from Demonstration,” In: 2018 IEEE International Conference on Robotics and Automation (ICRA) (2018) pp. 3758–3765.Google Scholar

Tamizi, M. G., Honari, H., Nozdryn-Plotnicki, A. and Najjaran, H., “End-to-end deep learning-based framework for path planning and collision checking: Bin-picking application,” Robotica 42(4), 1094–1112 (2024).CrossRef Google Scholar

Post, M. A., “Probabilistic robotic logic programming with hybrid Boolean and Bayesian inference,” Robotica 42(1), 40–71 (2024).CrossRef Google Scholar

Finn, C., Yu, T., Zhang, T., Abbeel, P. and Levine, S., “One-Shot Visual Imitation Learning Via Meta-Learning,” In: Conference on Robot Learning (2017) pp. 357–368.Google Scholar

Jiang, Y., A. Gupta and Z. Zhang, “Vima: General Robot Manipulation with multi-modal Prompts,” In: NeurIPS. 2022 Foundation Models for Decision Making Workshop (2022).Google Scholar

Nguyen, A., Nguyen, N., Tran, K., Tjiputra, E. and Tran, Q. D., “Autonomous Navigation in Complex Environments with Deep Multi-Modal Fusion Network,” In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) ( 2020) pp. 5824–5830.CrossRef Google Scholar

Noda, K., Arie, H., Suga, Y. and Ogata, T., “multimodal integration learning of robot behavior using deep neural networks,” Robot. Auton. Syst. 62(6), 721–736 (2014).CrossRef Google Scholar

Xue, T., Wang, W., Ma, J., Liu, W., Pan, Z. and Han, M., “Progress and prospects of multi-modal fusion methods in physical human–robot interaction: A review,” IEEE Sens. J. 20(18), 10355–10370 (2020).CrossRef Google Scholar

Ravichandar, H., Polydoros, A. S., Chernova, S. and Billard, S., “Recent advances in robot learning from demonstration,” Annu. Rev. Control Robot. Auton. Syst. 3(1), 297–330 (2020).CrossRef Google Scholar

Nicolescu, M. N. and Mataric, M. J., “Natural Methods for Robot Task Learning: Instructive Demonstrations, Generalization and Practice,” In: Proceedings of the second international joint conference on Autonomous agents and multiagent systems (2003) pp. 241–248.Google Scholar

Dasari, S. and Gupta, A., “Transformers for One-Shot Visual Imitation,” In: Conference on Robot Learning. PMLR (2021) pp. 2071–2084.Google Scholar

Bertasius, G., Wang, H. and Torresani, L., “Is space-time attention all you need for video understanding?,” Int. Conf. Mach. Learn. 2(3), 4 (2021).Google Scholar

Soomro, K., Zamir, A. R. and Shah, M., “UCF101: A dataset of 101 human actions classes from videos in the wild (2012). arXiv preprint arXiv: 1212.0402.Google Scholar

Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C. L., “Microsoft COCO: Common Objects in Context,” In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland (2014) pp. 740–755. Proceedings, Part V 13.Google Scholar

Dosovitskiy, A., L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale[J] (2020). arXiv preprint arXiv: 2010.11929.Google Scholar

Carpin, S., Lewis, M., Wang, J., Balakirsky, J. and Scrapper, C., “Bridging the Gap Between Simulation and Reality in Urban Search and Rescue,” In: RoboCup 2006: Robot Soccer World Cup X. 10 (2007) pp. 1–12.Google Scholar

Hua, J., Zeng, L., Li, G. and Ju, Z., “Learning for a robot: Deep reinforcement learning, imitation learning, transfer learning,” Sensor, 21(4), 1278 (2021).CrossRef Google Scholar PubMed

Singh, B., Kumar, R. and Singh, V. P., “Reinforcement learning in robotic applications: A comprehensive survey,” Artif. Intell. Rev. 55(2), 945–990 (2022).CrossRef Google Scholar

Talaat, F. M. and ZainEldin, H., “An improved fire detection approach based on YOLO-v8 for smart cities,” Neural Comput. Appl. 35(4), 20939–20954 (2023).CrossRef Google Scholar

Article contents

A multi-modal learning method for pick-and-place task based on human demonstration

Abstract

Keywords

Access options

Article purchase

Temporarily unavailable

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests