
A multi-modal learning method for pick-and-place task based on human demonstration

Published online by Cambridge University Press:  12 December 2024

Diqing Yu
Affiliation:
Zhejiang University of Technology, Hangzhou, China
Xinggang Fan
Affiliation:
Zhejiang University of Technology, Hangzhou, China Shenzhen Academy of Robotics, Shenzhen, China
Yaonan Li*
Affiliation:
Shenzhen Academy of Robotics, Shenzhen, China
Heping Chen
Affiliation:
Shenzhen Academy of Robotics, Shenzhen, China
Han Li
Affiliation:
Zhejiang University of Technology, Hangzhou, China
Yuao Jin
Affiliation:
Shenzhen Academy of Robotics, Shenzhen, China
* Corresponding author: Yaonan Li; Email: [email protected]

Abstract

Robot pick-and-place of unknown objects remains a very challenging research topic. This paper proposes a multi-modal learning method for one-shot robot imitation of pick-and-place tasks. The method aims to enhance the generality of industrial robots while reducing the amount of data and the training cost that one-shot imitation methods typically require. It first categorizes human demonstration videos into tasks, which are grouped into six types chosen to symbolize as many kinds of pick-and-place tasks as possible. It then generates multi-modal prompts and finally predicts the robot's actions, completing the symbolized pick-and-place task in industrial production. A carefully curated dataset complements the method; it consists of human demonstration videos and instance images focused on real-world scenes and industrial tasks, which fosters adaptable and efficient learning. Experimental results demonstrate favorable success rates and loss values in both simulation environments and real-world experiments, confirming the method's effectiveness and practicality.
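The abstract describes a three-stage pipeline (task classification from a human demonstration video, multi-modal prompt generation, robot action prediction). The following is a minimal Python sketch of that flow under stated assumptions: the function names, the prompt structure, and the six task labels are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch of the pipeline outlined in the abstract (hypothetical names).
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed symbolic task types; the paper specifies six but does not list them here.
TASK_TYPES = ["place_on", "stack", "sort", "rotate_place", "insert", "rearrange"]


@dataclass
class MultiModalPrompt:
    task_type: str                       # one of TASK_TYPES
    instruction: str                     # text component of the prompt
    instance_images: List[str] = field(default_factory=list)  # image file paths


def classify_demonstration(video_path: str) -> str:
    """Stub: a video model would map the human demonstration to a task type."""
    return TASK_TYPES[0]


def build_prompt(task_type: str, instance_images: List[str]) -> MultiModalPrompt:
    """Stub: combine the task symbol, text, and instance images into one prompt."""
    instruction = f"Perform a '{task_type}' pick-and-place with the shown objects."
    return MultiModalPrompt(task_type, instruction, instance_images)


def predict_actions(prompt: MultiModalPrompt) -> List[Dict[str, float]]:
    """Stub: an action model would output pick and place poses for the robot."""
    return [{"x": 0.30, "y": 0.10, "z": 0.05}, {"x": 0.50, "y": -0.20, "z": 0.05}]


if __name__ == "__main__":
    prompt = build_prompt(classify_demonstration("demo.mp4"), ["obj_01.png"])
    print(predict_actions(prompt))
```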

Type
Research Article
Copyright
© The Author(s), 2024. Published by Cambridge University Press
