
Pre-training with non-expert human demonstration for deep reinforcement learning

Published online by Cambridge University Press:  26 July 2019

Gabriel V. de la Cruz Jr
Affiliation:
School of Electrical Engineering and Computer Science, Washington State University, Pullman, Washington 99164-2752, USA e-mails: [email protected], [email protected], [email protected]
Yunshu Du
Affiliation:
School of Electrical Engineering and Computer Science, Washington State University, Pullman, Washington 99164-2752, USA e-mails: [email protected], [email protected], [email protected]
Matthew E. Taylor
Affiliation:
School of Electrical Engineering and Computer Science, Washington State University, Pullman, Washington 99164-2752, USA e-mails: [email protected], [email protected], [email protected]

Abstract

Deep reinforcement learning (deep RL) has achieved superior performance in complex sequential tasks by using deep neural networks as function approximators to learn directly from raw input images. However, learning directly from raw images is data inefficient. The agent must learn feature representations of complex states in addition to learning a policy. As a result, deep RL typically suffers from slow learning speeds and often requires a prohibitively large amount of training time and data to reach reasonable performance, making it inapplicable to real-world settings where data are expensive. In this work, we improve data efficiency in deep RL by addressing one of the two learning goals, feature learning. We leverage supervised learning to pre-train on a small set of non-expert human demonstrations and empirically evaluate our approach using the asynchronous advantage actor-critic (A3C) algorithm in the Atari domain. Our results show significant improvements in learning speed, even when the provided demonstrations are noisy and of low quality.
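To make the pre-training idea concrete, the sketch below illustrates one way such supervised pre-training on demonstrations could look. It is not the authors' code: the network layout, the hypothetical `pretrain_on_demonstrations` helper, and the assumption that demonstrations are available as tensors of stacked 84x84 grayscale frames paired with the demonstrator's discrete actions are all illustrative choices. The idea is to train the network to predict the human's action from the observed state (behavioural-cloning style), then reuse the learned convolutional features to initialize the deep RL agent.

```python
# Minimal sketch (assumptions noted above) of supervised pre-training on
# non-expert human demonstrations, using PyTorch.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class AtariNet(nn.Module):
    """Small conv net over 4 stacked 84x84 grayscale frames (DQN/A3C-style)."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(32 * 9 * 9, 256), nn.ReLU())
        self.policy = nn.Linear(256, n_actions)  # action logits

    def forward(self, x):
        return self.policy(self.fc(self.conv(x)))


def pretrain_on_demonstrations(states, actions, n_actions, epochs=5):
    """Supervised pre-training: predict the demonstrator's action from the
    observed state with a cross-entropy loss over discrete Atari actions.

    states:  float tensor of shape (N, 4, 84, 84)
    actions: long tensor of shape (N,) with the demonstrator's action indices
    """
    net = AtariNet(n_actions)
    loader = DataLoader(TensorDataset(states, actions), batch_size=32, shuffle=True)
    opt = torch.optim.Adam(net.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for s, a in loader:
            opt.zero_grad()
            loss = loss_fn(net(s), a)
            loss.backward()
            opt.step()
    # The learned net.conv (and optionally net.fc) weights would then be
    # copied into the RL network before standard A3C training begins.
    return net
```

Because the demonstrations come from non-expert players, the cloned policy itself may be poor; the point of this sketch is only that the convolutional layers acquire useful state features, which the subsequent RL phase can refine.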

Type
Adaptive and Learning Agents
Copyright
© Cambridge University Press, 2019 

