
22 - Computational Cognitive Models of Reinforcement Learning

from Part III - Computational Modeling of Basic Cognitive Functionalities

Published online by Cambridge University Press: 21 April 2023

Ron Sun
Affiliation: Rensselaer Polytechnic Institute, New York

Summary

This chapter first reviews advanced methods in reinforcement learning (RL), namely, hierarchical RL, distributional RL, meta-RL, RL as inference, inverse RL, and multi-agent RL. Computational and cognitive models based on reinforcement learning are then presented, including detailed models of the basal ganglia, the variety of dopamine neuron responses, the roles of serotonin and other neuromodulators, intrinsic reward and motivation, neuroeconomics, and computational psychiatry.
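
At the core of the dopamine models the summary mentions is the temporal-difference (TD) error, the quantity dopamine neurons are proposed to signal. As a minimal illustration, the sketch below runs TD value learning on a toy chain environment; the environment and all parameters are illustrative assumptions, not the chapter's own model.

```python
# Minimal sketch of temporal-difference (TD) learning on a toy five-state
# chain with a reward at the end. The TD error (delta) is the quantity that
# reward-prediction-error models of dopamine center on. The environment and
# parameters here are illustrative assumptions, not taken from the chapter.

n_states = 5             # states 0..4; state 4 is terminal and rewarded
alpha, gamma = 0.1, 0.9  # learning rate and discount factor
V = [0.0] * n_states     # state-value estimates, initialized to zero

for episode in range(500):
    s = 0
    while s < n_states - 1:
        s_next = s + 1                              # deterministic step toward the goal
        r = 1.0 if s_next == n_states - 1 else 0.0  # reward only at the end
        delta = r + gamma * V[s_next] - V[s]        # TD (reward prediction) error
        V[s] += alpha * delta                       # value update
        s = s_next

# Values converge toward gamma**(steps remaining to the reward):
print([round(v, 2) for v in V])  # approximately [0.73, 0.81, 0.9, 1.0, 0.0]
```

Distributional RL, also reviewed in the chapter, generalizes this scheme by learning a distribution over returns rather than the single scalar estimate V used here.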

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2023

