Bowers et al. challenge the notion that deep neural networks (DNNs) are the best, or even a highly promising, model of human cognition, and recommend that future studies test specific psychological and neural phenomena and candidate hypotheses by independently manipulating factors.
We agree with Bowers et al. that overall predictive power is not sufficient for a good model, in particular when experiments lack diversity in the tested stimuli (Grootswagers & Robinson, Reference Grootswagers and Robinson2021). Nevertheless, prediction is a necessary condition and a good starting point. DNNs have the power to serve as a generic model, and at the same time they can be tested on a variety of cognitive/psychological phenomena to go beyond prediction and provide insight into the functioning of the system. Strikingly, and in contrast to how Bowers et al. characterize the literature, the first wave of studies comparing DNNs to human vision already included studies that went beyond mere prediction on generic stimulus sets. To take just one example, the study by Kubilius, Bracci, and Op de Beeck (Reference Kubilius, Bracci and Op de Beeck2016), which Bowers et al. characterize as a prediction-based experiment, tested a specific cognitive hypothesis (the role of nonaccidental properties; see Biederman, Reference Biederman1987) and independently manipulated shape and category similarity (Kubilius et al., Reference Kubilius, Bracci and Op de Beeck2016; see also Bracci & Op de Beeck, Reference Bracci and Op de Beeck2016; Zeman, Ritchie, Bracci, & Op de Beeck, Reference Zeman, Ritchie, Bracci and Op de Beeck2020). More generally, the goal of explanation over prediction is already central, as shown by recent work testing underlying mechanisms of object perception (Singer, Seeliger, Kietzmann, & Hebart, Reference Singer, Seeliger, Kietzmann and Hebart2022), category domains (Dobs, Martinez, Kell, & Kanwisher, Reference Dobs, Martinez, Kell and Kanwisher2022), or predictive coding (Ali, Ahmad, de Groot, van Gerven, & Kietzmann, Reference Ali, Ahmad, de Groot, van Gerven and Kietzmann2022), to mention just a few.
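The logic of independently manipulating factors such as shape and category can be sketched with a simple representational similarity analysis: construct one predictor dissimilarity matrix per factor from an orthogonal design, then correlate each predictor with the dissimilarity structure of a model's feature space. The design, "model features," and all numbers below are synthetic and purely illustrative, not taken from any of the studies cited above.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Orthogonal 2 (category) x 2 (shape) design, 2 stimuli per cell -> 8 stimuli
category = np.repeat([0, 1], 4)
shape = np.tile(np.repeat([0, 1], 2), 2)

# Synthetic "model features": driven by shape, not by category
features = np.column_stack([shape + 0.1 * rng.standard_normal(8)
                            for _ in range(20)])

def rdm(dissim):
    """Vector of pairwise dissimilarities over all stimulus pairs."""
    return np.array([dissim(i, j) for i, j in combinations(range(8), 2)])

model_rdm = rdm(lambda i, j: np.linalg.norm(features[i] - features[j]))
shape_rdm = rdm(lambda i, j: float(shape[i] != shape[j]))
cat_rdm = rdm(lambda i, j: float(category[i] != category[j]))

# Because the two factors are orthogonal in the design, their contributions
# to the model's representational geometry can be assessed independently.
r_shape = np.corrcoef(model_rdm, shape_rdm)[0, 1]
r_cat = np.corrcoef(model_rdm, cat_rdm)[0, 1]
print(f"shape: r={r_shape:.2f}  category: r={r_cat:.2f}")
```

With features built to track shape only, the shape predictor correlates strongly with the model RDM while the category predictor does not, which is the signature such designs are built to detect.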
The wealth of data that the community has gathered with DNNs in less than a decade illustrates the potential of this approach.
Bowers et al. provide many examples of failures of DNNs, while only in passing acknowledging some of the successes and progress. Many of the failures show that vanilla DNNs, as is true for all models, are not perfect and do not capture all aspects of brain processing. Revealing such limitations is generally considered essential for moving the field forward towards making DNN computations more human-like (Firestone, Reference Firestone2020), and is no reason to abandon these models as long as there is an obvious road ahead with them. Proposed extensions include the addition of optical limitations reminiscent of the human eye, which can make a network more robust to adversarial attacks (Elsayed et al., Reference Elsayed, Shankar, Cheung, Papernot, Kurakin, Goodfellow and Sohl-Dickstein2018), the implementation of intuitive physics (Piloto, Weinstein, Battaglia, & Botvinick, Reference Piloto, Weinstein, Battaglia and Botvinick2022), or considerations about the influence of visual system maturation and low visual acuity at birth (Avberšek, Zeman, & Op de Beeck, Reference Avberšek, Zeman and Op de Beeck2021; Jinsi, Henderson, & Tarr, Reference Jinsi, Henderson and Tarr2023).
It is difficult to reconcile the fundamental criticism that DNNs do not capture all psychological phenomena without further extensions with the proposal of Bowers et al. to switch to alternative strategies that are much more limited in the extent to which they capture the full complexity of information processing from input to output (e.g., Grossberg, Reference Grossberg1987; Hummel & Biederman, Reference Hummel and Biederman1992; McClelland, Rumelhart, & PDP Research Group, Reference McClelland and Rumelhart1986). These alternative models are very appealing but also narrower in scope. Consider, for example, the simplicity with which the well-known ALCOVE model explains categorization (Kruschke, Reference Kruschke1992), compared to the complex high-dimensional space that is the actual reality of the underlying representations (for a review, see Bracci & Op de Beeck, Reference Bracci and Op de Beeck2023). Note that we consider these alternatives an excellent way to obtain a conceptual understanding of a phenomenon; we ourselves very much build on top of this pioneering work using conceptually elegant models with few parameters (e.g., Ritchie & Op de Beeck, Reference Ritchie and Op de Beeck2019). Nevertheless, scientists should not stop there. If we did, we would be left with a wide range of niche solutions and no progress towards a generic model that can be applied across domains, or at least a path towards one. Luckily, this path looks very promising for DNNs, given the large community of relatively junior scientists ready to make progress (e.g., Doerig et al., Reference Doerig, Sommers, Seeliger, Richards, Ismael, Lindsay and Kietzmann2023; Naselaris et al., Reference Naselaris, Bassett, Fletcher, Kording, Kriegeskorte, Nienborg and Kay2018).
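To make concrete what such an elegant low-parameter model looks like, the following sketch implements the core of an ALCOVE-style exemplar model: attention-weighted city-block distances to stored exemplars, exponential generalization, and a softmax choice rule. The function name, parameter values, and toy stimuli are our own illustrative assumptions, not taken from Kruschke (1992).

```python
import numpy as np

def alcove_predict(stimulus, exemplars, labels, attention, c=1.0, phi=2.0):
    """Category-choice probabilities for one stimulus, ALCOVE-style.

    stimulus  : (n_dims,) feature vector
    exemplars : (n_exemplars, n_dims) stored training items
    labels    : (n_exemplars,) category index of each exemplar
    attention : (n_dims,) attention weight per feature dimension
    c, phi    : specificity and choice-rule gain parameters
    """
    # Attention-weighted city-block distance to each stored exemplar
    dists = np.abs(exemplars - stimulus) @ attention
    # Exemplar activation falls off exponentially with distance
    act = np.exp(-c * dists)
    # Summed evidence per category, mapped through a softmax choice rule
    n_cats = labels.max() + 1
    evidence = np.array([act[labels == k].sum() for k in range(n_cats)])
    expd = np.exp(phi * evidence)
    return expd / expd.sum()

# Two 2-D categories separated on dimension 0; attention spread equally
exemplars = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
labels = np.array([0, 0, 1, 1])
probs = alcove_predict(np.array([0.1, 0.5]), exemplars, labels,
                       attention=np.array([0.5, 0.5]))
print(probs)  # higher probability for category 0
```

A handful of interpretable parameters suffice here, which is precisely the conceptual appeal of such models and, at the same time, why they stop short of the high-dimensional representational spaces discussed above.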
The necessary modifications will move the needle in various directions, such as elaborations in terms of front-ends, architecture, learning and optimization rules, learning regime, level of neural detail (e.g., spiking networks), the addition of attentional and working memory processes, and potentially the interaction with symbolic processing. None of that will lead to the dismissal of DNNs.
We see the high capacity of DNNs as a feature, not a bug, and currently we are still on the part of the curve where higher capacity means better (Elmoznino & Bonner, Reference Elmoznino and Bonner2022). In contrast to the alternatives, DNNs confront us upfront with the complexity of human information processing because they have to work vis-à-vis an actual stimulus as input. This is not a mere fait divers; it is a necessary condition for the ideal model. DNNs and related artificial intelligence (AI) models seem able to stand up to this challenge, even to the point that in some domains they already predict empirical data about neural selectivity to real images better than professors in cognitive neuroscience do (Ratan Murty, Bashivan, Abate, DiCarlo, & Kanwisher, Reference Ratan Murty, Bashivan, Abate, DiCarlo and Kanwisher2021). The general applicability of these models, and the legacy of knowledge that has by now been obtained, provides a unique resource to test a wide variety of psychological and neural phenomena (e.g., Duyck, Bracci, & Op de Beeck, Reference Duyck, Bracci and Op de Beeck2022; Kanwisher, Gupta, & Dobs, Reference Kanwisher, Gupta and Dobs2023).
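Predictivity claims of this kind are typically evaluated with an encoding model: a linear map fitted from model features to measured responses and scored on held-out stimuli. Below is a minimal sketch of that logic with fully synthetic data; the dimensions, noise level, and regularization strength are arbitrary assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

n_train, n_test, n_feat = 80, 20, 50
X_train = rng.standard_normal((n_train, n_feat))  # model features, train set
X_test = rng.standard_normal((n_test, n_feat))    # model features, held out

# Synthetic "neural" responses: linear in the features plus noise
w_true = rng.standard_normal(n_feat)
y_train = X_train @ w_true + 0.5 * rng.standard_normal(n_train)
y_test = X_test @ w_true + 0.5 * rng.standard_normal(n_test)

# Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y
lam = 1.0
w = np.linalg.solve(X_train.T @ X_train + lam * np.eye(n_feat),
                    X_train.T @ y_train)

# Predictivity = correlation of predictions with held-out responses
r = np.corrcoef(X_test @ w, y_test)[0, 1]
print(f"held-out predictivity r = {r:.2f}")
```

The held-out correlation is the quantity on which models (or, in the study cited above, human experts) can be compared.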
The way forward is to build better models, including DNN-based models that take the complexity of human vision and cognition seriously (Bracci & Op de Beeck, Reference Bracci and Op de Beeck2023). As has been true since the very early days of AI, we need continuous interaction and exchange between disciplines and their expertise at all levels (cognitive and computational psychologists, computer vision scientists, philosophers of mind, neuroscientists) to bring us towards the common goal of a human-like AI that we understand mechanistically. Solving the deep problem of understanding biological vision will not be achieved by too easily dismissing DNNs and missing out on their potential.
Financial support
H. O. B. is supported by FWO research project G073122N and KU Leuven project IDN/21/010.
Competing interest
None.