In this commentary, my aim is to develop four separate objections to arguments in the target article. These concern (i) interfaces between representational formats, (ii) how to interpret the P600 ERP signature, (iii) the relation between deep neural network (DNN) models and innateness, and (iv) the significance of performance measures in evaluating DNNs.
Let's begin with what the authors call the “interface problem.” They argue that “if cognition is largely LoT-like, and perception feeds information to cognition, then we should expect at least some elements of perception to be LoT-like, because the two systems need to interface” (target article, sect. 4, para. 2). This claim, though common, is puzzling. If DNN models have demonstrated anything, it's that virtually any representational format can be transformed into any other, given suitable training. Names can be mapped to faces, spatial arrays to numerical quantities, letters to phonemes, intentions to motoric instructions, and so on. Moreover, DNNs routinely do this in ways that appear not to be sensitive to any syntactic properties of the interfacing representations (Arbib, Reference Arbib2003). This strongly suggests that two interfacing systems need not have much, if anything, in common with one another, regardless of whether either of them is language-of-thought (LoT)-like.
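To make the format-indifference point concrete, here is a minimal sketch (my own construction, not a model from the target article; the dimensions, data, and architecture are arbitrary stand-ins). A small feedforward network learns a pairing between two structurally unrelated formats: binary "name" codes and dense real-valued "face" vectors. The training loop consults only input/output pairs, never the internal syntax of either format.

```python
# Minimal illustrative sketch (assumption: the formats, sizes, and task
# are placeholders of my own choosing). A small MLP learns an arbitrary
# pairing between two unrelated representational formats. Nothing in
# the training procedure inspects the structure of either format.
import torch
import torch.nn as nn

torch.manual_seed(0)

N_PAIRS, NAME_DIM, FACE_DIM = 64, 16, 32

# Arbitrary formats: binary codes in, dense vectors out.
names = torch.randint(0, 2, (N_PAIRS, NAME_DIM)).float()
faces = torch.randn(N_PAIRS, FACE_DIM)

net = nn.Sequential(
    nn.Linear(NAME_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, FACE_DIM),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(net(names), faces)
    loss.backward()
    opt.step()

print(f"final reconstruction loss: {loss.item():.4f}")  # should fall toward zero
```

The same loop would learn letter-to-phoneme or array-to-quantity pairings unchanged, which is the sense in which the learned mapping is indifferent to representational format.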
If that's correct, then it raises the larger question of what the interface problem was ever supposed to be. The issue has been heavily studied in the theory of action, but, tellingly, the prominent solutions in this area often multiply the number of representational formats, introducing a new kind of demonstrative concept (Butterfill & Sinigaglia, Reference Butterfill and Sinigaglia2014) or a “motor schema” that mediates between intentions and low-level motoric instructions (Mylopoulos & Pacherie, Reference Mylopoulos and Pacherie2017). As Christensen (Reference Christensen2021) in effect points out, the question of how such representations are mapped into one another is not going to be answered by reference to their representational format(s). The substantive questions are how such mappings arise and what happens on the occasions when they fail. Plausible answers to these questions will likely appeal to learning, innate endowment, and low-level neurocognitive mechanisms, but not to format.
Turn now to some of the neurocognitive evidence that the authors marshal. They argue that “structured relations in scene grammar display curious hallmarks of language-like formats. For instance, the P600 ERP increases for syntactic violations in language, and also increases for stimuli that violate visual scene ‘syntax’” (target article, sect. 4.2.2, para. 4). However, the P600 has a variety of interpretations, and not all of these fit neatly with the authors' reasoning. For instance, the P600 may be a trigger for conscious reevaluation of a stimulus that has caused processing issues, regardless of whether the underlying representational system is LoT-like (Batterink & Neville, Reference Batterink and Neville2013; van Gaal et al., Reference van Gaal, Naccache, Meuwese, van Loon, Leighton, Cohen and Dehaene2014). More importantly, the P600 has been shown to reflect incongruence or discordance in tonal music (Featherstone, Morrison, Waterman, & MacGregor, Reference Featherstone, Morrison, Waterman and MacGregor2013), demonstrating that a representational format – in this case, that of musical cognition – need not involve predication, logical operations, or automatic inferential promiscuity in order to induce a P600 response.
The authors might reply that musical cognition is demonstrably sensitive to recursive structure and that it exhibits filler-role relations (Lerdahl & Jackendoff, Reference Lerdahl and Jackendoff1983). The representational format involved is, thus, arguably LoT-like. But this response raises the deeper question of what it would take to falsify their main proposal. If musical cognition meets only half of the criteria that they take to be indicative of an LoT-like format, does this constitute a refutation of their hypothesis that such criteria naturally cluster together? If not, then what would?
Let's now consider issues in animal cognition. The authors argue that the paucity of relevant input to a newborn chick's visual system prior to an experiment “points away from DNN-based explanations of abstract object representations” (target article, sect. 5.1, para. 5). The idea seems to be that, if a representational capacity is innate, rather than acquired through some type of learning, then DNN models of this capacity are superfluous. But this argument runs together two separate issues – LoT versus DNN, on the one hand, and learning versus innateness, on the other. Although proponents of DNN modeling do tend to lean empiricist, this sociological fact can be misleading. In actuality, fans of DNN-style representational formats need have no commitment whatsoever on the issue of innateness. It could well be that a chick, or any other critter, inherits a “frozen” pretrained DNN-style representational system as part of its genetic endowment. Presumably, in the real world, such a system would have been “trained” into its innate structure over the course of the creature's evolutionary past – a process akin to selecting a particularly successful DNN out of several and then using it as a “seed” for training a new cohort of variants.
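To fix ideas, here is a toy sketch of that evolutionary "seed" process (again my own construction; the task, fitness function, and mutation scheme are illustrative placeholders, not anything proposed in the target article). Each generation, the fittest network is copied and perturbed to produce a new cohort; what the organism would inherit is the final frozen set of weights, with no lifetime learning involved.

```python
# Toy sketch of the "evolutionary pretraining" idea floated above (an
# assumption of mine, not a model from the target article): the fittest
# network in each generation seeds a new cohort of mutated variants.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

def fitness(net, x, y):
    """Negative MSE on a fixed 'ancestral' task (higher is better)."""
    with torch.no_grad():
        return -nn.functional.mse_loss(net(x), y).item()

def mutate(net, sigma=0.05):
    """Return a perturbed copy: Gaussian noise added to every weight."""
    child = copy.deepcopy(net)
    with torch.no_grad():
        for p in child.parameters():
            p.add_(sigma * torch.randn_like(p))
    return child

# A fixed task standing in for selection pressure in the ancestral niche.
x = torch.randn(128, 8)
y = torch.randn(128, 4)

seed = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

for generation in range(50):
    cohort = [seed] + [mutate(seed) for _ in range(20)]
    seed = max(cohort, key=lambda net: fitness(net, x, y))

# 'seed' is now the frozen, "pretrained" system an organism might inherit.
print(f"best fitness after selection: {fitness(seed, x, y):.4f}")
```

Nothing in this process involves learning by the individual organism; the selection loop does all the work, which is why the chick's impoverished pre-experimental input is no embarrassment to DNN-based explanations.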
Before closing, let me draw attention to the authors' use of performance data in evaluating deep convolutional neural network (DCNN) models. On the one hand, they argue that “divergence between DCNN and human performance echoes independent evidence that DCNNs fail to encode human-like transformation-invariant object representations” (target article, sect. 5.1, para. 6). On the other, they are steadfastly committed to a competence/performance distinction, which renders the evidence that they cite questionable. As Firestone (Reference Firestone2020) points out, performance measures are often unreliable guides in assessing the psychological plausibility of a DCNN, whether in vision or in any other domain. In psycholinguistics, performance has long ceased to be a reliable sign of human competence (Pereplyotchik, Reference Pereplyotchik2017), and computational linguists disagree about what performance measures to use (e.g., Sellam et al., Reference Sellam, Yadlowsky, Wei, Saphra, D'Amour, Linzen and Pavlick2022), even for models that make no claim to psychological plausibility. Thus, in order to make their case for the inadequacy of DCNN models – again, in vision or any other domain – the authors would need to cite evidence that evaluates the competence of such models. How to do this is, at present, far from a settled matter, so the performance measures they rely on are almost certain to be equivocal.
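To illustrate why such measures can be equivocal, consider the following toy demonstration (my own, on artificial data; it is not a reanalysis of any cited result). Two performance measures, in-distribution accuracy and accuracy under a simple transformation of the inputs, rank the same pair of models in opposite orders.

```python
# Toy demonstration (my construction, artificial data; not a reanalysis
# of any cited study): two performance measures rank the same pair of
# "models" in opposite orders.
import numpy as np

rng = np.random.default_rng(0)

def label(x):
    # Ground-truth rule: class is the sign of the mean of the inputs.
    return (x.mean(axis=1) > 0).astype(int)

def model_a(x):
    # Relies on a single "shortcut" feature (dimension 0).
    return (x[:, 0] > 0).astype(int)

def model_b(x):
    # Tracks the true rule, but noisily.
    return (x.mean(axis=1) + rng.normal(0, 0.2, len(x)) > 0).astype(int)

def accuracy(model, x):
    return (model(x) == label(x)).mean()

# In-distribution test set: the shortcut feature tracks the true rule.
x_iid = rng.normal(0, 1, (5000, 10))
x_iid[:, 0] = x_iid.mean(axis=1) + rng.normal(0, 0.1, 5000)

# Transformed test set: the shortcut feature is scrambled across items.
x_tr = x_iid.copy()
x_tr[:, 0] = x_tr[rng.permutation(5000), 0]

for name, model in [("A (shortcut)", model_a), ("B (true rule)", model_b)]:
    print(name,
          f"in-distribution = {accuracy(model, x_iid):.2f},",
          f"transformed = {accuracy(model, x_tr):.2f}")
```

On the first measure, the shortcut model looks superior; on the second, the ordering reverses. Which measure tracks the model's competence is precisely the unsettled question.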
In summary, a representative sample of the arguments in the target article simply fails. The interface problem provides no warrant for positing similarities between representational formats, and the evidence from neurocognitive, animal, and behavioral studies is inconclusive at best. It is, moreover, unclear whether the authors' central hypothesis is falsifiable.
Acknowledgments
My thanks to Jacob Berger, Daniel Harris, and Myrto Mylopoulos for helpful discussion.
Competing interest
None.