The target article offers cogent criticisms of deep neural networks (DNNs) as models of human cognition. Although discriminative DNNs are currently dominant in cognitive modeling, other approaches are needed if we are to achieve a satisfactory understanding of human cognition. We suggest that generative models are a promising avenue for future work, particularly capacity-limited, generative models designed around componential representations (e.g., part-based representations of visual objects and scenes).
A generative model learns a joint distribution over visible (i.e., observed) and hidden (i.e., latent) variables. Importantly, many generative models allow us to sample from the learned distribution, producing “synthetic” examples of the concept the distribution models. Using a generative model to make inferences about external stimuli is a matter of identifying the settings of its hidden variables most likely to have produced those stimuli. By their very nature, generative models neatly sidestep many of the issues with discriminative models described in the target article.
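In standard notation (ours, added for exposition; the commentary is otherwise informal), a generative model factorizes the joint distribution of observations $x$ and latent variables $z$ as
$$p(x, z) = p(z)\,p(x \mid z),$$
so that sampling $z \sim p(z)$ and then $x \sim p(x \mid z)$ yields the “synthetic” examples mentioned above, and inference about a stimulus $x$ amounts to computing (or approximating) the posterior
$$p(z \mid x) = \frac{p(x \mid z)\,p(z)}{\int p(x \mid z')\,p(z')\,dz'}.$$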
Most obviously, and perhaps most importantly, generative models are typically judged not on predictive performance but on their ability to synthesize examples of concepts. Synthesis requires a more profound understanding of a concept than mere discrimination does, and may therefore yield task-general representations capable of explaining far more of human perceptual and cognitive reasoning. For example, discriminative models trained to categorize images tend to base their decisions on texture patches and local shape, whereas humans rely on global shape. A successful generative model, by contrast, must capture global object shape, as its samples would otherwise be unrealistic. Inference in such a model would therefore be sensitive to object shape as a matter of course, along with a number of other properties that a discriminatively trained model might ignore.
Another important feature of human cognition not captured by large DNNs is capacity limits. People cannot remember all aspects of a visual environment, and so human vision needs to be selective and efficient. By contrast, DNNs often contain billions of adaptable parameters, providing them with enormous learning, representational, and processing capacities. This seemingly unlimited capacity stands in stark contrast to the dramatically limited capacity of biological vision, as noted in the target article. The need for efficiency underlies people's attentional and memory biases. People are biased toward “filling in” missing features (i.e., features not attended or remembered) with values that are highly frequent in the environment. In addition, people are biased toward attending to and remembering those features that are most relevant to their current goal, thereby maximizing task performance.
Bates, Lerch, Sims, and Jacobs (2019) experimentally evaluated these biases using an optimal model of capacity-limited visual working memory (VWM) based on “rate-distortion theory” (RDT; see Sims, Jacobs, & Knill, 2012). Both biases were predicted by the RDT model: an optimal VWM should allocate its limited memory resources to high-probability feature values and to task-relevant features. Bates and Jacobs (2021) studied people's responses in the domain of visual search and attention. The RDT model predicted important aspects of these responses, including “set-size” effects indicative of limited capacity, aspects not accounted for by a model based on Bayesian decision theory.
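For concreteness, the core RDT problem can be stated as follows (our gloss of the general theory, not the specific formulation used in the papers cited above): choose an encoding $p(\hat{x} \mid x)$ of stimuli $x$ into memory representations $\hat{x}$ that minimizes expected distortion subject to a capacity constraint,
$$\min_{p(\hat{x} \mid x)} \ \mathbb{E}\!\left[d(x, \hat{x})\right] \quad \text{subject to} \quad I(X; \hat{X}) \le C,$$
where $d$ is a task-dependent distortion (error) measure, $I(X; \hat{X})$ is the mutual information between stimuli and representations (the “rate”), and $C$ is the capacity limit. Biases toward frequent feature values and toward task-relevant features both fall out of this optimization.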
In accord with these ideas, a popular form of generative model, the “variational autoencoder” (VAE), uses a loss function during training that penalizes excessive representational capacity. A VAE maps an input through one or more hidden layers, with capacity penalized at one of these layers, to an output layer that attempts to reconstruct the input. Reconstructions are typically imperfect because of the “lossy” representations at the capacity-restricted “bottleneck” hidden layer. Machine learning researchers have shown important mathematical relationships between VAEs and RDT (Alemi et al., 2017, 2018; Ballé, Laparra, & Simoncelli, 2016; Burgess et al., 2018). Bates and Jacobs (2020) used VAEs to model biases and set-size effects in human visual perception and memory. We believe this is an encouraging early step toward developing capacity-limited, generative models of human vision.
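To make the capacity penalty concrete, here is a minimal sketch of a β-VAE training loss in PyTorch. It is an illustration under our own assumptions: the layer sizes, the Gaussian posterior and prior, and the value of β are our choices, not details from any of the papers cited above. The KL term plays the role of the RDT “rate,” and the reconstruction error plays the role of “distortion.”

```python
# Minimal beta-VAE sketch (illustrative; hyperparameters are our choices,
# not those of the cited papers).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaVAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)       # posterior mean
        self.logvar = nn.Linear(hidden_dim, latent_dim)   # posterior log-variance
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, input_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

def beta_vae_loss(x, x_hat, mu, logvar, beta=4.0):
    # Reconstruction error: the "distortion" term.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # KL divergence from q(z|x) to the unit-Gaussian prior: the "rate" term.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl  # beta > 1 tightens the capacity penalty
```

Setting β greater than 1 tightens the bottleneck, trading reconstruction fidelity for a lower “rate,” which is exactly the regime in which lossy, capacity-limited behavior emerges.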
The desire for efficient representations also leads to componential, or part-based, approaches, and generative models naturally lend themselves to representing concepts in terms of parts and the relationships among them, as humans do (in contrast to DNNs, as the target article points out, citing German and Jacobs, 2020, and Erdogan and Jacobs, 2017). The same basic parts can be used to create a wide variety of distinct objects simply by changing the relationships between them, an idea at the heart of many perceptual and cognitive models (e.g., Biederman, 1987). Learning new object concepts thereby becomes more efficient: once a part has been learned, it can be reused in the representation and construction of any object concept, including novel ones. This idea can be extended further by supposing that parts are themselves made of subparts, and so on, producing hierarchical, componential generative models (e.g., Lake, Salakhutdinov, & Tenenbaum, 2015; Nash & Williams, 2017).
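As a toy illustration of this compositional idea (entirely our own construction, not a model from any of the papers cited above), the sketch below generates distinct “objects” by sampling parts from a shared library together with relations that arrange them; the part and relation names are arbitrary placeholders.

```python
# Toy sketch of part-based generation (our illustration only).
# Objects are sampled by choosing parts from a shared library and
# relations that arrange them; reusing the same parts with different
# relations yields distinct objects.
import random

PARTS = ["cylinder", "block", "wedge", "handle"]            # shared part library
RELATIONS = ["on-top-of", "attached-to-side-of", "inside"]  # spatial relations

def sample_object(n_parts=3):
    """Sample an object as a list of (part, relation, anchor-index) triples."""
    parts = [(random.choice(PARTS), None, None)]  # root part has no relation
    for i in range(1, n_parts):
        part = random.choice(PARTS)
        relation = random.choice(RELATIONS)
        anchor = random.randrange(i)              # attach to an earlier part
        parts.append((part, relation, anchor))
    return parts

random.seed(0)
for _ in range(2):
    print(sample_object())
```

Because the part library is shared across objects, a newly learned part is immediately available for composing novel object concepts, which is the source of the efficiency noted above; a hierarchical version would let each part expand into subparts in the same way.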
To be sure, a capacity-limited, generative approach is not going to “solve” cognitive modeling overnight. It still faces major obstacles such as computationally expensive inference and a lack of objective criteria with which to judge the quality of its synthesized instances. However, we are optimistic that these issues can be resolved, and we hope the target article inspires researchers to look beyond the established discriminative DNN paradigm. Perhaps if capacity-limited, generative models receive as much research attention and development as discriminative models have, we can look forward to significant advances in both computational cognitive modeling and machine learning.
Financial support
This work was funded by NSF research grants BCS-1824737 and DRL-1561335.
Competing interest
None.