
Let's move forward: Image-computable models and a common model evaluation scheme are prerequisites for a scientific understanding of human vision

Published online by Cambridge University Press:  06 December 2023

James J. DiCarlo
Affiliation:
Dept. of Brain and Cognitive Sciences, Quest for Intelligence, and McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA; https://dicarlolab.mit.edu; https://evlab.mit.edu/; https://mschrimpf.com/
Daniel L. K. Yamins
Affiliation:
Wu Tsai Neurosciences Institute, Stanford University, Stanford, CA, USA; http://neuroailab.stanford.edu/research.html
Michael E. Ferguson
Affiliation:
Dept. of Brain and Cognitive Sciences, Quest for Intelligence, and McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA; https://dicarlolab.mit.edu; https://evlab.mit.edu/; https://mschrimpf.com/
Evelina Fedorenko
Affiliation:
Dept. of Brain and Cognitive Sciences, Quest for Intelligence, and McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA; https://dicarlolab.mit.edu; https://evlab.mit.edu/; https://mschrimpf.com/
Matthias Bethge
Affiliation:
Tübingen AI Center, University of Tübingen, Tübingen, Germany; https://bethgelab.org/
Tyler Bonnen
Affiliation:
Wu Tsai Neurosciences Institute, Stanford University, Stanford, CA, USA; http://neuroailab.stanford.edu/research.html
Martin Schrimpf
Affiliation:
Dept. of Brain and Cognitive Sciences, Quest for Intelligence, and McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA; https://dicarlolab.mit.edu; https://evlab.mit.edu/; https://mschrimpf.com/; École polytechnique fédérale de Lausanne, Lausanne, Switzerland

Abstract

In the target article, Bowers et al. dispute deep artificial neural network (ANN) models as the currently leading models of human vision without producing alternatives. They eschew the use of public benchmarking platforms to compare vision models with the brain and behavior, and they advocate for a fragmented, phenomenon-specific modeling approach. These are unconstructive to scientific progress. We outline how the Brain-Score community is moving forward to add new model-to-human comparisons to its community-transparent suite of benchmarks.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2023. Published by Cambridge University Press

Common ground

As vision scientists, we believe that an understanding of human visual processing should ultimately explain all visually driven behavior. Because vision operates – by definition – on visual input, a science of human vision ultimately requires “image-computable” models and theories that produce those models. Bowers et al. endorse this view, as every psychology experiment they suggest focuses on the effects of manipulating combinations of image pixels.

On empirical tests of vision models

As empirical vision scientists, we also believe that advances in understanding visual processing will arise from rigorous, community-transparent tests of model predictions against empirical observations from the brain (e.g., patterns of neural firing) and the mind (e.g., patterns of behavior). As such, we and others have contributed to the creation of an open-source platform where any member of the vision community can find the leading models, test new models, see the most model-disruptive experimental benchmarks, and add new benchmarks (www.brain-score.org; Schrimpf et al., 2018, 2020).

The most constructive contribution of Bowers et al. is the identification of a set of human behavioral vision findings that the authors believe will not be well-predicted by currently leading deep artificial neural network (ANN) models (target article, sect. 4.1). To evaluate this claim, the Brain-Score community is turning these empirical findings into accessible benchmarks that current (and future) models of human visual processing can be evaluated on. The results of this evaluation, especially if these benchmarks indeed present a challenge for current ANN models, should and would motivate next steps in human vision modeling. We report the following status at the time of this writing:

On current vision models

We are not dogmatically committed to any current deep ANN model of human vision; as the Brain-Score effort has helped illuminate, none of them is a perfect model. However, we disagree with Bowers et al.'s claim that deep ANNs are not the currently leading models of human ventral visual processing. Bowers et al. critique ANN models without offering a better alternative: They imply that better models exist or should exist, but do not elaborate on what those models are. In the absence of an alternative model, it is justifiable to refer to ANNs as the currently best models. In fact, as can be seen on Brain-Score, some ANN models predict neural responses moderately well at multiple visual processing stages, and those same models also predict, to some extent, even quite challenging behavioral data patterns (Geirhos et al., 2021; Rajalingham et al., 2018).

Bowers et al. eschew community-transparent suites of benchmarks, yet they imply an alternative notion of vision model evaluation that is somehow not a suite of benchmarks. But again, they do not produce a feasible alternative. Of course, the model rankings produced by benchmarks depend on the choice of datasets and metrics used for evaluation. We will continue to help the Brain-Score community expand the range of datasets, and we are not dogmatically committed to any particular choice of metric. Different subcommunities may prefer to initially focus on different metrics (e.g., to know the currently best behavioral model regardless of underlying brain alignment, or vice versa), and Brain-Score should support those different benchmark weightings. But we see no way to support advances in models of vision other than open, transparent, and community-driven model comparison.
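To make the point about benchmark weightings concrete, the following is a minimal sketch of how one table of per-benchmark scores can yield different model rankings under different weightings. It is not the Brain-Score implementation or API: the function `rank_models`, the model and benchmark names, and all score values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# model name -> benchmark name -> score in [0, 1]
Scores = Dict[str, Dict[str, float]]


def rank_models(scores: Scores, weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank models by a weighted mean of their per-benchmark scores."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one benchmark needs a positive weight")
    ranked = []
    for model, per_benchmark in scores.items():
        # A missing score counts as 0.0 (an assumption of this sketch).
        weighted = sum(w * per_benchmark.get(b, 0.0) for b, w in weights.items())
        ranked.append((model, weighted / total))
    # Highest aggregate first; ties are broken alphabetically.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Illustrative values only, not real Brain-Score results.
scores = {
    "model_a": {"neural": 0.60, "behavior": 0.30},
    "model_b": {"neural": 0.40, "behavior": 0.45},
}

behavior_only = rank_models(scores, {"neural": 0.0, "behavior": 1.0})
neural_only = rank_models(scores, {"neural": 1.0, "behavior": 0.0})
equal_weights = rank_models(scores, {"neural": 1.0, "behavior": 1.0})

for label, ranking in [("behavior only", behavior_only),
                       ("neural only", neural_only),
                       ("equal weights", equal_weights)]:
    print(label, [name for name, _ in ranking])
```

In this toy table, a subcommunity that weights only behavior would rank `model_b` first, whereas neural-only or equal weighting would rank `model_a` first. The underlying scores are identical in all three cases; only the weighting changes.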

On building new vision models

Bowers et al. appear to favor a classic approach in which a separate model is built for each psychological phenomenon, using specialized stimuli that are hand-crafted to enable certain visual features to be well-defined – for example, illusory contours or shape primitives. The appeal of this approach is that it reduces the complexity of a high-dimensional pixel input space into small intuitive sets of features that enable the formulation and testing of conceptual hypotheses about vision – for example, the mechanisms of a particular class of visual illusions. However, because this approach requires dramatically restricting the stimuli under consideration, such hypotheses often cover a near-zero fraction of image space. In our opinion, the idea that a universal scientific model of human vision will result from sets of fragmented explanations that only engage a tiny fraction of image space is illusory (Newell, 1973).

In contrast, the approach we favor, which starts with image-computable models, enables tangible progress toward a unified model of human vision. Transparent tracking of model shortcomings lights the path to this goal. We acknowledge that the image-computability requirement may make the formulation of traditional conceptual tests of a model more challenging. But it by no means makes such tests impossible. Any pattern of behavioral data, including those discussed in the target article, should be translatable into a behavioral benchmark on Brain-Score.
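As a rough illustration of what translating a behavioral data pattern into a benchmark can involve, the sketch below pairs a set of stimuli with per-stimulus human accuracies and scores any image-computable model by how well its per-stimulus outputs track the human pattern. Everything here is an assumption made for illustration: the class `BehavioralBenchmark`, the toy "images," the accuracy values, and the use of a plain Pearson correlation are not the Brain-Score API or its metrics, which typically also account for the reliability of the human data.

```python
import math
from typing import Callable, List, Sequence

Image = Sequence[float]  # stand-in for pixel data; real benchmarks use actual images


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


class BehavioralBenchmark:
    """Hypothetical benchmark: stimuli plus the human accuracy on each stimulus."""

    def __init__(self, stimuli: List[Image], human_accuracy: List[float]):
        if len(stimuli) != len(human_accuracy):
            raise ValueError("one human accuracy value is needed per stimulus")
        self.stimuli = stimuli
        self.human_accuracy = human_accuracy

    def score(self, model: Callable[[Image], float]) -> float:
        # The model sees only the image and must return its predicted accuracy.
        model_accuracy = [model(image) for image in self.stimuli]
        return pearson(model_accuracy, self.human_accuracy)


# Toy data, invented for this example.
benchmark = BehavioralBenchmark(
    stimuli=[[0.1, 0.1], [0.5, 0.5], [0.9, 0.9]],
    human_accuracy=[0.2, 0.6, 0.9],
)


def brightness_model(image: Image) -> float:
    """Toy image-computable model: predicted accuracy is the mean pixel value."""
    return sum(image) / len(image)


def inverted_model(image: Image) -> float:
    """Toy model that predicts the opposite pattern."""
    return 1.0 - brightness_model(image)


aligned_score = benchmark.score(brightness_model)
misaligned_score = benchmark.score(inverted_model)
print(aligned_score, misaligned_score)
```

The design choice that matters is the interface: the benchmark never inspects the model's internals, so any model that maps pixels to a behavioral output can be scored on the same data, whether it is a deep ANN or a hand-built model of a single phenomenon.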

Moving forward

Ultimately, we think that the advantages that image-computable models have in enabling evaluation of predictions about diverse visual stimuli and phenomena heavily outweigh their disadvantages. And maintaining and expanding a common evaluation scheme for image-computable models of vision is, in our view, a prerequisite for channeling the valuable contributions of vision science – across neuroscience, cognitive science, psychology, and computer vision – toward convergence on the best scientific models of human vision. Let's move forward!

Acknowledgments

We thank Kohitij Kar, Micheal Lee, Nancy Kanwisher, Nikolaus Kriegeskorte, and Chris Shay for helpful discussions and support.

Financial support

This work was supported in part by the Semiconductor Research Corporation (SRC) and DARPA (J. J. D.), Simons Foundation (542965, J. J. D.), Office of Naval Research (MURI N00014-21-1-2801; N00014-20-1-2589, J. J. D., D. L. K. Y.), and National Science Foundation (2124136, J. J. D.).

Competing interest

M. B. is a co-founder of Maddox AI. All other authors have no competing interest.

References

Baker, N., & Elder, J. H. (2022). Deep learning models fail to capture the configural nature of human shape perception. iScience, 25(9), 104913.
Bowers, J. S., & Jones, K. W. (2007). Detecting objects is easier than categorizing them. Quarterly Journal of Experimental Psychology, 61, 552–557.
Geirhos, R., Narayanappa, K., Mitzkus, B., Thieringer, T., Bethge, M., Wichmann, F. A., & Brendel, W. (2021). Partial success in closing the gap between human and machine vision. Advances in Neural Information Processing Systems, 34, 23885–23899.
Mack, M. L., Gauthier, I., Sadr, J., & Palmeri, T. J. (2008). Object detection and basic-level categorization: Sometimes you know it is there before you know what it is. Psychonomic Bulletin & Review, 15(1), 28–35.
Newell, A. (1973). You can't play 20 questions with nature and win: Projective comments on the papers of this symposium. Visual information processing. Academic Press.
Puebla, G., & Bowers, J. S. (2022). Can deep convolutional neural networks support relational reasoning in the same-different task? Journal of Vision, 22(10), 1–18.
Rajalingham, R., Issa, E. B., Bashivan, P., Kar, K., Schmidt, K., & DiCarlo, J. J. (2018). Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience, 38(33), 7255–7269.
Saarela, T. P., Sayim, B., Westheimer, G., & Herzog, M. H. (2009). Global stimulus configuration modulates crowding. Journal of Vision, 9(2), 5.
Schrimpf, M., Kubilius, J., Hong, H., Majaj, N. J., Rajalingham, R., Issa, E. B., … DiCarlo, J. J. (2018). Brain-Score: Which artificial neural network for object recognition is most brain-like? bioRxiv, 407007.
Schrimpf, M., Kubilius, J., Lee, M. J., Murty, R., Apurva, N., Ajemian, R., & DiCarlo, J. J. (2020). Integrative benchmarking to advance neurally mechanistic models of human intelligence. Neuron, 108(3), 413–423.
Spoerer, C. J., Kietzmann, T. C., Mehrer, J., Charest, I., & Kriegeskorte, N. (2020). Recurrent neural networks can explain flexible trading of speed and accuracy in biological vision. PLoS Computational Biology, 16(10), e1008215.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107–115.