
Learning quantification from images: A structured neural architecture

Published online by Cambridge University Press: 02 April 2018

I. SORODOC
Affiliation:
Center for Mind/Brain Sciences (CIMeC), University of Trento, Palazzo Fedrigotti - corso Bettini 31, 38068 Rovereto (TN), Italy
S. PEZZELLE
Affiliation:
Center for Mind/Brain Sciences (CIMeC), University of Trento, Palazzo Fedrigotti - corso Bettini 31, 38068 Rovereto (TN), Italy
A. HERBELOT
Affiliation:
Center for Mind/Brain Sciences (CIMeC), University of Trento, Palazzo Fedrigotti - corso Bettini 31, 38068 Rovereto (TN), Italy
M. DIMICCOLI
Affiliations:
University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
Computer Vision Center, Edificio O, Campus UAB, 08193 Bellaterra (Cerdanyola), Barcelona, Spain
R. BERNARDI
Affiliations:
Center for Mind/Brain Sciences (CIMeC), University of Trento, Palazzo Fedrigotti - corso Bettini 31, 38068 Rovereto (TN), Italy
Department of Information Engineering and Computer Science (DISI), University of Trento, Via Sommarive 9, I-38123 Povo (TN), Italy

Abstract

Major advances have recently been made in merging language and vision representations. Most tasks considered so far have confined themselves to the processing of objects and lexicalised relations amongst objects (content words). We know, however, that humans (even pre-school children) can abstract over raw multimodal data to perform certain types of higher-level reasoning, expressed in natural language by function words. A case in point is their ability to learn quantifiers, i.e. expressions like few, some and all. From formal semantics and cognitive linguistics, we know that quantifiers are relations over sets which, as a simplification, we can see as proportions. For instance, in most fish are red, most encodes the proportion of fish that are red. In this paper, we study how well current neural network strategies model such relations. We propose a task where, given an image and a query expressed by an object–property pair, the system must return a quantifier expressing what proportion of the queried objects have the queried property. Our contributions are twofold. First, we show that the best performance on this task involves coupling state-of-the-art attention mechanisms with a network architecture mirroring the logical structure assigned to quantifiers by classic linguistic formalisation. Second, we introduce a new balanced dataset of image scenarios associated with quantification queries, which we hope will foster further research in this area.
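As a concrete illustration of the relational view sketched in the abstract (these are textbook generalized-quantifier truth conditions from formal semantics, not notation reproduced from the paper itself), a quantifier relates a restrictor set A (e.g., the fish in the image) to a scope set B (e.g., the red things), and the task described above amounts to predicting which relation holds from the proportion |A ∩ B| / |A| estimated from the image:

\[
\begin{aligned}
\textit{some}(A,B) &\iff |A \cap B| > 0\\
\textit{all}(A,B) &\iff |A \cap B| = |A|\\
\textit{most}(A,B) &\iff \frac{|A \cap B|}{|A|} > \tfrac{1}{2}
\end{aligned}
\]

Vague quantifiers such as few have no single agreed-upon threshold; under the proportional simplification adopted above, they can be thought of as covering a low but non-zero band of |A ∩ B| / |A|.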

Copyright © Cambridge University Press 2018
