
Understanding visual scenes

Published online by Cambridge University Press: 28 March 2018

CARINA SILBERER
Affiliation: DTCL, Universitat Pompeu Fabra, Roc Boronat 138, 08018 Barcelona, Spain. E-mail: [email protected]

JASPER UIJLINGS
Affiliation: School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh, EH8 9AB, UK. E-mail: [email protected]

MIRELLA LAPATA
Affiliation: ILCC, School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh, EH8 9AB, UK. E-mail: [email protected]

Abstract

A growing body of recent work focuses on the challenging problem of scene understanding using a variety of cross-modal methods that fuse techniques from image and text processing. In this paper, we develop representations for the semantics of scenes by explicitly encoding the objects detected in them and their spatial relations. We represent image content via two well-known types of tree representation, namely constituents and dependencies. Our representations are created deterministically, can be applied to any image dataset irrespective of the task at hand, and are amenable to standard NLP tools developed for tree-based structures. We show that syntax-based statistical machine translation (SMT) and tree kernel methods can be applied to build models for image description generation and image-based retrieval. Experimental results on real-world images demonstrate the effectiveness of the framework.
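To make the idea concrete, below is a minimal, purely illustrative Python sketch of one way such a deterministic construction could look: object detections (labels plus bounding boxes) are linked into a dependency-style tree whose edges carry spatial relations. The Detection class, the toy relation inventory (left-of, right-of, above, below), and the largest-object-as-head heuristic are all assumptions made here for illustration, not the representation scheme actually used in the paper.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    label: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2), origin at top-left

def spatial_relation(a: Detection, b: Detection) -> str:
    """Deterministically name the position of box a relative to box b
    (a toy relation inventory; the paper's set may differ)."""
    if a.box[2] < b.box[0]:
        return "left-of"
    if a.box[0] > b.box[2]:
        return "right-of"
    a_cy = (a.box[1] + a.box[3]) / 2
    b_cy = (b.box[1] + b.box[3]) / 2
    return "above" if a_cy < b_cy else "below"

def dependency_tree(dets: List[Detection]) -> List[Tuple[str, str, str]]:
    """Attach every object to the largest detection (a stand-in 'head'),
    producing (head, relation, dependent) triples: one possible
    deterministic tree construction, chosen here for brevity."""
    def area(d: Detection) -> float:
        return (d.box[2] - d.box[0]) * (d.box[3] - d.box[1])
    head = max(dets, key=area)
    return [(head.label, spatial_relation(d, head), d.label)
            for d in dets if d is not head]

dets = [Detection("table", (0, 50, 100, 100)),
        Detection("cup", (10, 30, 25, 50)),
        Detection("lamp", (80, 0, 95, 40))]
print(dependency_tree(dets))
# -> [('table', 'above', 'cup'), ('table', 'above', 'lamp')]

The point of the sketch is only that the mapping from detections to a tree can involve no learned parameters, which is what makes such representations applicable to any image dataset irrespective of the task at hand.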

Type: Articles
Copyright: © Cambridge University Press 2018

