
From image to language and back again

Published online by Cambridge University Press: 23 April 2018

A. BELZ
Affiliation: Computing, Engineering and Mathematics, University of Brighton, Lewes Road, Brighton BN2 4GJ, UK

T.L. BERG
Affiliation: Computer Science, UNC Chapel Hill, Chapel Hill, NC 27599-3175, USA

L. YU
Affiliation: Computer Science, UNC Chapel Hill, Chapel Hill, NC 27599-3175, USA

Extract

Work in computer vision and natural language processing involving images and text has been experiencing explosive growth over the past decade, with a particular boost coming from the neural network revolution. The present volume brings together five research articles from several different corners of the area: multilingual multimodal image description (Frank et al.), multimodal machine translation (Madhyastha et al., Frank et al.), image caption generation (Madhyastha et al., Tanti et al.), visual scene understanding (Silberer et al.), and multimodal learning of high-level attributes (Sorodoc et al.). In this article, we touch upon all of these topics as we review work involving images and text under the three main headings of image description (Section 2), visually grounded referring expression generation (REG) and comprehension (Section 3), and visual question answering (VQA) (Section 4).

Type: Articles
Copyright: © Cambridge University Press 2018

