
Where to put the image in an image caption generator

Published online by Cambridge University Press: 23 April 2018

MARC TANTI
Affiliation:
Institute of Linguistics and Language Technology, University of Malta, Msida MSD, Malta e-mail: [email protected], [email protected]
ALBERT GATT
Affiliation:
Institute of Linguistics and Language Technology, University of Malta, Msida MSD, Malta e-mail: [email protected], [email protected]
KENNETH P. CAMILLERI
Affiliation:
Department of Systems and Control Engineering, University of Malta, Msida MSD, Malta e-mail: [email protected]

Abstract

When a recurrent neural network (RNN) language model is used for caption generation, the image information can be fed to the network either by incorporating it directly into the RNN – conditioning the language model by 'injecting' image features – or in a layer following the RNN – conditioning the language model by 'merging' image features. While both options are attested in the literature, there is as yet no systematic comparison between the two. In this paper, we empirically show that neither choice is especially detrimental to performance. The merge architecture does, however, have practical advantages: conditioning by merging allows the RNN's hidden state vector to be up to four times smaller. Our results suggest that the visual and linguistic modalities for caption generation need not be jointly encoded by the RNN, as that yields large, memory-intensive models with few tangible advantages in performance; rather, multimodal integration should be delayed to a subsequent stage.
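The contrast between the two conditioning strategies can be made concrete with a short sketch. The code below is a minimal illustration in Keras-style Python, not the authors' implementation; the layer sizes, the 4096-dimensional CNN feature vector, the particular 'inject' variant shown (image features concatenated with the word embedding at every time step), and names such as build_inject_model and build_merge_model are all assumptions made for the example.

# Minimal sketch of 'inject' vs. 'merge' conditioning (illustrative only;
# not the authors' exact models). Sizes below are placeholder assumptions.
from tensorflow.keras.layers import (Input, Embedding, LSTM, Dense,
                                     RepeatVector, Concatenate)
from tensorflow.keras.models import Model

V, T, IMG_DIM, EMB, HID = 10000, 20, 4096, 256, 256  # vocab, caption length, image/embedding/hidden sizes


def build_inject_model():
    """'Inject': image features enter the RNN itself, so vision and
    language are mixed inside the recurrent state."""
    img = Input(shape=(IMG_DIM,))
    words = Input(shape=(T,), dtype='int32')
    x = Embedding(V, EMB)(words)                 # (T, EMB) word embeddings
    img_proj = Dense(EMB)(img)                   # project image features
    img_seq = RepeatVector(T)(img_proj)          # repeat at every time step
    x = Concatenate()([x, img_seq])              # word + image per step
    h = LSTM(HID)(x)                             # RNN encodes both modalities
    out = Dense(V, activation='softmax')(h)      # next-word distribution
    return Model([img, words], out)


def build_merge_model():
    """'Merge': the RNN encodes only the word sequence; image features are
    combined with its output in a later layer."""
    img = Input(shape=(IMG_DIM,))
    words = Input(shape=(T,), dtype='int32')
    x = Embedding(V, EMB)(words)
    h = LSTM(HID)(x)                             # purely linguistic state
    img_proj = Dense(HID)(img)
    merged = Concatenate()([h, img_proj])        # late multimodal fusion
    out = Dense(V, activation='softmax')(merged)
    return Model([img, words], out)

The contrast is visible in the final layers: in the inject model the LSTM's hidden state must carry both modalities, whereas in the merge model it encodes only the word sequence and the image is combined with it afterwards, which is what allows the recurrent state to be smaller.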

Type: Articles

Copyright © Cambridge University Press 2018

