Natural Language Engineering: Volume 24 - Language for Images

From image to language and back again
A. BELZ, T.L. BERG, L. YU
Published online by Cambridge University Press:

23 April 2018, pp. 325-362
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Work in computer vision and natural language processing involving images and text has been experiencing explosive growth over the past decade, with a particular boost coming from the neural network revolution. The present volume brings together five research articles from several different corners of the area: multilingual multimodal image description (Frank et al.), multimodal machine translation (Madhyastha et al., Frank et al.), image caption generation (Madhyastha et al., Tanti et al.), visual scene understanding (Silberer et al.), and multimodal learning of high-level attributes (Sorodoc et al.). In this article, we touch upon all of these topics as we review work involving images and text under the three main headings of image description (Section 2), visually grounded referring expression generation (REG) and comprehension (Section 3), and visual question answering (VQA) (Section 4).

Learning quantification from images: A structured neural architecture
I. SORODOC, S. PEZZELLE, A. HERBELOT, M. DIMICCOLI, R. BERNARDI
Published online by Cambridge University Press:

02 April 2018, pp. 363-392
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Major advances have recently been made in merging language and vision representations. Most tasks considered so far have confined themselves to the processing of objects and lexicalised relations amongst objects (content words). We know, however, that humans (even pre-school children) can abstract over raw multimodal data to perform certain types of higher level reasoning, expressed in natural language by function words. A case in point is given by their ability to learn quantifiers, i.e. expressions like few, some and all. From formal semantics and cognitive linguistics, we know that quantifiers are relations over sets which, as a simplification, we can see as proportions. For instance, in most fish are red, most encodes the proportion of fish which are red fish. In this paper, we study how well current neural network strategies model such relations. We propose a task where, given an image and a query expressed by an object–property pair, the system must return a quantifier expressing which proportions of the queried object have the queried property. Our contributions are twofold. First, we show that the best performance on this task involves coupling state-of-the-art attention mechanisms with a network architecture mirroring the logical structure assigned to quantifiers by classic linguistic formalisation. Second, we introduce a new balanced dataset of image scenarios associated with quantification queries, which we hope will foster further research in this area.

Assessing multilingual multimodal image description: Studies of native speaker preferences and translator choices
STELLA FRANK, DESMOND ELLIOTT, LUCIA SPECIA
Published online by Cambridge University Press:

23 April 2018, pp. 393-413
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Two studies on multilingual multimodal image description provide empirical evidence towards two questions at the core of the task: (i) whether target language speakers prefer descriptions generated directly in their native language, as compared to descriptions translated from a different language; (ii) whether images improve human translation of descriptions. These results provide guidance for future work in multimodal natural language processing by first showing that on the whole, translations are not distinguished from native language descriptions, and second delineating and quantifying the information gained from the image during the human translation task.

The role of image representations in vision to language tasks
PRANAVA MADHYASTHA, JOSIAH WANG, LUCIA SPECIA
Published online by Cambridge University Press:

21 March 2018, pp. 415-439
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Tasks that require modeling of both language and visual information, such as image captioning, have become very popular in recent years. Most state-of-the-art approaches make use of image representations obtained from a deep neural network, which are used to generate language information in a variety of ways with end-to-end neural-network-based models. However, it is not clear how different image representations contribute to language generation tasks. In this paper, we probe the representational contribution of the image features in an end-to-end neural modeling framework and study the properties of different types of image representations. We focus on two popular vision to language problems: The task of image captioning and the task of multimodal machine translation. Our analysis provides interesting insights into the representational properties and suggests that end-to-end approaches implicitly learn a visual-semantic subspace and exploit the subspace to generate captions.

Understanding visual scenes
CARINA SILBERER, JASPER UIJLINGS, MIRELLA LAPATA
Published online by Cambridge University Press:

28 March 2018, pp. 441-465
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
A growing body of recent work focuses on the challenging problem of scene understanding using a variety of cross-modal methods which fuse techniques from image and text processing. In this paper, we develop representations for the semantics of scenes by explicitly encoding the objects detected in them and their spatial relations. We represent image content via two well-known types of tree representations, namely constituents and dependencies. Our representations are created deterministically, can be applied to any image dataset irrespective of the task at hand, and are amenable to standard NLP tools developed for tree-based structures. We show that we can apply syntax-based SMT and tree kernel methods in order to build models for image description generation and image-based retrieval. Experimental results on real-world images demonstrate the effectiveness of the framework.

Where to put the image in an image caption generator
MARC TANTI, ALBERT GATT, KENNETH P. CAMILLERI
Published online by Cambridge University Press:

23 April 2018, pp. 467-489
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
When a recurrent neural network (RNN) language model is used for caption generation, the image information can be fed to the neural network either by directly incorporating it in the RNN – conditioning the language model by ‘injecting’ image features – or in a layer following the RNN – conditioning the language model by ‘merging’ image features. While both options are attested in the literature, there is as yet no systematic comparison between the two. In this paper, we empirically show that it is not especially detrimental to performance whether one architecture is used or another. The merge architecture does have practical advantages, as conditioning by merging allows the RNN’s hidden state vector to shrink in size by up to four times. Our results suggest that the visual and linguistic modalities for caption generation need not be jointly encoded by the RNN as that yields large, memory-intensive models with few tangible advantages in performance; rather, the multimodal integration should be delayed to a subsequent stage.

NLE volume 24 issue 3 Cover and Front matter
Published online by Cambridge University Press:

23 April 2018, pp. f1-f2
- Article
- - You have access
- PDF
- Export citation

NLE volume 24 issue 3 Cover and Back matter
Published online by Cambridge University Press:

23 April 2018, pp. b1-b2
- Article
- - You have access
- PDF
- Export citation

Natural Language Processing

Refine listing

Actions for selected content:

Natural Language Engineering, Language for Images

Articles

From image to language and back again

Learning quantification from images: A structured neural architecture

Assessing multilingual multimodal image description: Studies of native speaker preferences and translator choices

The role of image representations in vision to language tasks

Understanding visual scenes

Where to put the image in an image caption generator

Front Cover (OFC, IFC) and matter

NLE volume 24 issue 3 Cover and Front matter

Back Cover (IBC, OBC) and matter

NLE volume 24 issue 3 Cover and Back matter

Natural Language Processing

Refine listing

Actions for selected content:

Save Search

Natural Language Engineering, Language for Images

Articles

Front Cover (OFC, IFC) and matter

Back Cover (IBC, OBC) and matter