I. INTRODUCTION
Computer vision (CV) and natural language processing (NLP) are two of the most fundamental disciplines in the broad area of artificial intelligence (AI). CV is a field of research that explores techniques to teach computers to see and understand digital content such as images and videos. NLP is a branch of linguistics that enables computers to process, interpret, and even generate human language. With the rise and development of deep learning over the past decade, there has been a steady momentum of innovations and breakthroughs that convincingly push the limits and improve the state of the art of both vision and language modeling. An interesting observation is that research in the two areas has started to interact, and much prior experience has shown that such interaction can naturally move us closer to the full circle of human intelligence.
In general, the interactions between vision and language have proceeded along two dimensions: vision to language and language to vision. The former predominantly recognizes or describes visual content with a set of individual words or a natural sentence in the form of tags [Reference Yao, Mei, Ngo and Li1], answers [Reference Anderson2], captions [Reference Yao, Pan, Li, Qiu and Mei3–Reference Yao, Pan, Li and Mei5], and comments [Reference Li, Yao, Mei, Chao and Rui6]. For example, a tag usually denotes a specific object, action, or event in visual content. An answer is a response to a question about the details depicted in an image or a video. A caption goes beyond tags or answers by producing a natural-language utterance (usually a sentence), and a comment is also a sentence, one which expresses an emotional state about visual content. The latter, language to vision, generates visual content from natural-language inputs. One typical application is to create an image or a video from text. For instance, given the textual description “this small bird has short beak and dark stripe down the top, the wings are a mix of brown, white, and black,” the goal of text-to-image synthesis is to generate a bird image that matches all of these details.
This paper reviews recent state-of-the-art advances in the AI technologies that boost both vision to language, particularly image/video captioning, and language to vision. Real-world deployments in the two fields are also presented as examples of how AI transforms customer experiences and enhances user engagement in industrial applications. The remaining sections are organized as follows. Section II describes the development of vision to language by outlining a brief road map of key technologies for image/video captioning, distilling a typical encoder–decoder structure, and summarizing evaluations on a popular benchmark. Practical applications of vision to language are also presented. Section III details the technical advancements in language to vision in terms of different conditions and strategies for generation, followed by a summary of progress on language to image, language to video, and AI-empowered applications. Finally, we conclude the paper in Section IV.
II. VISION TO LANGUAGE
This section summarizes the development of vision to language (particularly image/video captioning) in several aspects, from the road map of key techniques and benchmarks and typical encoder–decoder architectures to the evaluation results of representative methods.
A) Road map of vision to language
In the past 10 years, we have witnessed researchers striving to push the limits of vision-to-language systems (e.g. image/video captioning). Figure 1 depicts the road map of the techniques behind vision (image/video) to language and the corresponding benchmarks. Specifically, the year 2015 was a watershed in captioning. Before that, the mainstream of captioning in the image domain was template-based methods [Reference Kulkarni14,Reference Yang, Teo, Daumé III and Aloimonos15]. The basic idea is to detect the objects or actions in an image and slot these words into pre-defined sentence templates as subject, verb, and object. By that time, most of the image captioning datasets, such as Flickr30K and MSCOCO, were already available. In 2015, deep learning-based image captioning models were first presented. The common design [Reference Vinyals, Toshev, Bengio and Erhan13] is to employ a Convolutional Neural Network (CNN) as an image encoder to produce image representations and a Long Short-Term Memory (LSTM) decoder to generate the sentence. The attention mechanism [Reference Xu16], which locates the most relevant spatial regions when predicting each word, was also proposed in the same year. After that, the area of image captioning grew very fast. Researchers came up with a series of innovations, such as augmenting image features with semantic attributes [Reference Yao, Pan, Li, Qiu and Mei3] or visual relations [Reference Yao, Pan, Li and Mei4], predicting novel objects by leveraging unpaired training data [Reference Li, Yao, Pan, Chao and Mei17,Reference Yao, Pan, Li and Mei18], and even going a step further to perform language navigation [Reference Wang19]. Another direction extends captioning in the image domain to produce multiple sentences or phrases for an image, aiming to recapitulate more details within the image.
Dense image captioning [Reference Johnson, Karpathy and Fei-Fei20] and image paragraph generation [Reference Wang, Pan, Yao, Tang and Mei21] are typical examples, which generate a set of region descriptions or a paragraph that describes the image at a finer granularity.
Captioning in the video domain also started in 2015, when researchers began to remould the CNN-plus-RNN captioning framework for the video scenario. A series of techniques (e.g. temporal attention, embedding, or attributes) have been explored to further improve video captioning. Concretely, the technique of Yao et al. [Reference Yao22] is one of the early attempts that incorporates a temporal attention mechanism into the captioning framework by learning to attend to the most relevant frames at each decoding time step. Pan et al. [Reference Pan, Mei, Yao, Li and Rui23] integrate LSTM with semantic embedding to preserve the semantic relevance between the video content and the entire sentence. Pan et al. [Reference Pan, Yao, Li and Mei24] further augment the captioning model to emphasize the detected visual attributes in the generated sentence. It is also worth mentioning that the MSR-VTT video captioning dataset [Reference Xu, Mei, Yao and Rui25], released in 2016, has been widely used and downloaded by more than 100 groups worldwide. Most recently, Aafaq et al. [Reference Aafaq, Akhtar, Liu, Gilani and Mian26] apply a short Fourier transform to the frame-level features along the temporal dimension to fuse them into a video-level representation and further enhance video captioning. Another recent attempt for video captioning speeds up the training procedure by fully employing convolutions in both the encoder and decoder networks [Reference Chen, Pan, Li, Yao, Chao and Mei27]. Nevertheless, considering that videos in real life are usually long and contain multiple events, conventional video captioning methods, which generate only one caption per video, will in general fail to recapitulate all the events in the video.
Hence the task of dense video captioning [Reference Krishna, Hata, Ren, Fei-Fei and Niebles28,Reference Li, Yao, Pan, Chao and Mei29] has recently been introduced, whose ultimate goal is to generate a sentence for each event occurring in the video.
B) Typical architectures
According to the road map of vision to language, the mainstream of modern image captioning follows the structure of a CNN encoder plus an LSTM decoder, as shown in Fig. 2(a). In particular, given an image, image features are first extracted in one of several ways: (1) directly taking the outputs of fully-connected layers as image features [Reference Vinyals, Toshev, Bengio and Erhan13]; (2) incorporating high-level semantic attributes into image features [Reference Yao, Pan, Li, Qiu and Mei3]; (3) performing an attention mechanism to measure the contribution of each image region [Reference Xu16]; (4) extracting region-level features [Reference Anderson2] and further exploring relations [Reference Yao, Pan, Li and Mei4] or image hierarchy [Reference Yao, Pan, Li and Mei5] over the region-level features. The image features are then fed into the LSTM decoder to generate the output sentence, one word at each time step. In the training stage, the next word is generated based on the previous ground-truth words, while during testing the model uses the previously generated words to predict the next word. To bridge this mismatch between training and testing, reinforcement learning [Reference Rennie, Marcheret, Mroueh, Ross and Goel8,Reference Liu, Zhu, Ye, Guadarrama and Murphy9,Reference Ren, Wang, Zhang, Lv and Li30] is usually exploited to directly optimize the LSTM decoder with a sentence-level reward, such as CIDEr or METEOR.
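As a concrete (and deliberately toy) illustration of this decoding loop, the sketch below uses a made-up five-word vocabulary and random NumPy weights in place of a trained CNN encoder and LSTM cell; at inference, each predicted word is fed back as the next input until an end token is produced:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<bos>", "<eos>", "a", "dog", "runs"]  # toy vocabulary
V, D = len(vocab), 8

W_img = rng.standard_normal((D, D)) * 0.1   # projects image features
W_emb = rng.standard_normal((V, D)) * 0.1   # toy word embeddings
W_out = rng.standard_normal((D, V)) * 0.1   # hidden state -> vocabulary logits

def step(h, word_id):
    # One recurrent step; a stand-in for a trained LSTM cell.
    h = np.tanh(h + W_emb[word_id])
    return h, h @ W_out

def greedy_decode(img_feat, max_len=5):
    # Inference: feed the model's own previous prediction back each step.
    h = np.tanh(img_feat @ W_img)
    word_id, words = vocab.index("<bos>"), []
    for _ in range(max_len):
        h, logits = step(h, word_id)
        word_id = int(np.argmax(logits))
        if vocab[word_id] == "<eos>":
            break
        words.append(vocab[word_id])
    return words

caption = greedy_decode(rng.standard_normal(D))
print(caption)
```

In a real system, `step` would be a trained LSTM cell, decoding would typically use beam search rather than pure greedy selection, and training would instead feed the ground-truth previous word at each step (teacher forcing), which is exactly the mismatch that the reinforcement-learning objective addresses.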
Taking inspiration from the recent successes of the Transformer self-attention network [Reference Vaswani31] in machine translation, recent attention has been geared toward exploring Transformer-based structures [Reference Sharma, Ding, Goodman and Soricut32] in image captioning. Figure 2(b) depicts the typical architecture of a Transformer-based encoder–decoder. Different from the CNN encoder plus LSTM decoder, which capitalizes on the LSTM to model word dependency, the Transformer-based encoder–decoder fully utilizes the attention mechanism to capture the global dependencies among inputs. For the encoder, N multi-head self-attention layers are stacked to model the self-attention among input image regions. The decoder contains a stack of N multi-head attention layers, each of which consists of a self-attention sub-layer and a cross-attention sub-layer. More specifically, the self-attention sub-layer is first adopted to capture word dependency, and the cross-attention sub-layer is then utilized to exploit the co-attention across vision (image regions from the encoder) and language (input words).
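The two decoder sub-layers can be sketched with plain scaled dot-product attention; this single-head sketch omits the causal mask, multi-head splitting, and feed-forward layers of a full Transformer, and all shapes here are illustrative assumptions rather than values from any specific implementation:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V

rng = np.random.default_rng(1)
d = 16
words = rng.standard_normal((5, d))      # decoder word states
regions = rng.standard_normal((36, d))   # encoded image regions

# Self-attention sub-layer: each word attends over the word sequence
# (the causal mask used in a real decoder is omitted for brevity).
self_out = attention(words, words, words)

# Cross-attention sub-layer: each word attends over the image regions.
cross_out = attention(self_out, regions, regions)
print(cross_out.shape)  # (5, 16)
```

The same `attention` routine serves both sub-layers; only the source of the keys and values changes, which is precisely the vision–language co-attention described above.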
Similar to the mainstream in image captioning, the typical paradigm in video captioning is also essentially an encoder–decoder structure. A video is first encoded into a set of frame/clip/shot features via a 2D CNN [Reference Vinyals, Toshev, Bengio and Erhan13] or a 3D CNN [Reference Qiu, Yao and Mei33,Reference Tran, Bourdev, Fergus, Torresani and Paluri34]. Next, all the frame-level, clip-level, or shot-level visual features are fused into a video-level representation through pooling [Reference Pan, Mei, Yao, Li and Rui23], attention [Reference Yao22], or an LSTM-based encoder [Reference Venugopalan, Rohrbach, Donahue, Mooney, Darrell and Saenko35]. The video-level features are then fed into an LSTM decoder to produce a natural sentence.
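A minimal sketch of this fusion step, contrasting mean pooling with soft attention pooling over frame features (the feature values and the scoring vector are random placeholders for real CNN outputs and a learned scorer):

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 20, 512
frame_feats = rng.standard_normal((T, d))  # per-frame 2D/3D CNN features

# Mean pooling: the simplest fusion into one video-level vector.
video_mean = frame_feats.mean(axis=0)

# Soft attention pooling: weight frames by a relevance score
# (the scoring vector here is a random stand-in for a learned one).
scores = frame_feats @ rng.standard_normal(d)   # one score per frame
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                            # weights sum to 1
video_attn = alpha @ frame_feats

print(video_mean.shape, video_attn.shape)  # (512,) (512,)
```

In attention-based captioners such as [Reference Yao22], the weights `alpha` are recomputed at every decoding step from the decoder state, so different words can attend to different frames.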
C) Evaluation and applications
Evaluation. Here we summarize the reported performance of representative image captioning methods on the testing server of the popular COCO benchmark [Reference Lin36] in Table 1. In terms of all the evaluation metrics, GCN-LSTM [Reference Yao, Pan, Li and Mei4] and HIP [Reference Yao, Pan, Li and Mei5] lead to performance boosts over the other captioning systems, which verifies the advantage of exploring relations and hierarchical structure among image regions.
Applications. Recently, several emerging applications have involved vision-to-language technology. For example, captioning has been integrated into online chatbots [Reference Pan, Qiu, Yao, Li and Mei37,Reference Tran38], and an AI-created poetry collection [Reference Zhou, Gao, Li and Shum39] has been published in China. At JD.com, we deployed captioning techniques last year for personalized product description generation, which aims to automatically produce compelling recommendation reasons for billions of products.
III. LANGUAGE TO VISION
This section discusses the other direction, “language to vision,” i.e. visual content generation guided by language inputs. We start by reviewing the road map of this area as well as its technical advancements. Then we discuss the open issues and applications, particularly from the perspective of industry.
Visual Content Generation. We briefly introduce the domain of visual generation, since “language to vision” is deeply rooted in the same techniques. Over the past few years, we have witnessed great progress in visual content generation. The origin of visual generation dates back to [Reference Goodfellow40], where multiple networks are jointly trained in an adversarial manner. Subsequent works generate images in specific domains such as faces [Reference Chen, Chen, Zhang, Mitchell and Yu41–Reference Karras, Laine and Aila43] and persons [Reference Ma, Jia, Sun, Schiele, Tuytelaars and Van Gool44–Reference Song, Zhang, Liu and Mei46], as well as in generic domains [Reference Brock, Donahue and Simonyan47,Reference Lučić, Tschannen, Ritter, Zhai, Bachem and Gelly48]. From the perspective of inputs, generation can also be treated as conditioning on different information, e.g. a noise vector [Reference Goodfellow40], a semantic label [Reference Mirza and Osindero49], textual captions [Reference Reed, Akata, Yan, Logeswaran, Schiele and Lee50], a scene graph [Reference Johnson, Gupta and Fei-Fei51], or images [Reference Isola, Zhu, Zhou and Efros52,Reference Zhu, Park, Isola and Efros53]. Among all these works, visual generation based on natural language is one of the most promising branches, since semantics are directly incorporated into the pixel-wise generation process.
A) Road map of language to vision
Figure 3 summarizes the recent development of “language to vision.” In general, both the vision and language modalities are becoming more and more complicated, and the results are much more visually convincing than when the task was first introduced in 2014.
The fundamental architecture is based on a conditional generative adversarial network, where the conditioning input is usually the encoded natural language. Through a series of transposed convolutions, the language input is gradually mapped to a visual image of higher and higher resolution. The key challenges are two-fold: (1) how to interpret the language input, i.e. language representation, and (2) how to align the visual and textual modalities, i.e. the semantic consistency between vision and language. Recent results on single objects (bottom) are already visually plausible to human perception. However, state-of-the-art models still struggle to generate scenes with multiple objects interacting with each other.
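The generator side of this architecture can be sketched as follows, with nearest-neighbor upsampling plus a learned pointwise map standing in for trained transposed-convolution blocks; every size and weight below is an illustrative assumption, not a published configuration:

```python
import numpy as np

rng = np.random.default_rng(3)

def upsample_block(x, W):
    # Stand-in for a transposed-convolution block: double the spatial
    # resolution, then apply a pointwise map and a ReLU nonlinearity.
    x = x.repeat(2, axis=0).repeat(2, axis=1)
    return np.maximum(x @ W, 0.0)

def generate(noise, text_emb, widths=(64, 32, 3)):
    # Conditioning: concatenate the noise vector with the text embedding,
    # reshape to a small spatial seed grid, then upsample stage by stage.
    z = np.concatenate([noise, text_emb])   # (128,)
    x = z.reshape(4, 4, 8)                  # 4x4 seed grid, 8 channels
    c = 8
    for w in widths:
        x = upsample_block(x, rng.standard_normal((c, w)) * 0.1)
        c = w
    return x                                # 32x32 "image", 3 channels

img = generate(rng.standard_normal(64), rng.standard_normal(64))
print(img.shape)  # (32, 32, 3)
```

Each stage doubles the resolution (4 → 8 → 16 → 32), mirroring how the encoded language input is progressively mapped to higher-resolution images; a real GAN would train these weights against a discriminator rather than sampling them randomly.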
B) Technical advancements
The success in language to vision generation is mostly based on the following technical advancements, which have become standard practices commonly accepted by the research community.
Conditioning Input. Following the standard GAN framework [Reference Goodfellow40], Mirza and Osindero [Reference Mirza and Osindero49] derived the conditional GAN, which allows visual generation according to language inputs. The conditioning information can be in any form of language, such as a tag, sentence, paragraph, image, scene graph, or layout. Almost all subsequent works in “language to vision” are based on this conditioning architecture. At that time, however, only MNIST [Reference LeCun, Bottou, Bengio and Haffner54] digits were demonstrated, at low resolution, and the conditioning information was merely a digit label.
Text Encoding. GAN-INT-CLS [Reference Reed, Akata, Yan, Logeswaran, Schiele and Lee50] is the first work based on natural-language inputs. For the first time, it bridges the gap from natural-language sentences to image pixels. The key step is learning a text representation with a recurrent network to capture visual clues; the rest mostly follows [Reference Mirza and Osindero49]. Additionally, a matching-aware discriminator is proposed to keep the generated image consistent with the textual input. Though the results still look primitive, one can already draw different flower images by altering the textual input.
Stacked Architecture. Another big advancement is StackGAN [Reference Zhang55,Reference Zhang56], where stacked generators are introduced for high-resolution image generation. Different from previous works, StackGAN can generate realistic $256\times 256$-pixel images by decomposing the generator into multiple stages stacked sequentially. The Stage-I network only sketches the primitive shape and color of the object based on the text representation, yielding a low-resolution image. The Stage-II network then fills in details, such as textures, conditioned on the Stage-I result. A conditioning augmentation technique is also introduced to augment the textual input and stabilize the training process. Compared to [Reference Reed, Akata, Yan, Logeswaran, Schiele and Lee50], the visual quality is much improved by this stacked architecture. A similar idea is also adopted in Progressively-Growing GAN [Reference Karras, Aila, Laine and Lehtinen42].
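The conditioning augmentation step can be sketched as sampling the condition vector from a Gaussian predicted from the text embedding via the reparameterization trick; the linear heads and dimensions below are hypothetical stand-ins for the learned layers:

```python
import numpy as np

rng = np.random.default_rng(4)
d, out_dim = 128, 16
W_mu = rng.standard_normal((out_dim, d)) * 0.1     # hypothetical learned head
W_sigma = rng.standard_normal((out_dim, d)) * 0.1  # hypothetical learned head

def conditioning_augmentation(text_emb):
    # Predict a Gaussian over text conditions and sample from it with
    # the reparameterization trick; sampling (rather than using the raw
    # embedding) smooths the conditioning manifold during training.
    mu = W_mu @ text_emb
    log_sigma = W_sigma @ text_emb
    eps = rng.standard_normal(out_dim)
    return mu + np.exp(log_sigma) * eps

c_hat = conditioning_augmentation(rng.standard_normal(d))
print(c_hat.shape)  # (16,)
```

Because the same sentence yields slightly different condition vectors on every draw, the generator sees a denser neighborhood of each text embedding, which helps stabilize adversarial training.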
Attention Mechanism. As in other vision tasks, attention is effective at highlighting key information. In “language to vision,” attention is particularly useful for aligning keywords (language) and image patches (vision) during the generation process. Two generations (v1.0 and v2.0) of attention basically follow this paradigm but differ in many details, e.g. network architecture and text encoding. Attention 1.0, AlignDraw [Reference Mansimov, Parisotto, Ba and Salakhutdinov57], proposes to iteratively paint on a canvas by looking at different words at different stages. However, the results were not promising at that time. Attention 2.0, namely AttnGAN [Reference Xu58] and DA-GAN [Reference Ma, Fu, Chen and Mei59], follows a similar paradigm but improves significantly on image quality, e.g. in fine-grained details.
Semantic Layout. Recent studies [Reference Song, Zhang, Liu and Mei46,Reference Bau60,Reference Dong, Liang, Gong, Lai, Zhu and Yin61] have demonstrated the importance of semantic layout in image generation, where the layout acts as a blueprint to guide the generation process. In language to vision, semantic layouts and scene graphs are introduced to reshape the language input with more semantics. Hong et al. [Reference Hong, Yang, Choi and Lee62] propose to generate object bounding boxes first and then refine them by estimating the appearance inside each box. Johnson et al. [Reference Johnson, Gupta and Fei-Fei51] encode object relationships from a scene graph with graph convolutions to construct the layout for decoder generation. Zheng et al. [Reference Zheng, Bai, Zhang and Mei63] introduce a spatial constraint module and a contextual fusion module to model the relative scale and offset among objects for commonsense layout generation, and Hinz et al. [Reference Hinz, Heinrich and Wermter64] further propose an object pathway for multi-object generation with complex spatial layouts.
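As a toy illustration of what such a semantic layout looks like, the snippet below rasterizes hypothetical class-labeled bounding boxes onto a coarse grid of the kind a layout-conditioned generator could consume; the class ids and coordinates are invented for the example:

```python
import numpy as np

# Rasterize (class_id, y0, x0, y1, x1) boxes onto a coarse grid; later
# boxes overwrite earlier ones, so foreground objects are drawn last.
H = W = 8
layout = np.zeros((H, W), dtype=int)
boxes = [
    (1, 0, 0, 4, 4),   # e.g. "sky" region (hypothetical class 1)
    (2, 4, 0, 8, 8),   # e.g. "grass" region (hypothetical class 2)
    (3, 2, 3, 5, 6),   # e.g. "sheep", drawn last so it overwrites
]
for cls, y0, x0, y1, x1 in boxes:
    layout[y0:y1, x0:x1] = cls

print(layout)
```

A generator conditioned on such a grid only needs to fill in appearance within each region, which is why layout-based pipelines produce spatially reasonable multi-object scenes more reliably than direct text-to-pixel generation.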
C) Progress and applications
The development of “Language to Vision” can be summarized as follows. On one hand, the language description is becoming more complex, i.e. from simple words to long sentences. On the other hand, the vision part is also becoming more complex, where object interactions and fine-grained details are expected:
• Language: label $\rightarrow$ sentence $\rightarrow$ paragraph $\rightarrow$ scene graph
• Vision: single object $\rightarrow$ multiple objects
Language to Image. Early studies mainly focus on simple words and single-object images, e.g. birds [Reference Welinder65], flowers [Reference Nilsback and Zisserman66], and generic objects [Reference Russakovsky67]. As shown in Fig. 3 (bottom), the visual quality is much improved over the past few years, and some results are plausible enough to deceive human eyes.
Though single-object images can be well generated, multi-object scenes still fall short of realistic results, as in Fig. 3 (top). A general trend is to reduce the complexity by introducing a semantic layout as an intermediate representation. Roughly, machines can now generate spatially reasonable images, but fine-grained details are still far from satisfactory at the current stage.
Language to Video. Compared to image, language to video is more challenging due to the huge volume of information and the extra temporal constraints. Only a few works study this area. For example, Pan et al. [Reference Pan, Qiu, Yao, Li and Mei68] attempt to generate videos from captions based on 3D convolution operations. However, the results are still quite limited for practical applications.
Applications. The applications of “language to vision” can be roughly grouped into two categories: generation for human eyes and generation for machines. In certain domains (e.g. faces), language to vision has already started to produce highly plausible results up to industrial standards. For example, people can generate royalty-free facial photos on demand for games [Reference Shi, Yuan, Fan, Zou, Shi and Liu69] or commercials by manually specifying gender, hair, and eyes. The other direction is generating data for machines and algorithms. For example, NVIDIA [Reference Zheng, Yang, Yu, Zheng, Yang and Kautz70] proposed a large-scale synthetic dataset (DG-Market) for training person re-ID models. Some image recognition and segmentation models have also started to benefit from machine-generated training images. However, it is worth noting that despite the promising results, there is still a large gap to massive deployment in industrial products.
IV. CONCLUSION
Vision and language are two fundamental systems of human representation. Integrating the two in one intelligent system has long been an ambition in the AI field. As we have discussed in the paper, on one hand, vision to language is capable of understanding visual content and automatically producing a natural-language description; on the other hand, language to vision is able to characterize the intrinsic structure in vision data and create visual content according to language inputs. Such interactions, while still at an early stage, motivate us to understand the mechanisms connecting vision and language, reshape real-world applications, and rethink the end result of the integration.
Tao Mei is a Technical Vice President with JD.com and the Deputy Managing Director of JD AI Research, where he also serves as the Director of Computer Vision and Multimedia Lab. Prior to joining JD.com in 2018, he was a Senior Research Manager with Microsoft Research Asia in Beijing, China. He has authored or co-authored over 200 publications (with 12 best paper awards) in journals and conferences. He holds over 50 US and international patents. He is or has been an Editorial Board Member of IEEE Trans. on Image Processing, IEEE Trans. on Circuits and Systems for Video Technology, IEEE Trans. on Multimedia, ACM Trans. on Multimedia, Pattern Recognition, etc. He is a Fellow of IEEE (2019), a Fellow of IAPR (2016), a Distinguished Scientist of ACM (2016), a Distinguished Industry Leader of APSIPA (2019), and a Distinguished Industry Speaker of IEEE Signal Processing Society (2017).
Wei Zhang is a Senior Researcher at JD AI Research, Beijing, China. He received his Ph.D. degree from the Department of Computer Science at the City University of Hong Kong in 2015. He was a visiting scholar in the DVMM group of Columbia University, New York, NY, USA, in 2014, and was previously with the Chinese Academy of Sciences. His research interests include computer vision and visual object analysis. He won the runner-up prize in the TRECVID Instance Search task in 2012 and the Best Demo Award at the ACM-HK Open Day 2013. He serves as a guest editor for TOMM, and as a co-chair for ICME workshops and MMM special sessions.
Ting Yao is currently a Principal Researcher in Vision and Multimedia Lab at JD AI Research, Beijing, China. His team is focusing on the research and innovation of video understanding, vision and language, and deep learning. Prior to joining JD.com, he was a Researcher with Microsoft Research Asia in Beijing, China. Dr. Yao is an active participant of several benchmark evaluations. He is the principal designer of the top-performing multimedia analytic systems in international competitions such as COCO Image Captioning, Visual Domain Adaptation Challenge 2019 & 2018 & 2017, and ActivityNet Large Scale Activity Recognition Challenge 2019 & 2018 & 2017 & 2016. His works have led to many awards, including ACM SIGMM Outstanding Ph.D. Thesis Award 2015, ACM SIGMM Rising Star Award 2019, and IEEE TCMC Rising Star Award 2019. He is also an Associate Editor of IEEE Trans. on Multimedia.