improve the performance of the system and stimulate new research in deep visual understanding.

Outlook
Image-to-text generation is an important interdisciplinary area spanning computer vision and natural language processing. It also forms the technical foundation of many important applications. Thanks to deep-learning technologies, we have seen significant progress in this area in recent years. In this article, we have reviewed the key developments the community has made and their impact on both research and industry deployment. Looking forward, image captioning will be a key subarea of the broader field of image-natural language multimodal intelligence. A number of new problems in this field have been proposed recently, including visual question answering [54], [55], [70], visual storytelling [58], visually grounded dialog [56], and image synthesis from text descriptions [57], [72]. Progress in multimodal intelligence is critical for building more general AI capabilities in the future, and we hope the overview provided in this article encourages students and researchers to enter and contribute to this exciting area of AI.

Authors
Xiaodong He (xiaohe@microsoft.com) received his bachelor's degree from Tsinghua University, Beijing, China, in 1996, his M.S. degree from the Chinese Academy of Sciences, Beijing, in 1999, and his Ph.D. degree from the University of Missouri-Columbia in 2003. He is a principal researcher in the Deep Learning Group of Microsoft Research, Redmond, Washington. He is also an affiliate professor in the Department of Electrical Engineering at the University of Washington, Seattle. His research interests are mainly in artificial intelligence, including deep learning, natural language processing, computer vision, speech, information retrieval, and knowledge representation. He has received several awards, including the Outstanding Paper Award at the 2015 Annual Meeting of the Association for Computational Linguistics (ACL). He has held editorial positions on several IEEE journals, was an area chair for the 2015 Conference of the North American Chapter of the ACL, and has served on the organizing and program committees of major speech and language processing conferences. He is a Senior Member of the IEEE.
Li Deng (l.deng@ieee.org) received his Ph.D. degree from the University of Wisconsin-Madison in 1987. He was an assistant professor (1989-1992), tenured associate professor (1992-1996), and full professor (1996-1999) at the University of Waterloo, Ontario, Canada. In 1999, he joined Microsoft Research, Redmond, Washington, where he currently leads the research and development of deep learning as a partner research manager of its Deep Learning Technology Center and where he is chief scientist of artificial intelligence. Since 2000, he has also been an affiliate full professor and graduate committee member at the University of Washington, Seattle. He is a Fellow of the IEEE, the Acoustical Society of America, and the International Speech Communication Association. He served on the Board of Governors of the IEEE Signal Processing Society (SPS) (2008-2010) and as editor-in-chief of IEEE Signal Processing Magazine (2009-2011), which earned the highest impact factor among all IEEE publications in 2010 and 2011 and for which he received the 2012 IEEE SPS Meritorious Service Award. He recently joined Citadel as its chief artificial intelligence officer.
