Signal Processing - November 2017 - 116

[23] K. Simonyan and A. Zisserman, "Very deep convolutional networks for largescale image recognition," in Proc. Comput. Sci. Conf., 2014.
[24] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K.
Saenko, "Translating videos to natural language using deep recurrent neural networks," in Proc. Conf. North American Chapter Association Computational
Linguistics: Human Language Technologies, 2015, pp. 1494-1505.
[25] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K.
Saenko, "Sequence to sequence-video to text," in Proc. Int. Conf. Computer Vision,
2015, pp. 4534-4542.
[26] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. "Show and tell: A neural
image caption generator," in Proc. Conf. Computer Vision and Pattern
Recognition, 2015, pp. 3156-3164.
[27] J. Johnson, A. Karpathy, and L. Fei-Fei, "Densecap: Fully convolutional localization networks for dense captioning," in Proc. IEEE Conf. Computer Vision and
Pattern Recognition (CVPR), 2015, pp. 4565-4574.
[28] Q. Wu, C. Shen, L. Liu, A. Dick, and A. v d. Hengel, "What value do explicit
high level concepts have in vision to language problems?" in Proc. Conf. Computer
Vision and Pattern Recognition, 2016, pp. 203-212.
[29] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and
Y. Bengio, "Show, attend and tell: Neural image caption generation with visual
attention," in Proc. Int. Conf. Machine Learning, 2015.
[30] Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, and W. W. Cohen, "Review networks for caption generation," in Proc. Conf. Neural Information Processing
Systems, 2016.
[31] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic
attention," in Proc. Conf. Computer Vision and Pattern Recognition, 2016, pp.
4651-4659.
[32] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu, "Video paragraph captioning
using hierarchical recurrent neural networks," in Proc. Conf. Computer Vision and
Pattern Recognition, 2016, pp. 4584-4593.
[33] K. Tran, X. He, L. Zhang, J. Sun, C. Carapcea, C. Thrasher, C. Buehler, and,
and C. Sienkiewicz, "Rich image captioning in the wild. Deep Vision Workshop," in
Proc. Conf. Computer Vision and Pattern Recognition, 2016, pp. 434-441.
[34] S. Wu, J. Wieland, O. Farivar, and J. Schiller. "Automatic Alt-text: Computergenerated image descriptions for blind users on a social network service," in Proc.
20th ACM Conf. Computer Supported Cooperative Work and Social Computing,
2017.

[49] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to
visual denotations: New similarity metrics for semantic inference over event descriptions," in Proc. Association Computational Linguistics, vol. 2, 2014, pp. 67-78.
[50] D. Elliott and F. Keller, "Comparing automatic evaluation measures for image
description," in Proc. 52nd Annu. Meeting Association Computational Linguistics,
2014, pp. 452-457.
[51] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D.
Ramanan, C. Lawrence Zitnick, and P. Dollár, "Microsoft COCO: Common objects in
context," in Proc. European Conf. Computer Vision, 2015.
[52] Y. Cui, M. R. Ronchi, T.-Y. Lin, P. Dollár, L. Zitnick. (2015) COCO captioning
challenge. [Online]. Available: http://mscoco.org/dataset/#captions-challenge
[53] Microsoft Cognitive Services Computer Vision API. [Online]. Available: https://
www.microsoft.com/cognitive-services/en-us/computer-vision-api
[54] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, "Stacked attention networks for
image question answering," in Proc. Conf. Computer Vision and Pattern Recognition,
2016, pp. 21-29.
[55] A. Agrawal, J. Lu, S. Antol, M. Mitchell, L. Zitnick, D. Batra, and D. Parikh,
"VQA: Visual question answering," in Proc. Int. Conf. Computer Vision, 2015, pp.
2425-2433.
[56] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. F. Moura, D. Parikh, and
D. Batra, "Visual dialog," in Proc. Conf. Computer Vision and Pattern Recognition,
2017.
[57] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas,
"StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial
networks," in Proc. Int. Conf. Computer Vision, 2017.
[58] T.-H. (K. ). Huang, F. Ferraro, N. Mostafazadeh, I. Misra, A. Agrawal, J. Devlin,
R. Girshick, X. He, P. Kohli, D. Batra, C. Lawrence Zitnick, D. Parikh, L.
Vanderwende, M. Galley, and M. Mitchell, "Visual storytelling," in Proc. 2016 Conf.
North American Chapter Association Computational Linguistics: Human Language
Technologies, 2016, pp. 1233-1239.
[59] G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Trans. Audio, Speech,
Lang. Process., vol. 20, pp. 30-42, Jan. 2012.

[35] C. Shallue. (2016). Open-source code on show and tell: A neural image caption
generator. [Online]. Available: https://github.com/tensorflow/models/tree/master/
im2txt

[60] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, and N. Jaitly, A, "Deep
neural networks for acoustic modeling in speech recognition," IEEE Signal Process.
Mag., vol. 29, pp. 82-97, Dec. 2012.

[36] L. Deng and D. Yu, Deep Learning: Methods and Applications, NOW
Publishers, 2014.

[61] K. Koenigsbauer, Microsoft Office Blogs. (2016). [Online]. Available: https://
blogs.office.com/2016/12/20/new-to-office-365-in-december-accessibility-updatesand-more/

[37] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with
neural networks," in Proc. Conf. Neural Information Processing Systems, 2014.
[38] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly
learning to align and translate," in Proc. Int. Conf. Learning Representations,
2015.
[39] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Gated feedback recurrent neural networks," in Proc. Int. Conf. Machine Learning, 2015.
[40] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural
Comput., vol. 98, pp. 1735-1780 1997.
[41] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A
large-scale hierarchical image database," in Proc. Conf. Computer Vision and
Pattern Recognition, 2009, pp. 248-255.
[42] H. O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell,
"On learning to localize objects with minimal supervision," in Proc. Int. Conf.
Machine Learning, 2014.
[43] C. Zhang, J. C. Platt, and P. A. Viola, "Multiple instance boosting for object
detection," in Proc. Conf. Neural Information Processing Systems, 2005.
[44] S. Banerjee and A. Lavie. "METEOR: An automatic metric for MT evaluation
with improved correlation with human judgments," in Proc. ACL Workshop on
Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or
Summarization, 2005.
[45] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "Bleu: a method for automatic evaluation of machine translation," in Proc. 40th Annu. Meeting Association
Computational Linguistics, 2002, pp. 311-318.
[46] R. Vedantam, L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image
description evaluation," in Proc. European Conf. Computer Vision, 2015, pp.
4566-4575.
[47] P. Anderson, B. Fernando, M. Johnson, and S. Gould, "SPICE: Semantic propositional image caption evaluation," in Proc. European Conf. Computer Vision,
2016.

116

[48] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, "Collecting
image annotations using Amazon's mechanical turk," in Proc. NAACL HLT
Workshop Creating Speech and Language Data with Amazon's Mechanical
Turk, 2010.

[62] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang, "A Siamese long short-term
memory architecture for human re-identification," in Proc. European Conf. Computer
Vision, 2016.
[63] J. Liu, A. Shahroudy, D. Xu, and G. Wang, "Spatio-temporal LSTM with trust
gates for 3D human action recognition," in Proc. European Conf. Computer Vision,
2016.
[64] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang,
"Bottom-up and top-down attention for image captioning and VQA," arXiv Preprint,
arXiv:1707.07998.
[65] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, "Deep reinforcement learningbased image captioning with embedding reward," in Proc. Conf. Computer Vision and
Pattern Recognition, 2017.
[66] K. Lin, D. Li, X. He, Z. Zhang, and M.-T. Sun, "Adversarial ranking for language
generation," arXiv Preprint, arXiv:1705.11001
[67] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, "Self-critical
Sequence Training for Image Captioning," in Proc. Conf. Computer Vision and
Pattern Recognition, 2017.
[68] L. Yu, W. Zhang, J. Wang, and Y. Yu, "SeqGAN: Sequence generative adversarial
nets with policy gradient," in Proc. Association Advancement Artificial Intelligence,
2017.
[69] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA:
MIT Press 2016.
[70] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel, "Visual
question answering: A survey of methods and data sets," in Computer Vision and
Image Understanding. Elsevier, 2017.
[71] Seeing AI. [Online]. Available: https://www.microsoft.com/en-us/seeing-ai/
[72] S. Reed, Z. Akata, X. Yan, L. Logeswaran, H. Lee, and B. Schiel "Generative
adversarial text to image synthesis," in Proc. Int. Conf. Machine Learning, 2016.

SP



IEEE SIGNAL PROCESSING MAGAZINE

|

November 2017

|
http://www.mscoco.org/dataset/#captions-challenge http://https:// http://www.microsoft.com/cognitive-services/en-us/computer-vision-api https://www.github.com/tensorflow/models/tree/master/ http://https:// http://blogs.office.com/2016/12/20/new-to-office-365-in-december-accessibility-updates https://www.microsoft.com/en-us/seeing-ai/
Table of Contents for the Digital Edition of Signal Processing - November 2017