Australian Centre for Visual Technologies of the University
of Adelaide, where he works on computer vision and
machine learning. He was previously affiliated with
Carnegie Mellon University, Pittsburgh, Pennsylvania; the
University of Bath, United Kingdom; and the University of
Innsbruck, Austria.
Qi Wu (qi.wu01@adelaide.edu.au) received a bachelor's
degree in mathematical sciences from China Jiliang University, Hangzhou, and a master's degree in computer
science and a Ph.D. degree in computer vision from the
University of Bath, United Kingdom, in 2012 and 2015,
respectively. He is a postdoctoral researcher at the Australian
Centre for Robotic Vision of the University of Adelaide. His
research interests include cross-depiction object detection
and classification, attribute learning, neural networks, and
image captioning.
Anton van den Hengel (anton.vandenhengel@adelaide.edu.au) received his bachelor's degree in mathematical science in 1991, his bachelor of laws degree in 1993, his master's degree in computer science in 1994, and his Ph.D.
degree in computer vision in 2000, all from the University of
Adelaide, Australia, where he is a professor and the founding
director of the Australian Centre for Visual Technologies.
