Signal Processing - November 2017 - 75

[39] G. Lin, A. Milan, C. Shen, and I. Reid, "RefineNet: Multi-path refinement
networks for high-resolution semantic segmentation," in Proc. Conf. Computer
Vision and Pattern Recognition (CVPR), July 2017.
[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P.
Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in
Proc. European Conf. Computer Vision, 2014, pp. 740-755.

[66] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, "Weakly supervised memory
networks," arXiv Preprint, arXiv:1503.08895, 2015.
[67] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler,
"Movieqa: Understanding stories in movies through question-answering," in Proc.
IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 4631-4640.

[41] D. G. Lowe, "Object recognition from local scale-invariant features," in
Proc. IEEE Int. Conf. Computer Vision, 1999, vol. 2, pp. 1150-1157.

[68] D. Teney, P. Anderson, X. He, and A. van den Hengel, "Tips and tricks for visual
question answering: Learnings from the 2017 challenge," arXiv Preprint,
arXiv:1708.02711, 2017.

[42] J. Lu, X. Lin, D. Batra, and D. Parikh. (2015). Deeper lstm and normalized
CNN visual question answering model [Online]. Available: https://github.com/
VT-vision-lab/VQA_LSTM_CNN

[69] D. Teney, L. Liu, and A. van den Hengel, "Graph-structured representations for
visual question answering," in Proc. IEEE Conf. Computer Vision and Pattern
Recognition, 2017.

[43] J. Lu, J. Yang, D. Batra, and D. Parikh, "Hierarchical question-image coattention for visual question answering," in Proc. Advances Neural Information
Processing Systems, 2016, pp. 289-297.

[70] D. Teney and A. van den Hengel, "Zero-shot visual question answering," arXiv
Preprint, arXiv: 1611.05546. 2016.

[44] L. Ma, Z. Lu, and H. Li, "Learning to answer questions from image using
convolutional neural network," in Proc. 30th AAAI Conference on Artificial
Intelligence, 2016, pp. 3567-3573.
[45] M. Malinowski and M. Fritz, "A multi-world approach to question answering about real-world scenes based on uncertain input," in Proc. Advances Neural
Information Processing Systems, 2014, pp. 1682-1690.
[46] M. Malinowski, M. Rohrbach, and M. Fritz, "Ask your neurons: A neuralbased approach to answering questions about images," in Proc. IEEE Int. Conf.
Computer Vision, 2015, pp. 1-9.
[47] C. Matuszek, N. FitzGerald, L. Zettlemoyer, L. Bo, and D. Fox, "A joint
model of language and perception for grounded attribute learning," in Proc. Int.
Conf. Machine Learning, 2012, pp. 1671-1678.

[71] I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun, "Order-embeddings of images and
language," in Proc. Int. Conf. Learning Representations, 2016.
[72] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra,
"Matching networks for one shot learning," in Proc. Neural Information Processing
System (NIPS), 2016, pp. 3630-3638.
[73] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image
caption generator," in Proc. IEEE Conf. Computer Vision and Pattern Recognition,
2014, pp. 3156-3164.
[74] P. Wang, Q. Wu, C. Shen, and A. v d. Hengel, "The VQA-machine: Learning how
to use existing vision algorithms to answer new questions," arXiv Preprint,
arXiv:1612.05386, 2016.
[75] P. Wang, Q. Wu, C. Shen, A. v d. Hengel, and A. Dick, "Explicit knowledge-based
reasoning for visual question answering," arXiv Preprint, arXiv:1511.02570, 2015.

[48] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of
word representations in vector space," arXiv Preprint, arXiv:1301.3781, 2013.

[76] P. Wang, Q. Wu, C. Shen, A. v d. Hengel, and A. Dick, "FVQA: Fact-based visual
question answering," arXiv Preprint, arXiv:1606.05433, 2016.

[49] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed
representations of words and phrases and their compositionality," in Proc.
Advances in Neural Information Processing Systems, 2013, pp. 3111-3119.

[77] J. Weston, S. Chopra, and A. Bordes, "Memory networks," arXiv Preprint,
arXiv:11410.3916, 2015.

[50] K. W. Murray and J. Krishnamurthy, "Probabilistic neural programs," arXiv
Preprint, arXiv:1612.00712, 2016.
[51] H. Noh, P. H. Seo, and B. Han, "Image question answering using convolutional neural network with dynamic parameter prediction," in Proc. IEEE Conf.
Computer Vision Pattern Recognition, 2016, pp. 30-38.

[78] T. Winograd, "Understanding natural language," Cognit. Psychol., vol. 3, no. 1,
pp. 1-191, 1972.
[79] Q. Wu, C. Shen, A. v. d. Hengel, L. Liu, and A. Dick, "What value do explicit
high level concepts have in vision to language problems?" in Proc. IEEE Conf.
Computer Vision and Pattern Recognition, 2016, pp. 203-212.

[52] B. Peng, Z. Lu, H. Li, and K. Wong, "Toward neural network-based reasoning," arXiv Preprint, arXiv:1508.05508, 2015.

[80] Q. Wu, C. Shen, A. v d. Hengel, P. Wang, and A. Dick, "Image captioning and
visual question answering based on attributes and their related external knowledge,"
arXiv Preprint, arXiv:1603.02814, 2016.

[53] J. Pennington, R. Socher, and C. Manning, "Glove: global vectors for word
representation," in Proc. Conf. Empirical Methods Natural Language
Processing, 2014, pp. 1532-1543.

[81] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel, "Visual
question answering: a survey of methods and data sets," Computer Vision and Image
Understanding, to be published.

[54] S. K. Ramakrishnan, A. Pal, G. Sharma, and A. Mittal, "An empirical evaluation of visual question answering for novel objects," arXiv Preprint,
arXiv:1704.02516, 2017.

[82] Q. Wu, P. Wang, C. Shen, A. Dick, and A. v. d. Hengel, "Ask me anything: Freeform visual question answering based on knowledge from external sources," in Proc.
IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 4622-4630.

[55] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "Cnn features
off-the-shelf: An astounding baseline for recognition," in Proc. IEEE Conf.
Computer Vision and Pattern Recognition Workshops, 2014, pp. 806-813.

[83] C. Xiong, S. Merity, and R. Socher, "Dynamic memory networks for visual and
textual question answering," in Proc. Int. Conf. Machine Learning, 2016, pp. 2397-
2406.

[56] S. E. Reed and N. de Freitas, "Neural programmer-interpreters," in Proc.
Int. Conf. Learning Representations, 2016.

[84] C. Xiong, V. Zhong, and R. Socher, "Dynamic coattention networks for question
answering," arXiv Preprint, arXiv:1611.01604, 2016.

[57] M. Ren, R. Kiros, and R. Zemel, "Image question answering: a visual
semantic embedding model and a new data set," in Proc. Advances Neural
Information Processing Systems, 2015.

[85] H. Xu and K. Saenko, "Ask, attend and answer: exploring question-guided spatial
attention for visual question answering," arXiv Preprint, arXiv:1511.05234, 2015.

[58] R. A. Rensink, "The dynamic representation of scenes," Visual Cognition,
vol. 7, no. 1-3, pp. 17-42, 2000.

[86] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio,
"Show, attend and tell: neural image caption generation with visual attention," in Proc.
Int. Conf. Machine Learning, 2015, pp. 2048-2057.

[59] D. Roy, K.-Y. Hsiao, and N. Mavridis, "Conversational robots: building
blocks for grounding word meaning," in Proc. HLT-NAACL Workshop on
Learning Word Meaning Non-Linguistic Data, 2003, pp. 70-77.

[87] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, "Stacked attention networks for
image question answering," in Proc. IEEE Conf. Computer Vision and Pattern
Recognition, 2016, pp. 21-29.

[60] K. Saito, A. Shin, Y. Ushiku, and T. Harada, "Dualnet: Domain-invariant
network for visual question answering," arXiv Preprint, arXiv:1606.06108,
2016.

[88] X. Yao and B. Van Durme, "Information extraction over structured data: Question
answering with freebase," in Proc. Conf. Association Computational Linguistics,
2014, pp. 956-966.

[61] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. P. Lillicrap,
"Meta-learning with memory-augmented neural networks," in Proc. Int. Conf.
Machine Learning, 2016, vol. 48, pp. 1842-1850.

[89] K.-H. Zeng, T.-H. Chen, C.-Y. Chuang, Y.-H. Liao, J. C. Niebles, and M. Sun,
"Leveraging video descriptions to learn video question answering," in Proc. Conf
Artificial Intelligence AAAI, 2017, pp. 4334-4340.

[62] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P.
Battaglia, and T. Lillicrap, "A simple neural network module for relational reasoning," arXiv Preprint, arXiv:1706.01427, 2017.

[90] P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh, "Yin and yang:
Balancing and answering binary visual questions," in Proc. IEEE Conf. Computer
Vision and Pattern Recognition, 2016, pp. 5014-5022.

[63] M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, "Bidirectional attention flow for machine comprehension," arXiv Preprint, arXiv:1611.01603, 2016.

[91] L. Zhu, Z. Xu, Y. Yang, and A. G. Hauptmann, "Uncovering temporal context for video question and answering," arXiv Preprint, arXiv:1511.04670, 2015.

[64] P. Sermanet, A. Frome, and E. Real, "Attention for fine-grained categorization," arXiv Preprint, arXiv:1412.7054, 2014.

[92] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei, "Visual7W: Grounded
question answering in images," in Proc. IEEE Conf. Computer Vision and
Pattern Recognition, 2016, pp. 4995-5004.

[65] K. J. Shih, S. Singh, and D. Hoiem, "Where to look: Focus regions for visual question answering," in Proc. IEEE Conf. Computer Vision Pattern
Recognition, 2016, pp. 4613-4621.

IEEE SIGNAL PROCESSING MAGAZINE

SP



|

November 2017

|

75
https://www.github.com/
Table of Contents for the Digital Edition of Signal Processing - November 2017