Signal Processing - November 2017 - 106

these data sets typically involve person-centric visual understanding, with variants including emotion recognition, group behavior
analysis, etc. Table 2 lists a number of such data sets, the modalities
involved, and the problem domain. While this list is not exhaustive, we cover more recent data sets (many of which were released
in the past three years) that are available for multimodal research.
While most data sets include at least two modalities (images and
text, for example) or up to four (RGB-D, audio, and skeletal pose),
some data sets, for example, H-MOG [12], include up to nine different modalities. For the interested reader, Firman [90] presents
an extensive survey of 102 RGB-D data sets. Autonomous driving
and driver assistance systems (using driver behavior prediction)
are popular research topics in deep learning.
Data sets in this area are not only highly multimodal [91], with data
from up to six individual sensors, but also very large, with hours of
data available. The Oxford RobotCar [92] data set, for example,
contains more than 23 TB of year-long driving data in various
weather conditions.
We note that relatively few multimodal medical
data sets are available, possibly because of cost and of ethical and privacy concerns. Most medical data sets also tend to be much smaller,
involving between ten and 50 subjects, and suffer from severe
class imbalance (for example, normal cases are much more common
than abnormal cases). Medical informatics and imaging studies rely heavily on multimodal information, and this can be leveraged to improve computer-aided diagnosis. Efforts to gather such data sets and make them publicly available
are encouraged.

Conclusions and future directions
In this article, we have reviewed recent advancements in deep
multimodal learning. It is undeniable that the incorporation of
multiple modalities into the learning problem almost always
results in much better performance for a wide range of problems. From a fusion perspective, we see that techniques in deep
multimodal learning can be classified into early- and late-fusion
approaches and that deep-learning methods facilitate a flexible
intermediate-fusion approach, which not only makes it simpler to
fuse modality-wise representations and learn a joint representation but also allows multimodal fusion at various depths in the
architecture. Although deep learning has, in many cases, reduced
the need for feature engineering, deep-learning architectures still
involve a great deal of manual design, and experimenters may not
have explored the full space of possible fusion architectures. It is
only natural that researchers should extend the notion of learning to architectures in an effort to have a truly generic learning
method, which can be adapted, with minimal or no human intervention, to a specific task.
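The three fusion strategies named above can be contrasted in a minimal sketch. Everything in it is an illustrative assumption rather than an architecture from this article: the two modalities (`x_audio`, `x_video`), the layer sizes, the random untrained weights, and the helper names `make_mlp` and `mlp` are hypothetical. Late fusion is shown as simple score averaging, which is only one of several possible decision-level rules.

```python
import random

def make_mlp(sizes, rng):
    """Create randomly initialized (weights, biases) pairs for consecutive layer sizes."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((w, [0.0] * n_out))
    return layers

def mlp(x, layers, final_relu=False):
    """Forward pass with ReLU after every layer except (optionally) the last."""
    h = x
    for i, (w, b) in enumerate(layers):
        h = [sum(wi * hi for wi, hi in zip(row, h)) + bi for row, bi in zip(w, b)]
        if final_relu or i < len(layers) - 1:
            h = [max(0.0, v) for v in h]
    return h

def early_fusion(x_a, x_b, joint_net):
    """Concatenate the raw inputs and feed them to a single network."""
    return mlp(x_a + x_b, joint_net)

def late_fusion(x_a, x_b, net_a, net_b):
    """Run one network per modality and average the output scores."""
    y_a, y_b = mlp(x_a, net_a), mlp(x_b, net_b)
    return [0.5 * (a + b) for a, b in zip(y_a, y_b)]

def intermediate_fusion(x_a, x_b, enc_a, enc_b, head):
    """Encode each modality separately, concatenate the hidden
    representations, and let a shared head learn a joint representation."""
    h_a = mlp(x_a, enc_a, final_relu=True)
    h_b = mlp(x_b, enc_b, final_relu=True)
    return mlp(h_a + h_b, head)

rng = random.Random(0)
x_audio = [0.2, -0.1, 0.4, 0.0]   # hypothetical 4-dimensional modality
x_video = [1.0, 0.5, -0.3]        # hypothetical 3-dimensional modality

joint_net = make_mlp([7, 5, 2], rng)     # early fusion: 4 + 3 inputs
audio_net = make_mlp([4, 5, 2], rng)     # late fusion: one net per modality
video_net = make_mlp([3, 5, 2], rng)
audio_enc = make_mlp([4, 5], rng)        # intermediate fusion: encoders
video_enc = make_mlp([3, 5], rng)
shared_head = make_mlp([10, 5, 2], rng)  # operates on 5 + 5 hidden units

y_early = early_fusion(x_audio, x_video, joint_net)
y_late = late_fusion(x_audio, x_video, audio_net, video_net)
y_mid = intermediate_fusion(x_audio, x_video, audio_enc, video_enc, shared_head)
```

In this sketch, the depth at which fusion happens is set by how many layers sit in the per-modality encoders before the concatenation in `intermediate_fusion`; adding or removing encoder layers moves the fusion point deeper or shallower.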
We reviewed several options for learning an optimal architecture. These include stochastic regularization, casting architecture
optimization as a hyperparameter optimization problem using,
for example, Bayesian optimization (BO), and incremental online reinforcement learning.
This is, in our opinion, the most exciting area of research for deep
multimodal learning. Architecture learning can be extremely
compute-intensive, so researchers should take advantage of
advances in hardware acceleration and distributed deep learning.
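To make the idea of casting architecture optimization as hyperparameter optimization concrete, the sketch below treats the fusion depth and layer width as entries in a search space. It is only an illustration under stated assumptions: plain random search is substituted for BO to keep the example dependency-free, the search space is hypothetical, and `toy_validation_error` is an invented stand-in for the expensive step of training a model and measuring its validation error.

```python
import random

# Hypothetical search space; none of these values come from the article.
SEARCH_SPACE = {
    "fusion_layer": [0, 1, 2, 3],         # depth at which modalities merge (0 = early)
    "hidden_units": [32, 64, 128, 256],
    "dropout": [0.0, 0.25, 0.5],
}

def sample_config(rng):
    """Draw one architecture configuration uniformly from the search space."""
    return {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}

def toy_validation_error(config):
    """Invented placeholder objective. In practice this would train the
    network described by `config` and return its validation error."""
    return (0.1
            + 0.05 * abs(config["fusion_layer"] - 2)
            + 0.1 * abs(config["hidden_units"] - 128) / 128
            + 0.2 * abs(config["dropout"] - 0.25))

def random_search(objective, n_trials, seed=0):
    """Evaluate n_trials random configurations and keep the best one.
    Random search is used here in place of BO, which would instead fit a
    surrogate model to past trials to choose the next configuration."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = objective(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = random_search(toy_validation_error, n_trials=50)
```

Each call to the objective corresponds to a full training run in a real setting, which is why the cost of architecture learning grows quickly with the number of trials.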

We have also identified several application domains that are
gaining the most attention in deep multimodal learning. These
include RGB-D data and data from the multitude of sensors on
mobile phones, which have been used for a range of multimodal problems
such as human activity recognition and its variants. We foresee that this area will gain more
attention in the coming years for novel applications, which will
profoundly impact our daily lives. Another important area highlighted is medical research, which involves numerous modalities of data, some of which are very difficult to interpret without human experts in the loop. With the medical community
opening up to the rise of artificial intelligence-assisted diagnosis, we will see more significant progress being made in this
domain. Finally, two more application areas that are gaining
the attention of deep-learning researchers are autonomous
vehicles or robotics and multimedia applications such as
video transcription and image captioning. Novel applications,
such as online chatbots that accept multimodal inputs (for example, images
and text) or recommender systems that utilize multimodal data,
may become widespread in the near future.
We conclude by acknowledging that this is very much
a fast-evolving field and that, given the rate at which new
research is being published, many innovations in deep
multimodal learning architectures are bound to be presented.
We have tried not to give specific suggestions for architecture design, as we found that many problems require application-specific considerations. Regardless, we feel this is a timely
publication, as the directions for future research that we have
highlighted can, we hope, act as a guide toward a more organized effort to advance the research field.

Authors
Dhanesh Ramachandram (dramacha@uoguelph.ca) received
his B.Tech degree in industrial technology and his Ph.D.
degree in computer vision and robotics from the Universiti
Sains Malaysia in 1997 and 2003, respectively, where he was
formerly an associate professor. He is a researcher at the
University of Guelph, Ontario, Canada, and a Senior Member
of the IEEE. He is interested in deep learning for computer
vision, medical imaging, and multimodal problems.
Graham W. Taylor (gwtaylor@uoguelph.ca) received his
bachelor's and master's degrees in applied science
from the University of Waterloo, Canada, in 2003 and 2004,
respectively. He received his Ph.D. degree in computer science
from the University of Toronto, Canada, in 2009, where his thesis coadvisors were Geoffrey Hinton and Sam Roweis. He is an
associate professor at the University of Guelph, Ontario, Canada,
a member of the Vector Institute for Artificial Intelligence, and a
Canadian Institute for Advanced Research Azrieli Global
Scholar. He is interested in statistical machine learning and biologically inspired computer vision, with an emphasis on unsupervised learning and time-series analysis.

References

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553,
pp. 436-444, 2015.

IEEE SIGNAL PROCESSING MAGAZINE
