Signal Processing - November 2017 - 100

rising popularity of DNNs, including models that implement multimodal fusion in biomedical applications involving
genomic, proteomic, and drug data.
Two major challenges in applying deep-learning-based
approaches to medical applications are 1) the difficulty of obtaining sufficient labeled data and 2) the problem of class imbalance (where negative examples far outnumber
positive examples). To overcome the first challenge, early
approaches resorted to patch-based training [22]. Recently, techniques that utilize transfer learning have been surprisingly successful [23]. This involves reusing a portion of the data-agnostic
representations learned by very deep networks on a separate, large
data set, for example, ImageNet, and then fine-tuning or retraining only the upper layers of the network using a much smaller
medical data set. Another common technique is to perform training data augmentation, for example, applying different affine
transformations or randomly perturbing the brightness and contrast of images to increase the amount of training data available.
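
As a minimal sketch of the augmentation idea just described, the snippet below applies a random horizontal flip (one simple affine transformation) and a random brightness and contrast perturbation to a small grayscale image stored as nested lists with values in [0, 1]. The function names and the perturbation ranges are illustrative assumptions, not taken from the cited work.

```python
import random


def flip_horizontal(image):
    """Mirror a 2-D grayscale image (a list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def adjust_brightness_contrast(image, brightness=0.0, contrast=1.0):
    """Scale each pixel around mid-gray, shift it, and clip to [0, 1]."""
    return [
        [min(1.0, max(0.0, (p - 0.5) * contrast + 0.5 + brightness)) for p in row]
        for row in image
    ]


def augment(image, rng):
    """Return one randomly perturbed copy of the image."""
    out = flip_horizontal(image) if rng.random() < 0.5 else [list(r) for r in image]
    brightness = rng.uniform(-0.1, 0.1)  # assumed range, for illustration only
    contrast = rng.uniform(0.8, 1.2)     # assumed range, for illustration only
    return adjust_brightness_contrast(out, brightness, contrast)


image = [[0.0, 0.5, 1.0], [0.2, 0.4, 0.6]]
rng = random.Random(0)
augmented = [augment(image, rng) for _ in range(4)]  # four extra training samples
```

Each call to `augment` yields a new variant of the same labeled example, which is how the effective size of the training set grows.
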
To address the data imbalance problem, it is common to apply
some form of weighting to the loss function such that mistakes
made on the majority class are penalized less than mistakes
made on the minority class. These challenges, although
very common in the medical domain, may also occur
in other domains, and the solutions are equally applicable there. However, despite the success of deep learning in medical
applications, the medical community is still rather apprehensive
about deploying such models in the real world, as deep learning is often
seen as opaque. This view is likely to gradually change given the
increasing efforts to design interpretable DNNs [24], [25].
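
The loss weighting mentioned above for class imbalance can be sketched as follows. This is a minimal illustration, assuming binary labels, inverse-frequency class weights, and a weighted mean of the binary cross-entropy; the weighting scheme and the function names are our assumptions, and other schemes are equally possible.

```python
import math


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_bce(probs, labels, weights):
    """Weighted mean binary cross-entropy; probs are predicted P(y = 1)."""
    eps = 1e-12  # guards against log(0)
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        p_true = p if y == 1 else 1.0 - p
        total += -weights[y] * math.log(max(p_true, eps))
        weight_sum += weights[y]
    return total / weight_sum


labels = [0] * 9 + [1]            # nine negatives, one positive
weights = class_weights(labels)   # the minority class receives the larger weight

# One confident mistake in each case, with all other predictions correct.
miss_minority = [0.1] * 9 + [0.1]        # the single positive is missed
miss_majority = [0.9] + [0.1] * 8 + [0.9]  # one negative is misclassified

loss_minority = weighted_bce(miss_minority, labels, weights)
loss_majority = weighted_bce(miss_majority, labels, weights)
```

With uniform weights the two cases would incur the same loss; with the weights above, missing the single positive example is penalized more heavily than misclassifying one of the nine negatives.
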

Autonomous systems
Following the success of deep learning, there has been a surge
of interest in autonomous navigation (also known as autonomous
driving) applications, which typically involve multimodal data
acquired from sensors mounted on the vehicle. A self-driving car,
for example, could carry a number of external and internal sensors, including radar, stereoscopic visible-light cameras, LIDAR,
infrared (IR) cameras, global positioning system (GPS), and
audio. To perform autonomous navigation, the heterogeneous data
collected from sensors are used for learning a number of interrelated but complex tasks such as localization and mapping, scene
understanding, movement planning, and driver-state recognition.
One of the biggest challenges for autonomous navigation
is the dynamic nature of the operating environment: the system has to adapt and react to weather variations, lighting
variability, pedestrians and other traffic, road conditions, and
traffic signs, as well as the driver's state. Nevertheless, deep-learning and reinforcement-learning techniques [26] have been
instrumental in advancing this field of application with industry
players like Uber, Nvidia, Baidu, and Tesla actively involved in
the development of commercial self-driving cars.
An important task in autonomous driving is real-time
scene understanding. It requires the learning system to recognize objects in the scene, like lanes, traffic signs, pedestrians, and other traffic. It follows that, for each frame of the
multimodal video feed, semantic segmentation has to first be
performed. Each semantic concept identified in the scene then
has to be localized. For such tasks, deep fully convolutional
architectures that perform pixel-wise labeling of each frame
are often used [27]. For multimodal inputs, a common strategy is to concatenate synchronized frames along the channel
dimension before feeding them to a fully convolutional neural
network (CNN) (this is, in a sense, early fusion) or, alternatively, to train separate modality-wise networks and then fuse
at a deeper stage in the multimodal network. We further discuss such fusion strategies in the section "Fusion Structure."
Semantic segmentation can be extended to video by using a
3-D variant of fully convolutional networks. The basic techniques used in self-driving vehicle technology can be extended to other robotic
applications such as mobile robots or drone navigation, grasp
configuration learning [28], and robotic manipulation [29].
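
To make the two fusion strategies for multimodal frames concrete, the following sketch contrasts them on toy data. Frames are nested lists of shape height x width x channels. Early fusion concatenates the channels of synchronized frames; for the alternative, we use a simple average of per-pixel class scores from two modality-wise networks as a stand-in for fusing at a deeper stage. The function names, the toy values, and the choice of score averaging are illustrative assumptions rather than the method of any cited work.

```python
def early_fusion(frame_a, frame_b):
    """Concatenate two H x W x C frames along the channel axis."""
    if len(frame_a) != len(frame_b) or any(
        len(ra) != len(rb) for ra, rb in zip(frame_a, frame_b)
    ):
        raise ValueError("frames must share the same height and width")
    return [
        [pa + pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


def late_fusion(scores_a, scores_b):
    """Average per-pixel class scores and return the arg-max label per pixel."""
    labels = []
    for row_a, row_b in zip(scores_a, scores_b):
        label_row = []
        for sa, sb in zip(row_a, row_b):
            fused = [(x + y) / 2.0 for x, y in zip(sa, sb)]
            label_row.append(max(range(len(fused)), key=fused.__getitem__))
        labels.append(label_row)
    return labels


# A 1 x 2 RGB frame and a synchronized 1 x 2 single-channel depth frame.
rgb = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]]
depth = [[[0.9], [0.8]]]
fused_input = early_fusion(rgb, depth)  # each pixel now carries 4 channels

# Hypothetical per-pixel scores for two classes from two separate networks.
scores_rgb = [[[0.7, 0.3], [0.4, 0.6]]]
scores_depth = [[[0.6, 0.4], [0.1, 0.9]]]
pixel_labels = late_fusion(scores_rgb, scores_depth)
```

In the early-fusion case a single network sees the four-channel input, whereas in the second case each modality keeps its own network and only the outputs are combined.
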

Summary
We have highlighted three major areas where deep multimodal learning approaches have gained a foothold and continue to
experience rapid advancements. In addition to the work already
highlighted with respect to each of these key areas, we list additional representative work involving deep multimodal learning in
Table 3. Several other application areas that involve text, images,
and video, e.g., visual question answering (VQA) and image and
video annotations, are highlighted in subsequent sections where
we discuss specific deep-learning models.

Models
Applying multimodal deep learning to a new problem involves
the selection of both an architecture and a learning algorithm.
Together, we will call these choices a model. A plethora of
different deep-learning models have been proposed for multimodal data. While there are several ways that they could be
partitioned and organized for review, we have opted to group
notable examples according to their learning paradigm, specifically whether they are generative, discriminative, or hybrid
models. We chose this categorization because the learning paradigm
determines the architectures and learning algorithms from which to select.
Generative models implicitly or explicitly represent a data
distribution, often allowing new data to be sampled, or "generated," from it, hence their name. Discriminative
models, on the other hand, are less ambitious. Rather than
modeling distributions, they attempt to model class boundaries.
In the supervised learning setup, where we have data X and
labels Y, generative models learn the joint probability P(X, Y).
In contrast, discriminative models are used primarily for prediction tasks, and these models learn the conditional P(Y | X).
Yet generative models can still be used discriminatively, since P(Y | X) can be obtained from P(X, Y) by normalizing over Y.
An advantage of generative models is that they are much more
flexible. For example, P(X, Y) can be sampled in the case of
missing modalities during inference.
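
The flexibility of a joint model under missing modalities can be illustrated with a toy discrete example. Below, a hypothetical joint distribution over two modalities (x1, x2) and a binary label y is stored as a table. The posterior over y is obtained by normalizing the joint, with any unobserved modality summed out, and a missing modality can also be sampled given the observed one. All values and names are invented for illustration; a deep generative model would represent the distribution implicitly rather than as a table.

```python
import random

# Hypothetical joint P(x1, x2, y); the entries sum to 1.
joint = {
    ("a", "p", 0): 0.30, ("a", "q", 0): 0.10,
    ("b", "p", 0): 0.05, ("b", "q", 0): 0.05,
    ("a", "p", 1): 0.05, ("a", "q", 1): 0.05,
    ("b", "p", 1): 0.10, ("b", "q", 1): 0.30,
}


def posterior(joint, x1=None, x2=None):
    """Return P(y | observed modalities), summing out any modality set to None."""
    mass = {}
    for (a, b, y), p in joint.items():
        if (x1 is None or a == x1) and (x2 is None or b == x2):
            mass[y] = mass.get(y, 0.0) + p
    total = sum(mass.values())
    return {y: p / total for y, p in mass.items()}


def sample_missing_x2(joint, x1, rng):
    """Draw a value for the missing modality x2 from P(x2 | x1)."""
    mass = {}
    for (a, b, _y), p in joint.items():
        if a == x1:
            mass[b] = mass.get(b, 0.0) + p
    values = list(mass)
    return rng.choices(values, weights=[mass[v] for v in values], k=1)[0]


both_observed = posterior(joint, x1="b", x2="q")  # uses both modalities
x2_missing = posterior(joint, x1="b")             # x2 is summed out
imputed_x2 = sample_missing_x2(joint, "b", random.Random(0))
```

A purely discriminative model of P(Y | X) trained on complete inputs offers no comparable mechanism; it would need the missing modality to be imputed by some other means.
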

Discriminative models
Discriminative deep architectures directly model the mapping
from inputs to outputs, and the model parameters are learned

IEEE SIGNAL PROCESSING MAGAZINE

November 2017
