Signal Processing - November 2017 - 105

depth of representation, is usually based on intuition (for example, fusing similar modalities early, and then fusing disparate
modalities at a deeper layer). When more than two modalities
are involved, choosing an optimal fusion architecture becomes more challenging and depends on the nature of the modalities
in the problem. A natural progression would
be to search for an optimal multimodal fusion architecture by
casting this as a model search or structure learning problem.
Neural network structure optimization for unimodal
problems has long been investigated by machine-learning
researchers. This work has mainly involved determining the optimal
number of neurons and layers in a network. There
is a tradeoff among the generalization ability of the network, the number of parameters, and the availability of training data. Too large a network may perform well or overfit,
depending on whether it is trained with sufficient data,
while too small a network may underfit and
generalize poorly.
A common approach is constructive and bottom-up.
The basic idea, proposed by Elman [73], is to start
with a relatively small network and add hidden units or layers
incrementally until the best performing architecture is found.
More recently, and in the large-scale setting, Chen et al. [74]
gradually added depth and width to an inception-style [75]
network by transferring knowledge from one neural network
to another.
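The constructive idea can be sketched as a simple growth loop. This is a minimal illustration, not the procedure of [73] or [74]: the callback `evaluate`, the stopping rule based on `patience`, and the toy scoring function are all assumptions introduced here.

```python
from typing import Callable, Tuple


def grow_network(evaluate: Callable[[int], float],
                 start_units: int = 2,
                 step: int = 2,
                 max_units: int = 64,
                 patience: int = 2) -> Tuple[int, float]:
    """Add hidden units until the validation score stops improving.

    `evaluate(n)` is a hypothetical callback that trains a network with
    n hidden units and returns a validation score (higher is better).
    """
    best_units, best_score = start_units, evaluate(start_units)
    stale = 0  # consecutive growth steps without improvement
    units = start_units + step
    while units <= max_units and stale < patience:
        score = evaluate(units)
        if score > best_score:
            best_units, best_score = units, score
            stale = 0
        else:
            stale += 1
        units += step
    return best_units, best_score


def toy_score(n: int) -> float:
    # Stand-in for "train and validate": the score peaks at 10 hidden units.
    return -(n - 10) ** 2


best = grow_network(toy_score)  # -> (10, 0)
```

In practice each call to `evaluate` is a full training run, which is why the cost of the search grows quickly with the number of candidate sizes.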
Pruning algorithms [76] address the same problem with
a top-down approach. Recent approaches for DNNs include
the works of Feng and Darrell [77], who proposed an evolving
grow-and-prune algorithm that optimizes the structure of an
Indian buffet process-CNN model, and Yang et al. [78], who
introduced network pruning for large, diverse data sets based
on sparse representations.
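As a rough illustration of the top-down view, the sketch below removes the smallest-magnitude weights from a trained layer. Magnitude is only one simple pruning criterion, chosen here as an assumption; it is not the method of [76], [77], or [78], and the function name and example weights are hypothetical.

```python
from typing import List


def prune_by_magnitude(weights: List[List[float]],
                       fraction: float) -> List[List[float]]:
    """Zero out roughly `fraction` of the weights with the smallest |w|.

    Ties at the threshold are all pruned, so slightly more than
    `fraction` of the weights may be removed.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(flat) * fraction)
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = flat[n_prune - 1]  # largest magnitude that gets removed
    return [[0.0 if abs(w) <= threshold else w for w in row]
            for row in weights]


W = [[0.9, -0.05, 0.3],
     [0.01, -0.7, 0.2]]
pruned = prune_by_magnitude(W, fraction=0.5)
# -> [[0.9, 0.0, 0.3], [0.0, -0.7, 0.0]]
```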
Genetic algorithms (GAs) were among the earliest
metaheuristic search methods
used for neural network structure search and optimization [79]. In the early 2000s, NeuroEvolution
of Augmenting Topologies (NEAT) [80], an algorithm that also
used GAs to evolve increasingly complex neural network
architectures, received much attention. More recently, Shinozaki and Watanabe [81] applied GAs and a covariance matrix
adaptation evolution strategy to optimize the structure of a DNN, parameterizing the structure of the DNN as a simple binary vector
based on a directed acyclic graph representation. Because the GA
search space can be very large and each model evaluation in
the search space is expensive, a parallel search on a large
GPU cluster was used to speed up the process.
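A binary-vector encoding of a directed acyclic graph, together with a basic GA loop, might look like the following. This is a sketch under stated assumptions and not the encoding or algorithm of [81]: the bit-to-edge mapping, truncation selection, one-point crossover, and the toy fitness (closeness to a fixed target graph, standing in for validation accuracy) are all illustrative choices.

```python
import random
from typing import Callable, List, Tuple

Genome = List[int]


def decode(genome: Genome, n_nodes: int) -> List[Tuple[int, int]]:
    """Map a binary vector to DAG edges (i, j) with i < j.

    Bit k switches on the k-th node pair in row-major order. Every edge
    points from a lower to a higher index, so no cycle can occur.
    """
    pairs = [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)]
    return [p for bit, p in zip(genome, pairs) if bit]


def evolve(fitness: Callable[[Genome], float], n_bits: int,
           pop_size: int = 20, generations: int = 30,
           p_mut: float = 0.05, seed: int = 0) -> Genome:
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]        # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)    # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit independently with probability p_mut.
            child = [bit ^ 1 if rng.random() < p_mut else bit
                     for bit in child]
            children.append(child)
        pop = parents + children              # parents survive (elitism)
    return max(pop, key=fitness)


# Toy fitness: number of bits that agree with a hypothetical target graph.
TARGET = [1, 0, 0, 1, 0, 1]                   # 4 nodes -> 6 possible edges


def toy_fitness(genome: Genome) -> float:
    return sum(1 for g, t in zip(genome, TARGET) if g == t)


best = evolve(toy_fitness, n_bits=len(TARGET))
edges = decode(best, n_nodes=4)
```

With a real fitness function, every call trains and validates one network, so the population can be evaluated in parallel, which is the motivation for the GPU cluster mentioned above.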
These neural network structural search and optimization
techniques can readily be extended to the multimodal setting, provided that a suitable representation of the network architecture
is devised and that the cost of training and testing
multiple architectures during the search process is not prohibitive. With data set sizes reaching gigabyte
and even terabyte levels, and deep network architectures
involving millions of parameters and multiple modalities,
search and optimization of the multimodal fusion structure can
be prohibitively expensive unless some parallel search procedure is implemented or an efficient optimization algorithm
is used. Bayesian optimization (BO) [82], a
popular choice for hyperparameter optimization, has
recently also been applied to multimodal fusion architecture optimization [83]. There, architecture optimization was cast as a discrete
optimization problem: a Gaussian process-based BO searched the space of all possible
multimodal fusion architectures, and a novel graph-induced kernel was proposed to
quantify the distance between different architectures in the
search space.
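To make the role of such a kernel concrete, the sketch below defines a similarity between discrete architectures that a Gaussian process surrogate could use. It is only a stand-in and not the graph-induced kernel of [83]: the edge-set representation, the symmetric-difference distance, the exponential form, and the node names are all assumptions.

```python
import math
from typing import FrozenSet, Tuple

Edge = Tuple[str, str]
Arch = FrozenSet[Edge]


def arch_distance(a: Arch, b: Arch) -> int:
    """Count the connections present in exactly one of the two architectures."""
    return len(a ^ b)


def arch_kernel(a: Arch, b: Arch, length_scale: float = 2.0) -> float:
    """Similarity in (0, 1]; equals 1 only for identical architectures."""
    return math.exp(-arch_distance(a, b) / length_scale)


# Three hypothetical fusion architectures described by their connections.
early = frozenset({("audio", "fuse1"), ("video", "fuse1"), ("fuse1", "out")})
late = frozenset({("audio", "out"), ("video", "out")})
mixed = early | {("text", "out")}

d_mixed = arch_distance(early, mixed)  # -> 1
d_late = arch_distance(early, late)    # -> 5
```

Architectures that differ by a single connection receive a kernel value close to 1, so an observation of one informs the surrogate's prediction for the other.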
Reinforcement learning [84] has also been used for deep
neural architecture search [85]. This work proposed a novel
method of using an RNN to generate variable-length model
descriptions of neural networks. The RNN was trained with
reinforcement learning to maximize the expected accuracy
of the generated architectures on a validation set.
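The training signal can be illustrated with a much smaller controller than the RNN of [85]. In the sketch below, which is an assumption-laden simplification, each architectural decision has its own independent softmax, the reward is a toy function standing in for validation accuracy, and a moving-average baseline reduces variance; none of the names or settings come from the cited work.

```python
import math
import random
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_search(reward: Callable[[List[int]], float],
                     n_choices: Sequence[int], steps: int = 3000,
                     lr: float = 0.1, seed: int = 0) -> List[List[float]]:
    """Train one softmax per decision with the REINFORCE update."""
    rng = random.Random(seed)
    logits = [[0.0] * k for k in n_choices]
    baseline = 0.0
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        # Sample one option per decision to form a candidate architecture.
        arch = [rng.choices(range(len(p)), weights=p)[0] for p in probs]
        r = reward(arch)
        advantage = r - baseline
        baseline += 0.05 * (r - baseline)  # moving-average baseline
        for d, (p, a) in enumerate(zip(probs, arch)):
            for k in range(len(p)):
                # Gradient of log p(a) with respect to logit k.
                grad = (1.0 if k == a else 0.0) - p[k]
                logits[d][k] += lr * advantage * grad
    return [softmax(row) for row in logits]


# Toy reward: fraction of decisions that match a hypothetical best design.
TARGET = [2, 0]


def toy_reward(arch: List[int]) -> float:
    return sum(a == t for a, t in zip(arch, TARGET)) / len(TARGET)


policy = reinforce_search(toy_reward, n_choices=[3, 2])
```

An RNN controller replaces the independent softmaxes so that each decision can depend on the earlier ones and the description can vary in length.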
A number of recent works have approached structure learning as a means of regularization, or capacity control, in a network. By pruning the network in a stochastic manner, stochastic
regularization methods can be considered as a kind of ensemble that improves generalization via model averaging. Kulkarni et al. [86] implemented a method of learning the structure of
DNNs via deterministic regularization. They insert, between
each pair of fully connected layers, a sparse diagonal matrix
whose entries are ℓ1-penalized. This implicitly defines the size
of the effective weight matrices at each layer. The approach
has a similar effect to Dropout [87]. Blockout [88] can perform simultaneous regularization and model selection through
a clever technique that stochastically assigns hidden units to
"clusters," forming block-structured weight matrices. In addition, averaging the outputs of multiple stochastic inference
passes (which can be viewed as an ensemble of classifiers)
achieved results better than those of ResNets. This architecture
effectively implements a late fusion of multiple architectures
to achieve better results.
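The diagonal-matrix idea attributed to [86] above can be sketched as follows. The forward pass computes W2 diag(d) W1 x, and an ℓ1 penalty on d is handled here with a soft-thresholding step. That proximal step, the function names, and the example values are assumptions made for illustration and are not taken from [86].

```python
from typing import List


def soft_threshold(d: List[float], lam: float) -> List[float]:
    """Shrink each entry toward zero by lam; small entries become exactly 0."""
    return [max(abs(v) - lam, 0.0) * (1.0 if v > 0 else -1.0) for v in d]


def gated_forward(W1: List[List[float]], d: List[float],
                  W2: List[List[float]], x: List[float]) -> List[float]:
    """Compute W2 @ diag(d) @ W1 @ x with plain lists."""
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    gated = [g * h for g, h in zip(d, hidden)]
    return [sum(w * gi for w, gi in zip(row, gated)) for row in W2]


def effective_width(d: List[float]) -> int:
    """Number of hidden units whose gate is still nonzero."""
    return sum(1 for v in d if v != 0.0)


d = [0.8, 0.05, -0.3, 0.02]             # diagonal entries (gates)
shrunk = soft_threshold(d, lam=0.1)     # two small gates are set to zero
width = effective_width(shrunk)         # -> 2

W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
W2 = [[1.0, 1.0, 1.0, 1.0]]
y = gated_forward(W1, shrunk, W2, x=[2.0, 1.0])
```

Hidden units whose gate reaches zero contribute nothing to the output, so the corresponding rows of W1 and columns of W2 can be dropped, which is how the diagonal entries determine the effective layer size.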
Stochastic regularization has been extended to the multimodal
setting by Neverova et al. [8] and, more recently, by Li et al. [89].
In the latter work, the authors show that, when the intermodality
correlation is high, an early-fusion approach (whose fusion structure was learned by the network) produces better results, while a
late-fusion approach works better when the input modalities are
less correlated. This agrees with the empirical choice made
in the former work.
In this section, we have covered a number of recent works that
use either stochastic regularization or optimization to obtain
deep multimodal fusion architectures that perform on par with or
better than meticulously designed ones. While feature engineering has been largely solved by deep representation learning, the
next logical step would be to do away with meticulous engineering of deep architectures and pursue techniques that design
them automatically.

Data sets
To facilitate research in multimodal learning, a number of data
sets have been released to the public. We note that the majority of
