Signal Processing - November 2017 - 121

distributions are the same. Many nonlinear domain adaptation
methods apply the MMD in different ways to align the source
and target domains.
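To make the alignment idea concrete, the following is a minimal sketch of how a squared MMD between source and target samples could be estimated. It assumes a Gaussian (RBF) kernel and the biased empirical estimator; the text does not fix a kernel at this point, and the function names and the bandwidth parameter `gamma` are illustrative rather than taken from the cited works.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y) with x in xs, y in ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, gamma=1.0):
    """Biased empirical estimate of the squared MMD.

    Equals the squared distance between the kernel mean embeddings
    of the two samples, so it is non-negative and zero when the
    two samples coincide.
    """
    return (
        mean_kernel(source, source, gamma)
        - 2.0 * mean_kernel(source, target, gamma)
        + mean_kernel(target, target, gamma)
    )


# Illustrative data (assumed): the target is a shifted copy of the source.
source = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
target = [[x + 2.0, y + 2.0] for x, y in source]

print("MMD^2(source, source):", mmd_squared(source, source))
print("MMD^2(source, target):", mmd_squared(source, target))
```

A domain adaptation method that uses the MMD would typically minimize a quantity of this kind over a transformation or a set of instance weights, rather than merely evaluate it as done here.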
In [53], the authors apply the MMD to introduce the domain
transfer SVM (DTSVM) for video concept detection. Some
other procedures apply the MMD to reweight the data points in the
source domain, emphasizing source data that are similar to the target
when training a domain-adaptive classifier [42], [54], [55]. Spectral methods apply the MMD to achieve nonlinear alignment of
domains. Kernel principal component analysis (kernel PCA) is
combined with the MMD to determine nonlinear projections that
map the source and target to a common subspace, as in [32],
[56], and [57]. Manifold-based approaches are also popular in
domain adaptation for computer vision, where the subspace of a
domain is treated as a point on the manifold. The curve connecting the source and target subspaces is sampled to determine the transformations
needed to map the source subspace into the target subspace [58]. In [30], the authors determine a product of an
infinite number of such transformations that projects the source
subspace into the target subspace using the geodesic flow kernel.
These are some of the popular techniques for domain adaptation
without using deep networks. More detailed surveys of shallow
domain adaptation approaches can be found in [1], [59], and [60].
Likewise, the bounds for the expected error on the target and the
theoretical foundations of domain mismatch are outlined in [61]
and [62].

Insights
Recent years have seen deep-learning systems outperform
most nondeep-learning techniques across multiple problems in
computer vision, including domain adaptation. Does this mean
that shallow domain adaptation procedures are obsolete? Not
quite. Most deep-learning domain adaptation procedures
build on shallow domain adaptation techniques, as outlined in the following sections [63]-[65]. Objective functions
based on shallow domain adaptation procedures guide deep
networks to extract highly adaptive representations. Research
and advances in shallow adaptation techniques are necessary
for progress in domain adaptation. Shallow methods do not
require expensive graphics processing unit systems for deep
learning. When training data sets are small or real-time performance is needed, shallow domain adaptation techniques are
preferred over deep-learning systems.

Deep-learning domain adaptation: Survey
In recent years, deep neural networks have revolutionized the
fields of machine learning and computer vision. Deep-learning-based domain adaptation has outperformed nondeep-learning
algorithms because of the highly discriminatory nature of the
features extracted using deep neural networks. The progress of
research in computer vision can be directly linked to the advances in feature extraction and representation techniques. Feature
representation is the process of representing the spatial (or spatiotemporal) information in an image (or video) as a vector. Feature descriptors like SIFT and HOG are handcrafted techniques
for feature representation that are task and data agnostic. Feature
representations determined using deep networks are task and
data specific. The loss functions guide the network in determining the best features for a given data set to achieve a specific task.
This is the main advantage of using deep neural networks, which
becomes more evident in domain adaptation.
Shallow domain adaptation approaches are considered
to be fixed representation approaches. In a fixed representation approach, the features are predetermined and fixed, and
domain adaptation is performed using these predetermined
features. On the other hand, deep-learning-based domain
adaptation methods extract transferable feature representations specific to the data and the adaptation task at hand.
The unrivaled success of deep-learning methods in domain
adaptation can be attributed to this aspect. Feature representations using deep neural networks are highly nonlinear due to
multiple levels of nonlinearity in the feature extraction process. They are also termed hierarchical features due to the
hierarchical nature of the model and the nonlinear multilayer
structure of the network. In this section, we categorize the literature in domain adaptation based on these hierarchical feature representations.

Naïve hierarchical methods
Deep convolutional neural networks (CNNs) trained on millions of images have been shown
to be very good feature extractors, not just for the data set they are trained on, but for any
generic image. In [66], Razavian et al. demonstrate how
a deep CNN trained on the ILSVRC 2013 ImageNet data set
[67] can be used to extract generic features for any image.
Regular SVMs trained on these generic features have shown
astounding results across multiple applications like scene recognition, fine-grained recognition, attribute recognition, and
image retrieval. A pretrained CNN can therefore be used to extract generic features for both the source and the target. This can be termed a
naïve form of domain adaptation.
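The structure of this naïve approach can be sketched in a few lines, under heavy simplification. In the hypothetical example below, a fixed random projection followed by a ReLU stands in for the pretrained CNN, and a nearest-centroid classifier is swapped in for the SVM used in [66]; both substitutions are assumptions made only to keep the sketch self-contained, and all function names are illustrative. What matters is the pipeline: the extractor is frozen, a classifier is fit on source features, and the same extractor and classifier are then applied to the target.

```python
import random


def make_frozen_extractor(in_dim, feat_dim, seed=0):
    """Stand-in for a pretrained network (assumption): a fixed random
    linear projection followed by a ReLU. The weights never change."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
               for _ in range(feat_dim)]

    def extract(x):
        return [max(0.0, sum(w * v for w, v in zip(row, x)))
                for row in weights]

    return extract


def fit_centroids(features, labels):
    """Fit a nearest-centroid classifier: one mean feature vector per class."""
    sums, counts = {}, {}
    for feat, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], feat)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]]
            for label in sums}


def predict(centroids, feature):
    """Return the label whose centroid is closest in squared distance."""
    return min(
        centroids,
        key=lambda label: sum((c - f) ** 2
                              for c, f in zip(centroids[label], feature)),
    )


# Illustrative data (assumed): labeled source points, unlabeled target points.
source_x = [[1.0, 0.2], [0.9, 0.1], [0.1, 1.0], [0.2, 0.9]]
source_y = ["a", "a", "b", "b"]
target_x = [[1.1, 0.3], [0.0, 1.2]]

extract = make_frozen_extractor(in_dim=2, feat_dim=8)
centroids = fit_centroids([extract(x) for x in source_x], source_y)
target_pred = [predict(centroids, extract(x)) for x in target_x]
print(target_pred)
```

No step in this pipeline uses the target data to adjust the representation, which is why the approach is called naïve: any adaptation comes solely from the generality of the frozen features.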
Pretrained deep neural networks can also be fine-tuned to
the task at hand. It is well documented that the lower layers of a
CNN extract generic features that are common across multiple
tasks, and the upper layers extract task-specific features. Features transition from general to specific by the last layer of the
CNN. The work by Yosinski et al. in [68] captures the extent
of generality and specificity of neurons in each layer. Transferability has been shown to be negatively affected by two issues:
1) the specificity of neurons (to the source task) in the upper
layers adversely affects transfer to the target task, and 2) the
fragile nature of dependencies between layers that are task
specific inhibits the reuse of layers across different tasks. Adding new layers to a pretrained (trained on source data) network
and retraining it with target data is another intuitive method to
transfer knowledge in a deep-learning setting. When the entire
newly adapted network is fine-tuned with target data, it can
lead to a very efficient adaptation. This form of adaptation has
been explored in [69]. The authors demonstrate a procedure to
reuse the layers trained on the ImageNet data set to compute
midlevel representations for images. Despite the differences

IEEE SIGNAL PROCESSING MAGAZINE
