Signal Processing - November 2017 - 124

trained to distinguish between the features of the source and the
target. The key to the DaNN is the gradient reversal layer connecting the bottom feature extraction layers and the domain classifier. During back propagation, the gradient from the domain
classifier is reversed when learning the feature extractor weights.
This technique is popularly called the -gradient reversal. In this
way, the feature extractor is trained to extract domain invariant
features. A closely related work is presented in [87].
Liu and Tuzel implement a coupled GAN model (CoGAN)
in [88]. The CoGAN trains a coupled network, which shares
weights at different layers of the GAN. It is set up so that the
lower layers of the generators and the upper layers of the discriminators share weights. A common noise vector, z, is fed
into the two generators g 1 ($) and g 2 ($) to generate outputs
g 1 (z) and g 2 (z). These outputs are fed into two discriminators f1 ($) and f2 ($). Discriminator f1 ($) is trained to discriminate between g 1 (z) and the source x s . Likewise, discriminator
f2 ($) is trained to discriminate between g 2 (z) and the target
x t . Additionally, the source discriminator has a softmax layer
to classify the source data points x s . The CoGAN was tested
with MNIST and USPS data to yield impressive unsupervised
domain adaptation results. In an extension to the CoGAN, the
authors Liu et al. [89] develop an image translation network
that combines the CoGAN with a variational autoencoder [90].
This image translation network converts images in the target
domain to images in the source domain, which enables efficient classification of target data with a source classifier.

Insights
Adversarial methods are the latest trend in deep learning, and
they have shown remarkable performance in domain adaptation. When there is a need to model the source domain distribution, generative models like adversarial networks are beneficial.
GANs can be used for image translation [91], converting images
from one domain to the other [33], [89]. However, they require
large data sets to fully train a deep network since fine-tuning has
so far not been implemented with GANs.

Miscellaneous hierarchical methods
One of the earliest procedures for deep-learning domain adaptation was proposed by Chopra et al. [92]. The deep learning for
domain adaptation by interpolation between domains learns a
cross-domain representation by interpolating the path between
the source and target domains along the lines of [58]. Multiple
data sets with varying ratios of source and target data points are
sampled to create intermediate representations between the two
domains. The final cross-domain feature is a concatenation of all
of the intermediate feature representations.
Hu et al. [93] develop a metric learning method for supervised
transfer learning using clustering. The deep transfer metric learning model trains a deep neural network to minimize intraclass
distances and increase interclass distances. Additionally, the features of the last layer of the network are learned to be domain
invariant by minimizing the MMD between the source and
target feature outputs. Sener et al. [94] develop a deep-learning
approach that imputes the labels for the target in a transductive
124

learning environment. Using these imputed target data labels, the
largest margin, nearest-neighbor loss is applied to ensure cyclic
consistency of label assignment, and a k-nearest neighbor graph
over the target data points is applied to implement structural consistency. The deep network predicts the labels so as to minimize
intraclass distances and maximize interclass distances.
Sun et al. [95] develop a domain transfer method called the
localized action frame (LAF) for fine-grained action localization
in temporally untrimmed videos. The LAF motivates domain
transfer between weakly labeled web images and videos. The
domain transfer works in both directions: the video frames are
used to select web images that are relevant and drop nonaction
web images, and, in turn, the web images are used to select actionlike frames and drop nonaction frames in the video. After the relevant frames and images are selected, a long short-term memory
network to is used to train a fine-grained action detector to model
the temporal evolution of actions and classify the action in the
frames. The work also released a data set of sports videos with
more than 130,000 videos from 240 categories.
Bousmalis et al. [96] train domain separation networks to
extract domain-invariant feature representations and domainspecific representations of source and target data. A shared
encoder network E c (x) is trained to extract domain invariant
feature representations for the source and the target data. Private
encoder networks E sp (x) and E tp (x) for the source and target,
respectively, are trained to extract feature representations that
are distinct from the domain-invariant representations that are
the outputs of E c (x). A shared decoder network is trained to
reconstruct the original input data based on the outputs from
E c (x), E sp (x), and E tp (x). A classifier is trained with the source
outputs of E c (x). The feature representations that are the outputs
of E c (x) can be declared domain-invariant.

Insights
The miscellaneous hierarchical methods do not fall into any of
the previous categories. With the success of deep learning, there
has been a significant growth in the adoption of deep learning
for domain adaptation. So far, there has not been any classification system for the steadily increasing number of deeplearning domain adaptation models. This article provides a
preliminary classification system to guide researchers studying
deep-learning domain adaptation. With the introduction of new
data sets, the modeling of domain-shift, and advances in deep
domain adaptation, a more rigorous classification system can be
developed over time.

Directions for future research
While the problem of variations in data coming from different
distributions is outstanding, the solutions provided by domain
adaptation models are not easily applied to real-world applications in computer vision. This can be attributed to the manner in which domain adaptation models are currently being
developed and evaluated in the research community. From the
early approaches toward general domain adaptation [97] and
vision-based domain adaptation [58] to current-day models
[4], [33], the standard setup with well defined source and target

IEEE SIGNAL PROCESSING MAGAZINE

November 2017

Table of Contents for the Digital Edition of Signal Processing - November 2017