Signal Processing - November 2017 - 81

objective on the lifted problem. Experiments on three large-scale data sets demonstrated significant improvements over
existing deep feature embedding methods.
Cui et al. [29] presented an iterative framework for fine-grained visual categorization with human-in-the-loop information. Their method can handle three challenges in existing
fine-grained visual categorization methods: a lack of training
data, a large number of fine-grained categories, and high intraclass
versus low interclass variance. Using DML with humans in the
loop, a low-dimensional feature embedding with anchor points
on manifolds was learned for each category, where these anchor
points captured intraclass variances and remained discriminative
among different classes. In each round, images with high confidence scores from the model were sent to humans for labeling. By comparing these images with exemplar images, labelers
marked each candidate image as either a true positive or a false
positive. True positives were added into the current data set and
false positives were considered as hard negatives for the DML
model. Then the model was retrained with the expanded data set
and hard negatives for the next round. The proposed
DML method was evaluated on two fine-grained data sets.
Experimental evaluations showed that their method achieved significant performance gain over state-of-the-art methods.
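One round of the labeling loop described above can be sketched as follows. This is a minimal illustration, not the implementation of [29]: the function name `run_round`, the callables `score` and `is_true_positive`, and the confidence threshold of 0.8 are all assumptions, and the retraining step that follows each round is omitted.

```python
def run_round(dataset, hard_negatives, candidates, score, is_true_positive,
              threshold=0.8):
    """One human-in-the-loop round (hypothetical helper, for illustration)."""
    for image in candidates:
        if score(image) < threshold:
            continue  # low-confidence images are not sent to humans
        if is_true_positive(image):
            dataset.append(image)         # true positive: expand the data set
        else:
            hard_negatives.append(image)  # false positive: keep as hard negative
    return dataset, hard_negatives


# Toy usage: images are stand-in strings; scores and labels come from tables.
scores = {"a": 0.9, "b": 0.95, "c": 0.3}
truth = {"a": True, "b": False, "c": True}
dataset, hard_negatives = run_round([], [], ["a", "b", "c"],
                                    scores.get, truth.get)
```

In this toy run, image "c" is never shown to the labeler because its score falls below the assumed threshold, while "b" is a confident mistake and therefore becomes a hard negative.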
Shi et al. [31] proposed a deep metric embedding method
with triplet loss for person reidentification. Their method introduced a positive sample mining method to train a robust CNN for
person reidentification. In addition, a metric weight constraint
was used to improve the learning, so that the learned metric
has a better generalization ability. They empirically found that
both of these tricks improve the reidentification performance.
Lim et al. [33] proposed a competitive approach for style similarity learning of three-dimensional (3-D) shapes using DML,
which made use of recent advances in triplet-based metric learning with neural networks. Their method has
four key advantages:
■■ it explored DML techniques for perceived style similarities of
3-D shapes
■■ it showed that rendered images of 3-D geometry from multiple viewpoints were an appropriate representation and how
salient views can be selected
■■ it used a triplet sampling method that does not rely on style
class labels and allows for an efficient learning procedure
■■ it showed how heterogeneous data sources in the form of 3-D
geometry and annotated photographs found online can be
integrated into the DML method.

DML via other networks
There are also some DML methods via other networks. For example, Batchelor and Green [22] proposed using DML on CNNs to
learn features with good locality for object recognition. In particular, they considered two metric learning methods: neighborhood
components analysis and mean square error's gradient minimization (MEGM). They utilized a nonlinear form of MEGM as an
alternative to neighborhood components analysis and proposed
some stochastic sampling methods to apply them to larger data
sets with a minibatch stochastic gradient descent algorithm.

Sohn [32] proposed a DML method using a multiclass N-pair
loss. The method first generalized triplet loss by allowing
joint comparison among more than one negative example. Then,
N - 1 negative examples were considered to reduce the computational burden of evaluating deep embedding vectors. Sohn demonstrated the superiority of this method over other competing
loss functions for a variety of tasks such as fine-grained object
recognition and verification, image clustering and retrieval, and
face verification and identification.
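For a single anchor, a loss that compares one positive against several negatives jointly can be sketched as below. The inner-product similarity and the log-sum-exp form are the commonly cited form of the N-pair loss; the function names and toy vectors are illustrative assumptions, and the batch construction used in [32] is not reproduced here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def n_pair_loss(anchor, positive, negatives):
    """log(1 + sum_i exp(anchor . negative_i - anchor . positive))."""
    pos_sim = dot(anchor, positive)
    return math.log1p(sum(math.exp(dot(anchor, n) - pos_sim)
                          for n in negatives))


# Toy embeddings (assumed values): one positive and N - 1 = 2 negatives.
anchor = [1.0, 0.0]
positive = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss = n_pair_loss(anchor, positive, negatives)
```

With a single negative, the expression reduces to a smooth, triplet-style comparison between one positive and one negative similarity.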

Visual understanding applications
In this section, we show various visual understanding applications
via DML, including face recognition, image classification, visual search, person reidentification, visual tracking,
cross-modal matching, and image set classification.

Face recognition
Chopra et al. [18] learned a similarity metric for face verification.
Their approach learned a CNN-based mapping from the input space
to the target space, where the L1 norm can directly approximate the
semantic distance. Cai et al. [20] learned a nonlinear metric using
the deep ISA network. Compared with kernel-based methods, deep
models present strong discriminative power and better exploit the
nature of the data set. Sun et al. [7] proposed a DeepID2 method to
increase the interpersonal variations with the identification signals,
and reduce the intrapersonal distances with the verification signals.
Taigman et al. [21] presented the DeepFace network by exploiting
a 3-D face model and training a nine-layer CNN network. Hu et
al. [6] presented a DDML method by learning a set of hierarchical
nonlinear transformations, under which the distance between positive pairs
is smaller than that between negative pairs by a threshold. They also proposed a
DTML method [23] for cross-data set face recognition. DTML transferred
the information from the labeled source domain to the unlabeled target domain, and minimized their distribution divergence. Schroff et
al. [24] proposed a FaceNet method by learning a projection to map
facial images to a compact Euclidean space. With the learned embedding, feature vectors can be directly used to measure the similarity of
faces. Most of these DML methods achieved state-of-the-art performance on the widely used LFW and YouTube Face data sets.
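The idea of separating positive and negative pairs by a threshold can be illustrated with a pairwise hinge penalty on embedded feature vectors. This is a sketch under stated assumptions rather than the exact objective of any method above: the hinge form, the threshold `tau = 2.0`, and the unit margin are chosen only for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pair_loss(u, v, same, tau=2.0):
    """Zero once a pair clears the assumed threshold tau by a margin of 1."""
    label = 1.0 if same else -1.0
    return max(0.0, 1.0 - label * (tau - squared_distance(u, v)))


close_pair = pair_loss([0.0, 0.0], [0.5, 0.0], same=True)   # d^2 = 0.25
far_pair = pair_loss([0.0, 0.0], [2.0, 0.0], same=False)    # d^2 = 4.0
confused = pair_loss([0.0, 0.0], [0.5, 0.0], same=False)    # d^2 = 0.25
```

Only the third pair is penalized: it is labeled as a negative pair, yet its embeddings lie well inside the threshold.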

Image classification
Batchelor and Green [22] utilized a CNN architecture to learn a
deep nonlinear metric, where the learned features with good locality show good performance and generalization for image classification. Hoffer and Ailon [26] utilized a triplet-based network to
learn deep metrics by distance comparisons. The triplet network
contains three instances of networks with shared parameters,
where three samples with a positive pair and a negative pair can be
simultaneously fed into the network. Cui et al. [29] learned a deep
metric for fine-grained categorization. Humans labeled high-confidence
images in each loop to expand the data set and collect hard negatives, and the network was retrained in the next loop.
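The shared-parameter structure of a triplet network can be sketched with one embedding function applied to all three inputs. The linear `embed` map, the identity weights, and the hinge loss with `margin = 1.0` are assumptions made for illustration; the hinge form is a commonly used triplet loss and is not necessarily the comparison function used in [26].

```python
def embed(x, weights):
    """Toy linear embedding; the same weights serve all three branches."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, weights, margin=1.0):
    """Hinge triplet loss on embeddings produced with shared weights."""
    a, p, n = (embed(x, weights) for x in (anchor, positive, negative))
    return max(0.0, squared_distance(a, p) - squared_distance(a, n) + margin)


weights = [[1.0, 0.0], [0.0, 1.0]]  # identity map, for illustration only
satisfied = triplet_loss([0.0, 0.0], [0.1, 0.0], [2.0, 0.0], weights)
violated = triplet_loss([0.0, 0.0], [1.0, 0.0], [1.0, 0.0], weights)
```

Because the three branches share `weights`, a gradient step on this loss would update a single set of parameters, which is what makes the three network instances one model.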

Visual search
Wu et al. [8] proposed an online multimodal deep similarity learning method for visual search. They applied deep-learning techniques to

IEEE SIGNAL PROCESSING MAGAZINE
