Signal Processing - May 2017 - 86
FIGURE 5. The visualization of the anchor-position vector a_n [1].
thoroughly studied in [1], and the main result is summarized below. A representative 2-D input and its corresponding anchor vectors are shown in Figure 5. Let a_n be a K-dimensional vector formed by the same position (or element) of all anchor vectors a_k. It is called the anchor-position vector since it captures the position information of the anchor vectors. Although the anchor vectors a_k capture global representative patterns of x, they are weak in capturing position-sensitive information. This shortcoming can be compensated for by modulating outputs with elements of the anchor-position vector a_n in the next layer.
Let us use layers S4, C5, and F6 in
LeNet-5 as an example. There are 120
anchor vectors of dimension 400 from
S4 to C5. We collect 400 anchor-position
vectors of dimension 120, multiply the
output at C5 by them to form a set of
modulated outputs, and then compute 84
anchor vectors of dimension 120 from C5
to F6. Note that the output at C5 contains primarily the spectral information but not the position information. If a position in the input vectors carries less consistent information, the variance of its associated anchor-position vector will be larger and the modulated output will be more random. As a result, its impact on the formation of the 84 anchor vectors is reduced. For more details, we refer to the discussion in [1].
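The dimension bookkeeping in the LeNet-5 example above can be sketched in NumPy. This is a hypothetical illustration: the anchor-vector values below are random placeholders rather than trained weights, and only the shapes follow the text.

```python
import numpy as np

# Dimensions taken from the LeNet-5 example in the text (S4 -> C5).
N, K = 400, 120                    # input dimension, number of C5 anchor vectors
rng = np.random.default_rng(0)

A = rng.standard_normal((K, N))    # K anchor vectors a_k of dimension N
x = rng.standard_normal(N)         # a sample input vector at S4

y = np.maximum(A @ x, 0.0)         # rectified output at C5, dimension K

# The N anchor-position vectors a_n are the columns of A: each collects
# the n-th element of every anchor vector a_k.
anchor_position = A.T              # shape (N, K); row n is a_n

# Modulated outputs: elementwise product of the C5 output with each a_n.
modulated = anchor_position * y    # shape (N, K); row n is a_n * y (elementwise)
```

This yields 400 modulated outputs of dimension 120, from which the 84 anchor vectors from C5 to F6 would then be computed.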
New clustering representation
In traditional clustering schemes, there is a one-to-one association between a data sample and its cluster. However, this is not the case in the RECOS transform. MLPs and CNNs adopt a new clustering representation: for an input vector x, the RECOS transform generates a set of K nonnegative correlation values as the output vector of dimension K.
K. This representation enables repetitive clustering layer by layer as given in
(4). For an input, one can determine the
significance of clusters according to the
magnitude of the rectified output value. If
its magnitude for a cluster is zero, x is not
associated with that cluster. A cluster is
called a relevant or irrelevant one depending on whether it has an association with
x. Among all relevant ones, we call cluster i the primary cluster for input x if

i = arg max_k a_k^T x.
The remaining relevant ones are auxiliary clusters.
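As a concrete sketch, the relevant/primary/auxiliary split can be computed directly from the anchor-vector matrix. The helper name `cluster_roles` is our own, not from [1]; it assumes the rectification is a plain ReLU.

```python
import numpy as np

def cluster_roles(A, x):
    # A: (K, N) matrix whose rows are the anchor vectors a_k; x: input of dim N.
    corr = A @ x                               # a_k^T x for every cluster
    rectified = np.maximum(corr, 0.0)          # K nonnegative output values
    relevant = np.flatnonzero(rectified > 0)   # clusters associated with x
    primary = int(np.argmax(corr))             # i = arg max_k a_k^T x
    auxiliary = relevant[relevant != primary]  # the remaining relevant clusters
    return rectified, primary, auxiliary
```

For example, with anchor vectors (1, 0), (0, 1), and (-1, 0) and input x = (2, 1), the first cluster is primary, the second is auxiliary, and the third is irrelevant.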
The FE subnet uses anchor vectors to
capture local, midrange, and long-range
spatial patterns. It is difficult to predict
the clustering structure since new information is introduced at a new layer. The
DM subnet attempts to reduce the dimension of intermediate representations until
it reaches the dimension of the decision
space. We observe that the clustering structure becomes more obvious as the layers of the DM subnet go deeper. That is, the output value from the primary cluster gets closer to unity, the number of auxiliary clusters decreases, and their output values become smaller. When this happens, an anchor vector provides a good approximation to the centroid of the corresponding cluster.
The choice of the number of anchor vectors, K_l, at the lth layer is an important problem in network design. If the input data x_(l-1) has a clear clustering structure (say, with h clusters), we can set K_l = h. However, this is often not the case. If K_l is set too small, we cannot capture the clustering structure of x_(l-1) well, and more layers will be needed to split the clusters. If K_l is set too large, there are more anchor vectors than needed, and a stronger overlap between rectified output vectors will be observed. As a result, we still need more layers to separate them.
Another way to control the clustering process is the choice of the threshold value, z, of the TReLU. A higher threshold value can reduce the negative impact of a larger K_l value. The tradeoff between z and K_l is an interesting future research topic.
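A minimal sketch of the TReLU nonlinearity mentioned above, assuming it simply zeroes correlations at or below the threshold z (the exact form used in [1] may differ):

```python
import numpy as np

def trelu(v, z=0.0):
    # Truncated ReLU: pass a value only when it exceeds the threshold z.
    # With z = 0 this reduces to the ordinary ReLU.
    return np.where(v > z, v, 0.0)
```

Raising z suppresses weakly correlated clusters, which is why a higher threshold can offset an overly large K_l.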
Network initialization and guided
anchor vector update
Data clustering plays a critical role in the
understanding of the underlying structure
of data. The k-means algorithm, probably the best-known clustering method, has been widely used in pattern
recognition and supervised/unsupervised
learning. As discussed previously, each
CNN layer conducts data clustering on
the surface of a high-dimensional sphere
based on a rectified geodesic distance.
Here, we would like to understand the
effect of multiple layers in cascade from
the input data source to the output decision label. For unsupervised learning such
as image segmentation, several challenges
exist in data clustering [8]. Questions such
as "What is a cluster?" "How many clusters are present in the data?" and "Are the
discovered clusters and partition valid?"
remain open. These questions expose the limits of unsupervised data clustering methods.
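The per-layer clustering described above can be mimicked with a small k-means variant on the unit sphere. This is a sketch only: cosine similarity stands in for the rectified geodesic distance of [1], and the initialization is naive.

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    # k-means on the unit sphere: assign by largest cosine similarity,
    # then renormalize each updated centroid back onto the sphere.
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # project data to sphere
    C = X[rng.choice(len(X), k, replace=False)]       # naive centroid init
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmax(X @ C.T, axis=1)           # nearest by cosine sim.
        for j in range(k):
            members = X[labels == j]
            if len(members):                          # keep centroid if empty
                c = members.sum(axis=0)
                C[j] = c / np.linalg.norm(c)          # renormalize centroid
    return C, labels
```

On two well-separated direction bundles this recovers the two clusters; in a CNN layer, the anchor vectors play the role of the (normalized) centroids.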
In the context of supervised learning,
traditional feature-based methods extract
features from data, conduct clustering
in the feature space, and, finally, build a
connection between clusters and decision
labels. Although it is relatively easy to
build a connection between the data and
labels through features, it is challenging to
find effective features. In this setting, the
dimension of the feature space is usually
significantly smaller than that of the data
space. As a consequence, it is unavoidable to sacrifice the rich diversity of the input data.
Furthermore, the feature selection process is guided by humans based on their domain knowledge (i.e., the most discriminant properties of different objects). This process is heuristic and can easily overfit. Human effort is needed in both data labeling and feature design.
CNNs offer an effective supervised
learning solution, where supervision is
conducted by a training process using
data labels. This supervision closes the
semantic gap between low-level representations (e.g., the pixel representation)
and high-level semantics. Furthermore,
the CNN self-organization capability was well discussed in the 1980s and
1990s, e.g., [6]. By self-organization, the
network can learn with little supervision.
To put the above two together, we expect