where $f(\cdot)$ denotes a nonlinear activation function, and $b$ is the intermediate result between the two operations.
The function, $f(\cdot)$, was chosen to be a delayed step function of the form $f(b) = u(b - \zeta)$ in [2], where $f(b) = 1$ if $b \ge \zeta$ and $f(b) = 0$ if $b < \zeta$. A neuron is on (in the one state) if the stimulus, $b$, is larger than the threshold $\zeta$. Otherwise, it is off (in the zero state). Multiple neurons can be flexibly connected into logic networks as models in theoretical neurophysiology.
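As a minimal sketch in Python, such a thresholded neuron can be written as follows; the affine form $b = a^T x + a_0$ is taken from the surrounding discussion, and all variable names are illustrative:

```python
import numpy as np

def mp_neuron(x, a, a0, zeta):
    """Thresholded neuron: y = u(b - zeta) with b = a^T x + a0.

    The neuron is on (returns 1) if the stimulus b reaches the
    threshold zeta, and off (returns 0) otherwise.
    """
    b = np.dot(a, x) + a0            # intermediate result b
    return 1 if b >= zeta else 0     # delayed step function u(b - zeta)

# Neurons can be wired into small logic networks; with weights (1, 1),
# zero bias, and threshold 2, a single neuron realizes logical AND.
print(mp_neuron(np.array([1.0, 1.0]), np.array([1.0, 1.0]), 0.0, 2.0))  # 1
print(mp_neuron(np.array([1.0, 0.0]), np.array([1.0, 1.0]), 0.0, 2.0))  # 0
```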
For the vision problem, input $x$ denotes an image (or image patch). The neuron should not generate a response for a flat patch, since such a patch does not carry any visual pattern information. Thus, we set $b = 0$ if all of the elements of $x$ are equal to a nonzero constant $\mu$. It is then straightforward to derive $\mu \sum_{n=1}^{N} a_n + a_0 = 0$, or $a_0 = -\mu \sum_{n=1}^{N} a_n$; that is, $a_0$ is a dependent variable. We can form augmented vectors $x' = (\mu, x_1, \ldots, x_N)^T \in \mathbb{R}^{N+1}$ and $a' = (a_0, a_1, \ldots, a_N)^T \in \mathbb{R}^{N+1}$ for $x$ and $a$, respectively. Without loss of generality, we assume $\mu = 0$ in the following discussion. If $\mu \ne 0$, we can either consider the augmented vector space of $x'$ or normalize input $x$ to be a zero-mean vector before the processing and add the mean back after the processing.
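The dependent-bias constraint is easy to verify numerically. The sketch below (illustrative names, random weights) checks that $a_0 = -\mu \sum_n a_n$ yields $b = 0$ on a flat patch and shows the equivalent zero-mean route:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
a = rng.normal(size=N)          # weights a_1, ..., a_N
mu = 0.5                        # nonzero flat-patch value

# Dependent bias a_0 = -mu * sum_n a_n suppresses flat patches.
a0 = -mu * a.sum()
flat_patch = np.full(N, mu)
b = a @ flat_patch + a0
print(np.isclose(b, 0.0))       # True: no response to a flat patch

# Equivalent route: subtract the mean first (so mu = 0 and a_0 = 0),
# process, and add the mean back afterward.
x = rng.normal(size=N)
b_centered = (x - x.mean()) @ a
```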
Multilayer perceptrons
The perceptron was introduced by Rosenblatt in [3]. One can stack multiple perceptrons side by side to form a perceptron layer and cascade multiple perceptron layers into one network. The result is called the MLP or the feedforward neural network.
An exemplary MLP is shown in Figure 1. In general, it consists of a layer of input nodes (the input layer), several layers of intermediate nodes (the hidden layers), and a layer of output nodes (the output layer). These layers are indexed from $l = 0$ to $L$, where the input and output layers are indexed with $0$ and $L$, and the hidden layers are indexed from $l = 1, \ldots, L-1$, respectively. Suppose that there are $N_l$ nodes at the $l$th layer. Each node at the $l$th layer takes all nodes in the $(l-1)$th layer as its input. For this reason, it is called the fully connected layer. Clearly, the MLP is end-to-end fully connected. A modern CNN often contains an MLP as its building module.

FIGURE 1. An exemplary MLP with one input layer, two hidden layers, and one output layer.
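As a sketch of this fully connected structure, a forward pass reduces to a chain of dense matrix-vector products followed by a nonlinearity; the layer sizes and the ReLU activation below are illustrative choices, not prescribed by the text:

```python
import numpy as np

def relu(b):
    return np.maximum(b, 0.0)

def mlp_forward(x, weights, biases):
    """Forward pass through fully connected layers l = 1, ..., L.

    Every node at layer l sees all N_{l-1} nodes of the previous
    layer, so each layer is a dense matrix-vector product.
    """
    h = x
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h

# Layer sizes N_0 = 4 (input), N_1 = N_2 = 8 (hidden), N_3 = 3 (output).
sizes = [4, 8, 8, 3]
rng = np.random.default_rng(1)
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y = mlp_forward(rng.normal(size=4), weights, biases)
print(y.shape)  # (3,)
```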
MLPs were studied intensively in the
1980s and 1990s as decision networks
for pattern recognition applications. The
input and output nodes represent selected
features and classification types, respectively. There are two major advances
from simple neuron-based logic networks
to MLPs. First, there was no training mechanism in the former, since they were not designed for machine learning. The BP technique was introduced in MLPs as a training mechanism for supervised learning. Since differentiation is needed in BP yet the step function is not differentiable, other nonlinear activation functions are adopted in MLPs. Examples include the sigmoid function, the rectified linear unit (ReLU), and the parameterized ReLU (PReLU).
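A sketch of these activations and the derivatives that BP needs is given below; the PReLU slope `alpha` is its learnable parameter, and treating the ReLU kink at $b = 0$ with a zero subgradient is a common convention assumed here:

```python
import numpy as np

def sigmoid(b):
    return 1.0 / (1.0 + np.exp(-b))

def sigmoid_grad(b):
    s = sigmoid(b)
    return s * (1.0 - s)

def relu(b):
    return np.maximum(b, 0.0)

def relu_grad(b):
    return (b > 0).astype(float)     # subgradient 0 taken at b = 0

def prelu(b, alpha=0.25):
    return np.where(b >= 0, b, alpha * b)

def prelu_grad(b, alpha=0.25):
    return np.where(b >= 0, 1.0, alpha)
```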
Second, MLPs have a modularized structure (i.e., perceptron layers) suitable for parallel processing.
As compared with traditional pattern recognition techniques based on simple linear analysis (e.g., linear discriminant analysis and principal component analysis), MLPs provide a more flexible mapping from the feature space to the decision space, where the distribution of feature points of one class can be nonconvex and irregular. This flexibility is built upon a solid theoretical foundation proved by Cybenko [4] and Hornik et al. [5]: a network with only one hidden layer can be a universal approximator if there are "enough" neurons.
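For concreteness, Cybenko's result [4] can be paraphrased as follows, with $\sigma$ a sigmoidal activation (the notation here is ours, not the article's):

```latex
% Universal approximation with one hidden layer (after Cybenko [4]):
% for any continuous f on [0,1]^N and any epsilon > 0 there exist a
% width M and parameters alpha_j, w_j, theta_j such that
G(x) = \sum_{j=1}^{M} \alpha_j \, \sigma\!\left( w_j^{T} x + \theta_j \right),
\qquad
\sup_{x \in [0,1]^{N}} \bigl| G(x) - f(x) \bigr| < \varepsilon .
```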
Convolutional neural networks
Fukushima's neocognitron [6] can be
viewed as an early form of a CNN. The
architecture introduced by LeCun et al. in
[7] serves as the basis of modern CNNs.
The main difference between MLPs and CNNs lies in their input spaces: the inputs of the former are features, while those of the latter are source data such as images, video, and speech. This is not a trivial difference. Let us
use the LeNet-5 shown in Figure 2 as an
example, whose input is an image of size $32 \times 32$. Each pixel is an input node. It would be very challenging for an MLP to handle this input, since the dimension of the input vector is $32 \times 32 = 1{,}024$.
The diversity of possible visual patterns is
huge. As explained later, the nodes in the first hidden layer should provide a good representation for the input signal. This implies a large number of nodes in the hidden layers. The number of links (or filter weights) between the input layer and the first hidden layer is $N_0 \times N_1$ due to full connection. This number can easily reach the order of millions. If the image dimension is itself on the order of millions, as with images captured by today's smartphones, the solution is clearly unrealistic.
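The arithmetic behind this claim is easy to reproduce; in the sketch below, the hidden-layer width of 1,024 is an illustrative assumption rather than an actual network design:

```python
# Fully connected links between the input and first hidden layers
# for a 32 x 32 image; the hidden width N1 = 1,024 is an assumption.
N0 = 32 * 32                 # 1,024 input nodes, one per pixel
N1 = 1024                    # assumed first-hidden-layer width
print(N0 * N1)               # 1,048,576 weights: about a million

# A 12-megapixel smartphone image under the same full connection:
N0_phone = 4000 * 3000
print(N0_phone * N1)         # ~1.2e10 weights for one layer alone
```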
Instead of considering interactions of
all pixels in one step as done in the MLP,
the CNN decomposes an input image
into smaller patches, known as receptive fields, for nodes at certain layers. It
gradually enlarges the receptive field to
cover a larger portion of the image. For
example, the filter size of the first two
convolutional layers of LeNet-5 is 5 × 5.
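A sketch of how stacked 5 × 5 filters grow the receptive field, using plain valid-mode convolution (LeNet-5's subsampling layers are omitted here for brevity, and the filter values are random placeholders):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(2)
image = rng.normal(size=(32, 32))        # LeNet-5-sized input
k1 = rng.normal(size=(5, 5))             # first 5 x 5 filter
k2 = rng.normal(size=(5, 5))             # second 5 x 5 filter

h1 = convolve2d(image, k1, mode="valid") # 28 x 28; each unit sees 5 x 5 pixels
h2 = convolve2d(h1, k2, mode="valid")    # 24 x 24; each unit now sees 9 x 9
print(h1.shape, h2.shape)                # (28, 28) (24, 24)
```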
The first convolutional layer considers