After the ImageNet challenge2 in 2012, CNNs began to dominate the field of image classification [30]. New innovations have appeared every year and research on CNNs has proliferated. AlexNet, the winner of the 2012 challenge, is considered the breakthrough CNN variant, with a top-5 test error rate (a score indicating whether the target label is among the five predictions with the highest probabilities) of only 17.0% on the ILSVRC-2010 test set [31]. In 2013, ZF Net achieved better accuracy through a slight modification of the AlexNet model and introduced a novel feature-map visualization technique [32]. Another AlexNet-style CNN is VGG Net, which has been widely adopted in the community because of its appealingly uniform network architecture [33]. Among all the submissions, GoogLeNet, proposed by Google, won the 2014 ImageNet challenge [34]. The core innovation in GoogLeNet is a repeated inception module, which significantly reduces the number of parameters. Another popular CNN is Microsoft's ResNet (residual network), the winner of the 2015 ImageNet challenge [35]. ResNet introduced the residual block, whose identity shortcut allows gradients to flow directly to earlier layers and thus makes weight updates in very deep networks more effective.
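To make the residual idea concrete, the sketch below shows a minimal residual block in PyTorch; the channel count and the use of two 3x3 convolutions are illustrative assumptions, not the exact configuration of [35].

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        # Two 3x3 convolutions form the residual branch F(x).
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # identity shortcut
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The shortcut lets gradients propagate directly to earlier layers.
        return self.relu(out + identity)

block = ResidualBlock(channels=64)
y = block(torch.randn(1, 64, 32, 32))             # output has the same shape as the input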
Autoencoders (AEs) are another popular type of deep feedforward neural network (FNN) used for unsupervised pre-training [36]. Different variants of AEs have been introduced to enhance their ability to extract informative representations, including the denoising autoencoder [37], [38], the sparse autoencoder [39], the variational autoencoder (VAE) [40] and the contractive autoencoder (CAE) [41].
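As a simple illustration of this family, the following PyTorch sketch trains a denoising autoencoder in the spirit of [37], [38]; the input dimension, hidden size and noise level are assumptions chosen for illustration.

import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Encoder-decoder trained to reconstruct clean inputs from corrupted ones."""
    def __init__(self, in_dim=784, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x_clean = torch.rand(32, 784)                         # a batch of (assumed) inputs in [0, 1]
x_noisy = x_clean + 0.2 * torch.randn_like(x_clean)   # corrupt the input
optimizer.zero_grad()
loss = loss_fn(model(x_noisy), x_clean)               # reconstruct the clean signal
loss.backward()
optimizer.step()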
In speech/acoustic recognition and other pattern recognition problems, generative and discriminative models are the two main approaches. Generative models attempt to model the distribution of the data, or the joint distribution of the data and the corresponding targets, while discriminative models directly predict the distribution of the targets conditioned on the data [42]. The Generative Adversarial Network (GAN) proposed by Goodfellow et al. can be considered a hybrid of the generative and discriminative models [43]. In the GAN framework, a generator G and a discriminator D are trained simultaneously: G learns to capture the data distribution, while D estimates the probability that a given sample came from the training data rather than from G.
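A minimal sketch of this adversarial setup is given below: G maps noise to samples and D outputs the probability that a sample is real. The network sizes, data dimension and optimizer settings are illustrative assumptions, not the configuration of [43].

import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64                     # assumed dimensions
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(32, data_dim)                  # stand-in for a batch of training data
z = torch.randn(32, latent_dim)
fake = G(z)

# Discriminator step: classify real data as 1 and generated data as 0.
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: push D to assign high probability to generated data.
g_loss = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()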
With the rapid development of efficient computation techniques and the growth of computing power,
implementing large-scale deep learning approaches
with computers is no longer a fantasy. The advent of
fast Graphics Processing Units (GPUs) and the availability of huge amounts of data significantly expedite the training and fine-tuning of deep learning models.
2 http://www.image-net.org/.
For speech synthesis, deep learning aims to remove the restrictions of the conventional statistical parametric synthesis methods based on Gaussian-hidden Markov models (HMMs) and other classical models [44]. A properly designed acoustic model plays an important role in HMM-based statistical parametric speech synthesis systems. Speech generated by shallow-structured HMM-based synthesizers is generally known for its poor fidelity compared with natural speech, and deep learning approaches have been adopted to offset this deficiency. Generative models such as Restricted Boltzmann Machines (RBMs) and Deep Belief Networks (DBNs) have displaced traditional Gaussian models, and concrete improvements in speech quality have been obtained [45], [46]. In contrast to generative deep models, discriminative DNN models can also be applied to speech synthesis. In [47], a DNN-based approach was proposed to predict spectral and excitation parameters; better synthetic speech quality was achieved with a number of parameters similar to that of the HMM-based synthesizer.
For voice conversion (VC), deep learning is usually used to learn the associations between spectral features for spectral mapping, which allows the modification of speech properties in a VC system. In [48], a two-layer feedback neural network based on bidirectional associative memory was reformulated to model the spectral envelope space of the speech signal. Experimental results showed that the proposed method has better modeling ability than the traditional use of Gaussians with diagonal covariance. In [49], a stacked joint-autoencoder was applied to construct a regression function used in a VC task. This method produced features that do not suffer from the averaging effect inherent in the backpropagation algorithm.
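To illustrate the discriminative approach of [47] in schematic form, the sketch below maps frame-level linguistic feature vectors to spectral and excitation parameters with a feedforward DNN; all dimensions, layer sizes and activations here are assumptions for illustration, not the published configuration.

import torch
import torch.nn as nn

# Assumed frame-level dimensions: linguistic context features in,
# spectral (e.g., mel-cepstral) plus excitation (e.g., F0, aperiodicity) parameters out.
ling_dim, spec_dim, exc_dim = 300, 60, 5

acoustic_model = nn.Sequential(
    nn.Linear(ling_dim, 512), nn.Tanh(),
    nn.Linear(512, 512), nn.Tanh(),
    nn.Linear(512, spec_dim + exc_dim),           # joint prediction of both parameter streams
)

optimizer = torch.optim.Adam(acoustic_model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

ling = torch.randn(256, ling_dim)                 # a batch of frame-level linguistic features
target = torch.randn(256, spec_dim + exc_dim)     # corresponding acoustic parameters
optimizer.zero_grad()
pred = acoustic_model(ling)
loss = loss_fn(pred, target)                      # frame-wise regression objective
loss.backward()
optimizer.step()

spec, exc = pred[:, :spec_dim], pred[:, spec_dim:]  # split back into the two streams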
As a fast-growing domain, deep learning has recently been applied to musical audio synthesis and music composition. A fundamental motivation is to learn a representation of diverse musical styles or musical content from several widely available databases. Generally, research focuses on four main dimensions of music generation: the objective (e.g., a melody or a sequence of chords), the representation (e.g., the raw musical signal or spectral features), the architecture (e.g., DNNs, CNNs or RNNs) and the learning strategy [17]. In [50], an autoencoder-based synthesizer was proposed for compressing and reconstructing magnitude short-time Fourier transform frames. This synthesizer was re-trained for music synthesis applications and several