IEEE Circuits and Systems Magazine - Q4 2019 - 30

first RBM as the input. New weights and biases are assigned to the new visible layer, and the new RBM is trained
with the CD algorithm again. This process is repeated until
a desired stopping criterion is reached.
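The greedy layer-wise procedure described above can be sketched as follows. This is a minimal illustration and not the implementation of any cited work: it assumes Bernoulli (binary) units, one-step contrastive divergence (CD-1), full-batch updates, and a fixed number of epochs as the stopping criterion. The function names `train_rbm_cd1` and `train_dbn` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, epochs=10, lr=0.1, rng=None):
    """Train a Bernoulli RBM with one-step contrastive divergence (CD-1)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)  # visible biases
    b_h = np.zeros(n_hidden)   # hidden biases
    n = data.shape[0]
    for _ in range(epochs):  # assumed stopping criterion: fixed epoch count
        # Positive phase: hidden probabilities given the data.
        h_prob = sigmoid(data @ W + b_h)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one Gibbs step down to the visible layer and up again.
        v_recon = sigmoid(h_sample @ W.T + b_v)
        h_recon = sigmoid(v_recon @ W + b_h)
        # CD-1 update: data statistics minus reconstruction statistics.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n
        b_v += lr * (data - v_recon).mean(axis=0)
        b_h += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_v, b_h

def train_dbn(data, layer_sizes, **kwargs):
    """Greedy stacking: each RBM is trained on the hidden activations of the previous one."""
    layers, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b_v, b_h = train_rbm_cd1(layer_input, n_hidden, **kwargs)
        layers.append((W, b_v, b_h))
        # The hidden activations become the visible input of the next RBM.
        layer_input = sigmoid(layer_input @ W + b_h)
    return layers
```

For example, `train_dbn(data, [4, 3])` on a binary array of shape `(n_samples, 6)` would yield two RBMs with weight matrices of shape `(6, 4)` and `(4, 3)`.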
Note that DBNs can also be used for
supervised learning by adding a final layer of variables
that represent the desired outputs and backpropagating
the error derivatives. Alternatively, the
weights of a trained DBN can be used to pre-train an NN
for classification tasks.
B. Applications
1) Speech Generation
Recently, generative models have been adopted as powerful tools for speech waveform production. WaveNet
[88], [89], proposed by the DeepMind group of Google,
has caught the attention of the speech synthesis community. WaveNet is a deep generative model of raw
audio waveforms that can produce very natural-sounding human
speech. It is a fully convolutional neural network whose layers use various dilation
factors, so the receptive field grows exponentially
with the depth of the network and spans a large number
of timesteps. For training, raw recordings are fed as input sequences. Synthetic utterances can be generated
by sampling the trained network. Note that for producing meaningful utterances, the network's predictions
are conditioned on both the audio samples and the
text to be spoken. By conditioning the network on the
identity of the speaker, one can use WaveNet to generate the same sentence in different voices. Interestingly,
WaveNet can also be applied to model other kinds of
acoustic signals, such as music. Since being
integrated into the Google Assistant application, WaveNet
has received a lot of positive feedback [89].
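The exponential growth of the receptive field with depth can be checked with a short calculation. The sketch below assumes a stack of dilated causal convolutions with kernel size 2 and a dilation factor that doubles at every layer; both choices are illustrative assumptions, not details taken from the text above, and `receptive_field` is a hypothetical helper.

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field (in samples) of a stack of dilated causal convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d samples.
    """
    return 1 + (kernel_size - 1) * sum(dilations)

# Assumed schedule: the dilation doubles at every layer (1, 2, 4, ..., 512).
doubling = [2 ** i for i in range(10)]
print(receptive_field(doubling))    # 1024 samples for 10 layers

# Without dilation, the same 10 layers cover far fewer samples.
print(receptive_field([1] * 10))    # 11 samples
```

With doubling dilations, each additional layer roughly doubles the number of past timesteps that influence a prediction, whereas undilated layers add only one sample each.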
As production-quality speech synthesis systems, WaveNet and its variants have been adopted
by many companies and research labs. Deep Voice, a
truly end-to-end neural speech synthesizer employing
WaveNet, was proposed by the Baidu Silicon Valley
AI Lab in 2017 [90]-[92]. The audio synthesis model in this system is a variant of WaveNet
that requires fewer parameters. In Deep Voice 2,
Tacotron [93], a text-to-speech (TTS) synthesis system,
was combined with a WaveNet-based spectrogram-to-audio vocoder. Evaluation results demonstrated that
high-quality speech can be achieved by this integrated
TTS system. Similar work describes the natural TTS synthesizer Tacotron 2 [94]. Tacotron 2 is
composed of a recurrent network, which maps character embeddings to Mel-scale spectrograms,
and a modified WaveNet model, which acts as a vocoder to generate time-domain waveforms from those spectrograms. This model achieves a mean opinion score
(MOS) of 4.53 out of 5.00, which indicates very high-fidelity
generated speech.
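For readers unfamiliar with the Mel scale mentioned above, the following sketch shows one common Hz-to-mel conversion and how band edges spaced equally in mels become progressively wider in Hz. The text does not state which mel formula Tacotron 2 uses; the 2595 log10(1 + f/700) convention and the helper names here are assumptions chosen for illustration.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to mels (2595 * log10(1 + f / 700) convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_bands):
    """Band edges in Hz, equally spaced on the mel scale between f_min and f_max."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]
```

Calling `mel_band_edges(0.0, 8000.0, 10)` returns 12 edges whose spacing in Hz increases with frequency, which is why a Mel-scale spectrogram devotes more resolution to low frequencies.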
In [45], the RBM and DBN were applied to represent
the distribution of the low-level spectral envelopes
used for HMM-based parametric speech synthesis. Experimental results showed that both modified methods
generated spectral envelope parameter sequences better than the conventional Gaussian-HMM-based
approach. In [95], a deep autoencoder structure
was proposed to extract robust spectral features for statistical parametric speech synthesis systems. By using
the autoencoder, the original high-dimensional spectral envelope can be compressed into low-dimensional features without degradation. A similar feature extraction
procedure is also explored in [83], [96], [97].
For voice conversion (VC), generative algorithms have
been applied in many frameworks and systems to improve naturalness, clarity, and speaker individuality.
In [49], a regression function constructed by a stacked
joint-autoencoder was applied to a voice conversion
task. Subjective listening tests indicated
that the proposed approach achieves higher quality and
similarity than another system based on DNNs.
In [98], a statistical voice conversion technique with
WaveNet-based waveform generation was introduced.
The waveform samples of the converted voice were created by a WaveNet vocoder conditioned on the converted acoustic features. The experimental results showed
that the proposed VC method achieved a higher conversion accuracy on speaker individuality than conventional VC techniques.

Figure 12. An example of an RBM with a visible layer (v1, v2, v3) and a hidden layer (h1, h2).

2) Other Types of Audio Signal
In [50], an autoencoder-based music synthesizer was
presented. The autoencoder had a
four-layer deep topology, and both sigmoid and ReLU
activations were used. In [99] and [100], a WaveNet
architecture was used to train raw audio models to