IEEE Circuits and Systems Magazine - Q4 2019 - 33
generate globally coherent, high-resolution samples from datasets with high variability [108]. To address this problem, the AC-GAN employs label conditioning, which yields high-resolution image samples exhibiting global coherence [104], [120], [121]. Using a new quantitative metric proposed for image discriminability, the samples generated by the AC-GAN are shown to be more discriminable than those of previous models, which create lower-resolution images and perform a naive resize operation. The generated samples also exhibit diversity comparable to that of the training data.
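The split objective that makes the AC-GAN work can be sketched numerically. In the AC-GAN, the discriminator predicts both the source (real vs. fake) and the class label; the source term is played adversarially while both players try to maximize the classification term. The probabilities below are toy values, not outputs of a trained model:

```python
import numpy as np

# Hedged sketch of the AC-GAN objective split (illustrative, not the
# authors' code): L_S is the log-likelihood of correct real/fake calls,
# L_C the log-likelihood of correct class labels.

def log_likelihood(p):
    """Sum of natural-log likelihoods of the assigned probabilities."""
    return float(np.sum(np.log(p)))

# Toy discriminator outputs for a batch of 3 samples:
p_source_correct = np.array([0.9, 0.8, 0.7])  # P(correct source call)
p_class_correct = np.array([0.6, 0.5, 0.9])   # P(correct class label)

L_S = log_likelihood(p_source_correct)  # source (real/fake) term
L_C = log_likelihood(p_class_correct)   # auxiliary classification term

# Discriminator maximizes L_S + L_C; generator maximizes L_C - L_S.
d_objective = L_S + L_C
g_objective = L_C - L_S
```

Both players sharing the classification term is what pushes the generator toward class-coherent, discriminable samples.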
■ CycleGAN [122]: Analogous to language translation, image-to-image translation is the problem of translating an image from one representation of a given scene to another, e.g., gray-scale to color, image to semantic labels, or edge map to photograph [122]. The CycleGAN realizes such applications by learning the mapping between an input image and an output image, regardless of whether paired training samples are available. Comparisons against previous methods demonstrate that the CycleGAN outperforms other approaches in quantitative experiments. The CycleGAN can be applied to image processing problems such as season transfer, collection style transfer, and photo generation from paintings. A related work is [110].
■ WaveGAN [123]: This is an early application of GANs to audio synthesis. The WaveGAN is proposed for raw audio synthesis in an unsupervised setting. Test results suggest that the WaveGAN captures semantically meaningful modes for small-vocabulary speech (such as the SC09 dataset analyzed in the WaveGAN work).
C. Applications in Audio Generation
Since GANs have been widely applied for image generation and synthesis, it makes sense to consider them for speech synthesis. In [123], the authors applied GANs to synthesize raw audio by introducing the WaveGAN, a time-domain approach, and the SpecGAN, a frequency-domain approach. The WaveGAN was built on the DCGAN, modifying the transposed convolution operation to widen the receptive field for time-domain signals; other hyper-parameters remained the same as in the
DCGAN. Experiments showed that the WaveGAN can produce intelligible words and even audio from other domains, such as bird vocalizations and piano. Subjective evaluation demonstrated a preference for the samples generated by the WaveGAN.
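The effect of widening the transposed convolution can be seen from the output-length formula for a 1D transposed convolution (no padding), L_out = (L_in − 1) · stride + kernel. The kernel/stride values below (length-25, stride-4 filters versus 5×5, stride-2 DCGAN kernels flattened to 1D) follow the WaveGAN paper, but treat them as illustrative:

```python
# Quick check of how widening the transposed convolution enlarges the
# per-layer output, and hence the receptive field, for 1D audio.

def transposed_conv1d_out_len(l_in, kernel, stride):
    """Output length of a 1D transposed convolution without padding."""
    return (l_in - 1) * stride + kernel

# One layer upsampling a length-16 feature map:
dcgan_like = transposed_conv1d_out_len(16, kernel=5, stride=2)     # 35
wavegan_like = transposed_conv1d_out_len(16, kernel=25, stride=4)  # 85
```

Stacking several such layers is what lets the generator reach the tens of thousands of samples needed for a second of raw audio.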
In [109], the emerging topic of using GANs to synthesize speech for attacking automatic speaker recognition systems was investigated. Various state-of-the-art GANs were examined by fooling a CNN-based text-independent speaker recognizer with generated Mel-spectrograms. For targeted attacks, a modified objective function was proposed to access universal properties of speech. By applying the WGAN-GP with the modified mixed loss function, the model was able to differentiate real samples from a target speaker from real speech samples of other speakers. The resulting adversarial examples performed well in both targeted and untargeted attacks on the speaker recognition system.
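The targeted/untargeted distinction can be stated as two different success criteria over the classifier's per-speaker scores. This toy check is not the GAN objective of [109], only the goal its generated spectrograms are optimized toward:

```python
import numpy as np

# Illustrative success criteria for attacks on a speaker classifier.
# `scores` are hypothetical per-speaker logits for one input.

def attack_succeeds(scores, true_spk, target_spk=None):
    """Untargeted: any misclassification; targeted: hit a chosen speaker."""
    pred = int(np.argmax(scores))
    if target_spk is None:
        return pred != true_spk
    return pred == target_spk

scores = np.array([0.1, 2.3, 0.4])  # classifier favors speaker 1
```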
A CycleGAN was used in [124], with six-layer fully connected neural networks as the generator and the discriminator. The feature used for training was a Mel-spectrogram together with its first and second derivatives. A WaveNet vocoder was trained to form the speech waveform. Perceptual evaluations suggested that effective enhancement and improved perceptual cleanliness were achieved with the help of GAN-based models. The authors also investigated the quality of the generated speech with publicly available data.
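The feature stack described above can be sketched as a Mel-spectrogram augmented with its first and second time derivatives. The simple frame difference used here is an assumption for illustration; speech toolkits often compute deltas with a regression window instead:

```python
import numpy as np

# Hedged sketch: stack a (mel_bins x frames) spectrogram with its
# first and second frame-to-frame differences along the feature axis.

def add_deltas(mel):
    """Return mel features stacked with 1st/2nd difference features."""
    d1 = np.diff(mel, axis=1, prepend=mel[:, :1])  # first derivative
    d2 = np.diff(d1, axis=1, prepend=d1[:, :1])    # second derivative
    return np.concatenate([mel, d1, d2], axis=0)

mel = np.random.rand(80, 100)  # 80 mel bins, 100 frames (toy sizes)
features = add_deltas(mel)     # shape (240, 100)
```

The derivatives give the networks explicit access to the temporal dynamics that a frame-wise spectrogram alone does not expose.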
Recently, GAN-based architectures have also been explored for the music generation problem. In [125], the generator was composed of CNNs for yielding melodies in the symbolic domain, while the discriminator was trained to learn the distributions of melodies. This proposed GAN, named MidiNet, can generate melodies from scratch or by conditioning on the melody of previous bars. In another work [126], three models were proposed for symbolic multi-track music generation under the framework called MuseGAN.8 These models were trained on a rock music dataset and used for piano-roll generation across multiple tracks (such as bass, drums, and strings). Interestingly, the models can be extended to generate additional tracks to accompany a given track composed by a human.
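The piano-roll representation underlying MuseGAN-style models is a binary tensor with one activation grid per track; a 1 marks a pitch sounding at a time step. The track names and grid sizes below are illustrative, not the paper's exact configuration:

```python
import numpy as np

# Minimal multi-track piano-roll: shape (tracks, time_steps, pitches).

TRACKS = ["bass", "drums", "strings"]
TIME_STEPS, PITCHES = 96, 128  # e.g. one bar at 96 ticks, MIDI pitches

roll = np.zeros((len(TRACKS), TIME_STEPS, PITCHES), dtype=np.uint8)

def add_note(roll, track, start, end, pitch):
    """Activate `pitch` on `track` from tick `start` to `end` (exclusive)."""
    roll[TRACKS.index(track), start:end, pitch] = 1

add_note(roll, "bass", 0, 48, 36)     # a held low C on the bass track
add_note(roll, "strings", 0, 96, 60)  # middle C sustained by strings
```

Generating "additional tracks to accompany a given track" then amounts to conditioning the generator on one slice of this tensor and sampling the others.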
Another emerging application of GANs is data augmentation for speech/acoustic signal processing. In [127],
8 https://salu133445.github.io/musegan/