Signal Processing - July 2017 - 108

■ speaker identification/verification, which targets obtaining the speaker's identity from speech signals
■ computational paralinguistics, which attempts to distill nonlinguistic information mainly concerning the speaker's short-term states (e.g., emotions), medium-term states (e.g., health condition and attitude), and long-term traits (e.g., personality, age, and gender) from spoken speech.

A serious obstacle to the broad application of ASA is the lack of sufficiently labeled data in terms of both quantity and quality. For example, many available computational paralinguistics corpora contain only a few hours of audio data at most [3]. Similarly for ASR, many of the world's languages are in a low-resource setting, where the electronic speech resources and linguistic expertise are lacking. According to a 2010 United Nations Educational, Scientific, and Cultural Organization report [4], approximately 2,500 languages are in danger of becoming extinct. In this scenario, it is exceptionally difficult to obtain a large-scale amount of transcribed speech data to perform reliable ASR.

The requirement for large-scale labeled data is not new in machine learning. Prevailing paradigms are often conducted in a supervised manner, and a substantial increase in the amount of available training data usually brings encouraging performance improvements [5]. Because of the advancement of deep-learning technologies [6], [7], this need for data has become more compelling than ever. Deep-learning models are often designed with millions of parameters, and, if trained with insufficient amounts of data, are vulnerable to being trapped in a locally optimized minimum, resulting in overfitting to the training data [6]. When sufficiently trained, however, deep models reach unprecedented levels of performance. For example, Amodei et al. [7] utilized approximately 12,000 and 9,000 h of speech data to model English and Mandarin ASR systems, respectively, by employing deep-learning models with more than 35 million trainable parameters, achieving a performance breakthrough that exceeds the capability of even human perception. Sufficient and reliably labeled data, when available, provide the opportunity to train robust ASA models whose resulting recognition is largely invariant in the face of the abundance of acoustic variations naturally present in speech data.

Opportunities

Traditionally, tasks such as data collection and annotation have been performed by small groups of experts in a laboratory setting. This conventional work paradigm is often tedious, time-consuming, and costly. However, the ongoing information and communication technologies revolution and related technologies, such as the Internet of Things (IoT) and cloud computing, are providing us with opportunities to exploit larger amounts of speech data in more effective ways than ever before.

The IoT, as a global infrastructure of the information society, is expected to offer advanced services (i.e., data collection) by interconnecting a wide variety of contemporary recording devices, such as smartphones, wearable devices, and tablets. Furthermore, as these devices often have microphones, social media apps, and Internet connectivity, they can be considered distributed sensors or entryways for speech collection and processing. Thus, the advance of Internet technologies and the ubiquity of smart devices can drastically reduce the cost and time associated with collecting and processing speech data.

Cloud computing, or Internet-based computing, is expected to provide an on-demand computing resource. Thus, it gives an opportunity to store, access, and analyze the volume of speech data generated by the distributed devices mentioned previously. Cloud computing has been shown not only to minimize the costs associated with an ever-increasing demand for greater computational resources but also to reduce the cost associated with infrastructure maintenance and user access. Motivated by these advantages, most major speech technology providers have already shifted their primary research and application attention from embedded systems to cloud computing platforms.

Generalized automatic speech analysis: Problem statement and notation

The aforementioned technologies provide great potential to generate and process a large amount of speech data. However, there are three main challenges, namely data sparsity, unreliability, and nonmatching (Figure 1), that limit the dissemination of these data in research and industry. Before formally defining these challenges, we first overview the generalized mathematical problem statement and notation commonly used in both ASA and throughout the remainder of this article.

First, let us define a domain $\mathcal{D} = \{\mathcal{X}, P(X)\}$ that comprises a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, where $X$ denotes a set of feature vectors, i.e., $X = \{x_1, \ldots, x_n\} \in \mathcal{X}$, while $P(X)$ indicates the distribution of $X$ in $\mathcal{X}$. In the case that each feature vector $x$ consists of $d$ attributes, i.e., $x = \{x_1, \ldots, x_d\}$, $\mathcal{X}$ is a $d$-dimensional space. The most commonly used feature space $\mathcal{X}$ for ASA is arguably the Mel-frequency cepstral coefficients (MFCCs) that are extracted via filtering a speech frame by a bank of nonlinear bandpass filters (Mel filters) whose frequency response is based on the cochlea of the human auditory system [8]. Other exemplary feature spaces include the i-vector representation often used for speaker identification/verification [9], and mixed brute-force feature representations, such as the broadly used ComParE feature set, which contains 6,373 static features (i.e., statistical functionals including mean and variance) of low-level descriptor (LLD) contours (i.e., MFCCs) often used in tasks such as recognition of emotion from speech [10].

We further define a generic ASA task $\mathcal{F} = \{\mathcal{Y}, f(\cdot)\}$ that consists of a label space $\mathcal{Y}$ and a predictive function $f(\cdot)$ (or a conditional distribution $P(Y \mid X)$). The goal of this task is to build an effective and robust predictive function $f(\cdot)$ that is capable of learning transformation rules from the feature space
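As a rough illustration of the notation introduced in this section, the following Python sketch turns variable-length LLD contours into fixed-length static feature vectors by applying two statistical functionals (mean and variance), and stacks them into a set X of n feature vectors with d attributes each. Everything here is an assumption made for illustration: the function name, the choice of 13 LLDs per frame, and the random numbers standing in for real descriptors. It is not the ComParE extraction pipeline.

```python
import numpy as np


def functionals(lld: np.ndarray) -> np.ndarray:
    """Map a (frames, n_lld) contour matrix to one static vector.

    The output concatenates the per-LLD means and the per-LLD
    (population) variances, so its length is 2 * n_lld.
    """
    if lld.ndim != 2 or lld.shape[0] == 0:
        raise ValueError("expected a non-empty (frames, n_lld) array")
    return np.concatenate([lld.mean(axis=0), lld.var(axis=0)])


# Hypothetical data: three utterances of different lengths,
# each described by 13 LLDs per frame (random stand-ins, not real MFCCs).
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(frames, 13)) for frames in (80, 120, 95)]

# X holds n = 3 feature vectors, each with d = 26 attributes.
X = np.stack([functionals(u) for u in utterances])
print(X.shape)  # (3, 26)
```

Because the functionals summarize each contour over time, utterances of different durations map to vectors of the same dimensionality d, which is what lets them share one feature space.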

IEEE SIGNAL PROCESSING MAGAZINE | July 2017


