Signal Processing - January 2017 - 73

and generates a response. For voice queries, the first step is to
recognize the spoken words [9], [10]. The LU component takes
the speech transcription (or the text input if the user types) and
performs a semantic analysis to determine the underlying user
intent [11]-[14], [17]. The user's intent could be related to information search, QA, chitchat, or task-oriented specialist dialogs.
Because PDAs support multidomain and multiturn interactions,
multiple alternate semantic analyses (typically at least one for
each domain) are generated in parallel for late binding on the
user's intent [15]. These semantic analyses are sent to the dialog
state update component, which includes slot carryover (SCO)
[38], flexible item selection from a list [39], knowledge fetch
from the service providers, and dialog hypothesis generation
[15], [16]. Note that in this framework, we consider chitchat,
QA, and web search as an additional set of LU domains. All the
dialog hypotheses are ranked by the hypothesis ranking (HR)
module. The top hypothesis is selected by the hypothesis selection (HS) module by taking the provider responses (i.e., knowledge results) into account [18]. The top dialog hypothesis (along
with the ranked dialog hypothesis distribution) is the input to
the dialog policy component, which determines the system
response based on the scenario and business logic constraints.
Typically, for voice input, the agent speaks the natural language
response via the TTS synthesis engine [19].
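As a concrete illustration of the slot carryover (SCO) step in the dialog state update, the sketch below carries slots from a previous turn into an underspecified follow-up query. The function and frame layout are hypothetical, not the production component's API:

```python
# A minimal slot carryover (SCO) sketch: slots from earlier turns are
# carried into the current turn's semantic frame when the current query
# leaves them unspecified. All names here are illustrative.
from typing import Dict

def carry_over_slots(previous_frame: Dict[str, str],
                     current_frame: Dict[str, str]) -> Dict[str, str]:
    """Fill slots missing from the current frame with values
    remembered from the previous turn."""
    merged = dict(previous_frame)
    merged.update(current_frame)  # current turn's slots take precedence
    return merged

# Turn 1: "What's the weather in Seattle?"
turn1 = {"intent": "get_weather", "location": "Seattle", "date": "today"}
# Turn 2: "How about tomorrow?" -- location is left unspecified
turn2 = {"intent": "get_weather", "date": "tomorrow"}

print(carry_over_slots(turn1, turn2))
# {'intent': 'get_weather', 'location': 'Seattle', 'date': 'tomorrow'}
```

In a deployed system the carryover decision itself is learned [38] rather than an unconditional merge as above.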
The reactive assistance behavior is governed by (3). The goal
of the reactive agent is to provide the best system response, R_t,
to a given user query, Q. The system response, R, consists of a
dialog act, which includes a system action (e.g., information to be
displayed, a question to be asked, or an action to be executed), a natural
language prompt, and a card in which the response is displayed:

    R_t = \arg\max_R P(R | Q, B_{A_1}, \ldots, B_{A_N}, B_r),    (3)


where B_{A_1} denotes the current belief about the dialog state of
answer A_1 (e.g., weather, alarm, places, reminder, sports,
etc.) after processing query Q, and B_r denotes the system's
belief about the state of the interaction across all answers for
the current session. In practice, it is hard to solve (3). Instead, a
suboptimal solution can be achieved with the assumption that,
given the query Q and the beliefs for the dialog states of the
individual answers A_1 through A_N, the per-answer response is
conditionally independent:

    R_t = \arg\max_{R \in \{R_1, \ldots, R_N\}} \{P(R | Q, B_{A_1}, B_r), \ldots, P(R | Q, B_{A_N}, B_r)\}.    (4)


Here, P(R | Q, B_{A_i}, B_r) denotes the probability the system
assigns to the response R generated by answer A_i, given the
answer's belief about its dialog state and the system's belief,
B_r. This formulation allows the individual answers to manage
their own dialog state and generate their own responses in parallel. Therefore, it is possible to scale to many domains and
answers without substantially increasing the overall system
response latency. The HR component operates as a metalayer,
arbitrating between different answer responses, given its belief
(i.e., B_r) about the state of the overall interaction [15]. Next, we
will briefly describe the key components in Figure 4.
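Under the conditional-independence assumption of (4), the per-answer responses can be generated in parallel and then arbitrated by a ranking layer. A minimal sketch, with invented answer names and fixed scores standing in for P(R | Q, B_{A_i}, B_r):

```python
# Sketch of the suboptimal decision rule in (4): each answer scores its
# own candidate response in parallel, and the ranking layer picks the
# highest-scoring one. Answers, responses, and scores are illustrative.
from concurrent.futures import ThreadPoolExecutor

def respond(answer, query):
    """Hypothetical per-answer scorer: returns (response, score),
    where score stands in for P(R | Q, B_Ai, Br)."""
    canned = {
        "weather":  ("It is 15 degrees and cloudy.", 0.81),
        "alarm":    ("No alarms are set.",           0.07),
        "chitchat": ("Happy to chat!",               0.32),
    }
    return canned[answer]

query = "what's the weather like"
answers = ["weather", "alarm", "chitchat"]
# Each answer runs independently, so latency stays near the slowest
# single answer rather than growing with the number of domains.
with ThreadPoolExecutor() as pool:
    scored = list(pool.map(lambda a: respond(a, query), answers))

best_response, best_score = max(scored, key=lambda rs: rs[1])
print(best_response)  # It is 15 degrees and cloudy.
```

In the real system the arbitration is done by the machine-learned HR component using B_r, not a bare max over scores.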

Speech recognition
The speech recognition component maps human speech, represented as acoustic signals, to a sequence of words represented as
text. Let X denote the acoustic observations in the form of a feature vector sequence and Q be the corresponding word sequence
(i.e., query). The speech recognition decoder chooses the word
sequence, \hat{Q}, with the maximum a posteriori probability according to the fundamental equation of speech recognition [9]:

    \hat{Q} = \arg\max_Q P(X | Q) P(Q),    (5)


where P(X | Q) and P(Q) are the probabilities generated by
the acoustic and language models, respectively. Traditionally,
speech recognition systems are trained to optimize the lexical
form. However, displaying the grammatically and semantically
correct version of the output (i.e., display form) has become an
important requirement for PDAs, because it makes it easy for
the user to infer whether the system correctly heard and recognized the spoken query. For example, the following two speech
recognition outputs are lexically equivalent:
■ how is the traffic on u s one oh one (lexical form)
■ how is the traffic on US 101 (display form).
However, proper tokenization in the second hypothesis provides a valuable hint that the agent understood what the user
meant. Typically, the tokenization is applied as a separate postprocessing module after running the speech recognition decoder.
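A toy version of such a postprocessing (inverse text normalization) step, assuming a couple of hand-written rules sufficient for the example above; real display-form models cover far more phenomena (dates, currency, casing):

```python
# Toy inverse text normalization: lexical form -> display form,
# as in "u s one oh one" -> "US 101". The rules below are illustrative
# only; production postprocessors are far richer.
def to_display_form(lexical: str) -> str:
    digits = {"zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
              "four": "4", "five": "5", "six": "6", "seven": "7",
              "eight": "8", "nine": "9"}
    tokens = lexical.split()
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "u" and i + 1 < len(tokens) and tokens[i + 1] == "s":
            out.append("US")  # collapse the spelled-out letters
            i += 2
        elif tokens[i] in digits:
            num = ""
            while i < len(tokens) and tokens[i] in digits:
                num += digits[tokens[i]]  # join consecutive spoken digits
                i += 1
            out.append(num)
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

print(to_display_form("how is the traffic on u s one oh one"))
# how is the traffic on US 101
```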
In recent years, advances in deep learning and its application to speech recognition have dramatically improved state-of-the-art speech recognition accuracy [9], [10], [20], [21].
Deep learning allows computational models that are composed
of multiple processing layers to learn representations of data
with multiple levels of abstraction. These advances played a
key role in the adoption of PDAs by a large number of users,
making them a mainstream product.

LU
The problem of LU for PDAs is one of multidomain, multiturn,
contextual query understanding [17], [22]-[25], subject to the
constraints of the back-end data sources and the applications
in terms of the filters they support and actions they execute.
These constraints are represented in a schema. In practice,
while LU semantically parses and analyzes the query, it does
not do so according to a natural LU theory [26]; rather, parsing and analysis are done according to the specific user
experience and scenarios to be supported. This is where the
semantic schema comes into play, as it captures the constraints of the back-end knowledge sources and service APIs,
while allowing free-form natural language expressions to
represent different user intents unambiguously.
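One possible shape for such a semantic schema, sketched as a Python structure for an invented weather domain; the field names and the validation helper are illustrative, not a real schema format:

```python
# Hypothetical semantic schema: intents map to back-end API actions,
# slots map to the filters the service supports. Illustrative only.
weather_schema = {
    "domain": "weather",
    "intents": {
        "get_forecast": {"action": "weather_api.forecast",
                         "required_slots": ["location"],
                         "optional_slots": ["date"]},
    },
    "slots": {
        "location": {"type": "place_name"},
        "date":     {"type": "datetime", "default": "today"},
    },
}

def satisfies_schema(frame, schema):
    """Check that a semantic frame respects the schema's constraints:
    known intent and all required slots present."""
    spec = schema["intents"].get(frame.get("intent"))
    if spec is None:
        return False
    return all(s in frame.get("slots", {}) for s in spec["required_slots"])

print(satisfies_schema(
    {"intent": "get_forecast", "slots": {"location": "Paris"}},
    weather_schema))  # True
```

Tying LU output to such a schema is what keeps the parses executable against the back-end APIs rather than merely linguistically plausible.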
There are two main approaches to LU: rule based and
machine learned [11], [32]. The rule-based approach hand-authors
a set of rules to semantically parse the query
[27]. It can also be used to address the errors and disfluencies introduced by a speech recognizer [28], [49].
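A minimal rule-based parser in this spirit, using hand-authored regular expressions; the patterns and intent names are illustrative, and production rule systems are much larger:

```python
# Toy rule-based semantic parser: hand-authored patterns map a query
# directly to an intent and its slots. Illustrative only.
import re

RULES = [
    (re.compile(r"set an alarm for (?P<time>.+)"), "set_alarm"),
    (re.compile(r"(?:what's|how's) the weather in (?P<location>.+)"),
     "get_weather"),
]

def parse(query: str):
    for pattern, intent in RULES:
        m = pattern.search(query.lower())
        if m:
            return {"intent": intent, "slots": m.groupdict()}
    return {"intent": "unknown", "slots": {}}

print(parse("Set an alarm for 7 a.m."))
# {'intent': 'set_alarm', 'slots': {'time': '7 a.m.'}}
```

The brittleness of such patterns under paraphrase and recognition errors is precisely why state-of-the-art systems favor the machine-learned models discussed next.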
State-of-the-art systems use machine-learned models for LU
[12], [17], [23], [24]. In a commonly used LU architecture,
