Computational Intelligence - August 2017 - 26

results are slightly better than ours for in-domain training, our models are able to transfer weights and train a reasonable policy for
SfxH from SfxR alone.
The most relevant comparison to our
model in terms of multi-task learning is Wen
et al. [12]. The authors achieved a maximum BLEU score of
0.48 for domain transferred data, only 52% of our 0.92, but
they worked on the TV and laptop domains, thus not allowing a
direct comparison between results. The authors reported a
semantic error of 0.04. In contrast to our work, which transfers
learnt models across domains and natural data, Wen et al.'s
experiments are based on artificially generated data. These
often do not display the same variety and complexity as naturally occurring data, which arguably shows that our AMR-based inputs allow for more complex lexical-syntactic
constructions to be learnt.
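
For readers less familiar with the metric, the following is a minimal sketch of sentence-level BLEU (clipped n-gram precision combined with a brevity penalty), assuming a single reference, uniform weights and no smoothing. It is not the evaluation script used in our experiments or by Wen et al.; the function names `ngrams` and `bleu` and the candidate sentence are illustrative assumptions. The last lines recompute the relative score quoted above (0.48 against 0.92).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights, one reference, no smoothing."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # unsmoothed BLEU is zero if any precision is zero
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "try chinese restaurant kirin in the pacific heights area"
identical = bleu(reference, reference)  # candidate equals reference
# Hypothetical system output that drops one word; scored with BLEU-2.
partial = bleu("try restaurant kirin in the pacific heights area", reference, max_n=2)
ratio = 0.48 / 0.92  # relative score quoted in the text
print(round(identical, 2), round(partial, 2), round(ratio, 2))
```

A candidate identical to its reference scores 1.0, while dropping a word lowers both the bigram precision and the brevity penalty. Published scores are normally computed at corpus level, so this sketch only illustrates the mechanics.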

Learning from out-of-domain prior knowledge achieves better results than learning from a single domain
only. This seems to suggest that the pre-learnt weights from a
similar dataset are very valuable for the new domains. This is
particularly interesting for domains for which little training
data is available. We can also see from Table 2 that SfxH
achieves very decent performance without any in-domain data at
all, but based purely on training from SfxR. This is a remarkable result because it means that we can generate inputs for a
new domain based on no annotated training data at all. These
results clearly show the significance of using a common input
representation across domains. Abstracting away from particular
slots, such as "Kirin restaurant" or "Pacific Heights area", we
can reuse the lexical-syntactic patterns learnt in one domain in
others. Table 3 shows examples of the abstract patterns and
realizations that were transferred across domains.
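
The idea of abstracting away from particular slots can be illustrated with a small delexicalisation sketch. This is plain string substitution, not the AMR-based input representation used by our models; the slot names (`FOOD`, `NAME`, `AREA`), both helper functions and the second set of slot values are hypothetical.

```python
def delexicalise(utterance, slots):
    """Replace concrete slot values with placeholder tokens such as <NAME>."""
    # Substitute longer values first so a short value cannot split a longer one.
    for slot, value in sorted(slots.items(), key=lambda kv: -len(kv[1])):
        utterance = utterance.replace(value, f"<{slot}>")
    return utterance


def relexicalise(template, slots):
    """Fill placeholder tokens with the slot values of another input."""
    for slot, value in slots.items():
        template = template.replace(f"<{slot}>", value)
    return template


restaurant_slots = {"FOOD": "Chinese", "NAME": "Kirin", "AREA": "Pacific Heights"}
template = delexicalise(
    "Try Chinese restaurant Kirin in the Pacific Heights area.", restaurant_slots
)
print(template)

# Hypothetical slot values, used only to show that the pattern is reusable.
other_slots = {"FOOD": "Italian", "NAME": "Roma", "AREA": "Mission"}
reused = relexicalise(template, other_slots)
print(reused)
```

Once the slot values are removed, only the lexical-syntactic pattern remains, which is the part that can be shared between domains.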
Comparing with related work, Yu et al. [72] reported a
BLEU-1 score of 0.59 and a BLEU-2 score of 0.39 for RefCoco. This is substantially lower than our scores and might
reflect the increased difficulty in the scenario in Yu et al. [72]
who generated referring expressions directly from images. In
terms of navigation, most related work on the Sail data [24]-[26]
focuses on generating action sequences rather than the
actual route instructions. For spoken dialogue, Wen et al. [34]
achieve BLEU-4 scores of 0.73 and 0.83 for SfxR and SfxH,
respectively, with a semantic error rate of 0.046. While these

TABLE 3 Examples of lexical-syntactic patterns learnt in one
domain then used in another.
IMPERATIVE CLAUSE CONSTRUCTION (WITH RELATIVE CLAUSE)
(e1 / event :arg0 (y / you) :arg1 (b1 / obj :mod
property :location (w / obj :op1 (o1 / on [in]) ))
:mode imperative)
Give: "CLICK THE RED BUTTON (THAT IS) ON THE WALL."
SfxR: "TRY CHINESE RESTAURANT KIRIN IN THE PACIFIC HEIGHTS
AREA."
TRANSITIVE CLAUSE CONSTRUCTION
(e1 / event :arg0 (b1 / obj :mod property) :arg1
(b2 / obj :mod property))
Gre: "THE YELLOW SPHERE (THAT IS) TOUCHING THE BLUE BOX."
SfxR: "SOURCE RESTAURANT SERVES ITALIAN FOOD."
COMPLEX NOUN PHRASE AND SPATIAL RELATION AND TEMPORAL ADVERB
(e1 / obj :time (n / now) :domain (b1 / obj :mod
property :mod property :location (l / obj :op1 (n /
on [near, by]))) )
Gre: "NOW, THE BLUE CIRCLE ON THE GREEN SQUARE."
Gre: "NOW, THE GREEN BUTTON BY THE WINDOW."
SfxR: "NOW, AN INDIAN RESTAURANT NEAR PACIFIC HEIGHTS."

IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | AUGUST 2017

C. Results: Subjective Evaluation

To assess the subjective quality of the generated outputs, we
recruited 204 human judges from the CrowdFlower² and
AMT³ crowdsourcing platforms to rate them.
generated outputs. All judges were self-declared native or fluent speakers of English and rated altogether 3425 utterances
sampled randomly from a pool of 120 candidates per model.
To allow for a comparison with related work, we follow previous authors in asking judges to rate the naturalness of utterances. They were asked to agree with the statement "The utterance
is natural (i.e. could have been produced by a human)." on a scale of
1-5, where 1 is the worst score and 5 is the best. For each
dataset, we also collected an equal number of ratings for the
original human utterances to provide an upper bound for the
comparison of our systems. The results are shown in the rightmost column in Table 2. Medians are shown alongside averages
in parentheses. In a statistical analysis, we decided to focus on
the difference between out-of-domain training and training
with prior knowledge to see what effects can be gained by
having prior weights for training. The symbol * indicates statistical
significance at p < 0.05 according to a two-tailed Wilcoxon
signed-rank test. From the analysis we can see that two of the
six comparisons are statistically significant, namely Give with
Gre prior and Sail with Give prior, both in the navigation
domain. None of the other differences is significant. We
believe that these results are encouraging in that we did not
expect all differences to be statistically significant. For example,
while the absence of a significant difference between in-domain training,
out-of-domain training and training with prior knowledge does of course not
indicate equivalent policies or performance, it means at least that the
transfer of training data from one domain to another does not
lead to a significant deterioration of generated outputs.
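
As an illustration of the analysis described above, the following sketch reports ratings as median with mean in parentheses and runs an exact two-tailed Wilcoxon signed-rank test on paired ratings. It is a simplified stand-in for our analysis, not the script we used: the ratings are invented for the example, the pairing of ratings per utterance is an assumption, and the helpers `average_ranks` and `wilcoxon_signed_rank` are hypothetical names. The exact test enumerates all sign patterns, so it is only practical for small samples.

```python
from itertools import product
from statistics import mean, median


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Exact two-tailed Wilcoxon signed-rank test for paired samples.

    Zero differences are dropped. Returns (W, p), where W is the smaller of
    the positive and negative rank sums.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w = min(w_plus, total - w_plus)
    # Under H0 every sign pattern is equally likely; count those at least as extreme.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        s = sum(r for r, keep in zip(ranks, signs) if keep)
        if min(s, total - s) <= w:
            extreme += 1
    return w, extreme / 2 ** len(ranks)


# Invented 1-5 naturalness ratings for ten utterances (illustration only).
with_prior = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]
out_of_domain = [3, 4, 4, 2, 4, 3, 4, 4, 3, 3]

for name, ratings in (("with prior", with_prior), ("out-of-domain", out_of_domain)):
    print(f"{name}: {median(ratings)} ({mean(ratings):.2f})")

w, p = wilcoxon_signed_rank(with_prior, out_of_domain)
print(f"W = {w}, p = {p:.4f}", "*" if p < 0.05 else "")
```

For larger samples, a normal approximation or an established statistics library would be the more practical choice.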
The overall results correspond to the objective results. While
most of the subjective ratings are not as good as those received
² https://www.crowdflower.com/
³ https://www.mturk.com
