
former type. A recent survey by Bernardi et al. [24] reviews
both types in detail.
Methods that create a new description for a given image
from scratch can be said to have the three main component
steps mentioned in the introduction: (1) identification of
type and, optionally, location of objects and background/
scene in the image; (2) detection of attributes, relations and
activities involving objects from Step 1; and (3) generation
of a word string from a representation of the output from
Steps 1 and 2. For Step 1, some systems identify labelled
regions [25], [26], others directly map images to words [27].
For Step 2, systems determine object attributes [26], [28],
spatial relationships [29]-[31], activities [26], [30], etc. In
Step 3, systems differ in the amount of linguistic knowledge
they bring to bear on the generation process. Some view
the task as similar to a linearization problem where the aim
is to work out a likely string of words containing the labels,
relations and attributes from Steps 1 and 2 [27], [32]; others
employ templates into which the latter are slotted [29], [30], while still
others use grammar-based techniques to construct descriptions [33], [34].
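Viewed programmatically, Steps 1-3 form a simple pipeline. The following is a minimal sketch of that decomposition (the types, the toy "next to" rule and the template realiser are our own illustrative assumptions, not a description of any of the cited systems):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Obj:
    label: str                              # e.g. "dog"
    box: Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def detect_objects(image) -> List[Obj]:
    """Step 1 (stub): identify object classes and, optionally, locations."""
    raise NotImplementedError  # placeholder for a real object detector

def detect_relations(objs: List[Obj]) -> List[Tuple[Obj, str, Obj]]:
    """Step 2 (toy rule): relate two objects if their centres are close."""
    rels = []
    for i, a in enumerate(objs):
        for b in objs[i + 1:]:
            ax = (a.box[0] + a.box[2]) / 2
            bx = (b.box[0] + b.box[2]) / 2
            if abs(ax - bx) < a.box[2] - a.box[0]:
                rels.append((a, "next to", b))
    return rels

def generate(rels: List[Tuple[Obj, str, Obj]]) -> str:
    """Step 3 (template-based): realise each relation as a clause."""
    return " ".join(f"A {a.label} is {prep} a {b.label}." for a, prep, b in rels)

def describe(image) -> str:
    return generate(detect_relations(detect_objects(image)))
```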
Identifying the spatial relationships between pairs of objects
in images is an important part of image description, but it is rarely addressed as a separate subtask in its own right. If a method produces spatial prepositions, it tends to do so as a side-effect of the overall method [33], [35], or else the relationships hold not between objects but, e.g., between objects and the scene [29].
An example of preposition selection as a separate subtask is the work of Elliott & Keller [30], who base the mapping on manually composed rules. Spatial relations also play a role in referring expression generation [36], [37], where, however, the problem is often simplified to a content selection task over known symbolic representations of objects and scene.
Most closely related to our work for Step 2 is work by
Ramisa et al., 2015 [38] and Hürlimann & Bos, 2016 [39]. In
both, various visual and verbal features computed for a given
image are used to predict prepositions to describe the spatial
relations between a pair of objects in the image. We have ourselves previously reported work on predicting English [40], [41]
and French [31], [42] prepositions.
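Both [38], [39] and our own earlier work frame Step 2 as supervised classification: features computed for an object pair go in, a preposition comes out. Below is a minimal sketch of that framing, assuming simple geometric features over the two bounding boxes and a scikit-learn classifier (the feature set and classifier are our own illustrative choices, not the exact configuration of any of the cited papers):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(box_a, box_b, img_w, img_h):
    """Geometric features for one (trajector, landmark) pair of boxes,
    each given as (xmin, ymin, xmax, ymax). Illustrative choice only."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    ow = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    oh = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    iou = ow * oh / max(area_a + area_b - ow * oh, 1e-9)
    return np.array([(ax - bx) / img_w,          # normalised centre offset
                     (ay - by) / img_h,
                     area_a / (img_w * img_h),   # relative object sizes
                     area_b / (img_w * img_h),
                     iou])                       # degree of overlap

# Verbal features such as the two object-class labels would be appended
# one-hot encoded; X is one row per annotated pair, y the chosen preposition.
clf = LogisticRegression(max_iter=1000)
# clf.fit(X, y); clf.predict(pair_features(...).reshape(1, -1))
```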
IV. Image Data and Annotations

Our main data source is the VOC'08 corpus of images [43] in
which objects have been annotated with rectangular bounding
boxes and object class labels. We collected additional annotations for images (Section IV-C) which list, for each object pair,
a set of prepositions selected by human annotators as correctly
describing the spatial relationship between the objects.
A. Source Data Sets

VOC'08 9K: The data from the PASCAL VOC 2008 Shared
Task Competition (VOC'08) consists of 8,776 images and
20,739 objects in 20 object classes. In each image, every object
belonging to one of the 20 VOC'08 object classes is annotated
for class, bounding box, viewpoint, truncation, occlusion, and identification difficulty [43], examples of all of which can be seen in Figure 1.³ Of these annotations we use the following:
❏ class: aeroplane, bird, bicycle, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, tv/monitor.
❏ bounding box: an axis-aligned box surrounding the extent of the object visible in the image (both fields can be read from the standard VOC XML annotation files, as sketched below).
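A minimal sketch of reading these two fields (the tag names follow the published VOC annotation schema; error handling omitted):

```python
import xml.etree.ElementTree as ET

def read_voc_annotation(xml_path):
    """Return the class label and bounding box of every annotated object."""
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        objects.append({
            "class": obj.findtext("name"),
            "box": tuple(int(float(bb.findtext(k)))
                         for k in ("xmin", "ymin", "xmax", "ymax")),
        })
    return objects
```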
VOC'08 1K: Using Mechanical Turk, Rashtchian et al. [4] collected five descriptions each for 1,000 VOC'08 images, selected at random from the larger 9K set (see above) while ensuring that there were 50 images from each VOC'08 class. Contributors had to have high hit rates and pass a language competence test before creating descriptions, leading to relatively high quality with few grammatical or spelling mistakes. See Figure 1 for an example.
B. Spatial Relations for Annotation

In order to determine the set of spatial relations (SRs) to be used by our annotators, we proceeded as follows. From the VOC'08 1K data set we obtained a set of candidate prepositions by parsing the 5,000 descriptions with the Stanford Parser version 3.5.2⁴ with the PCFG model, extracting the prepositional modifier (nmod:prep) relations, and manually removing the non-spatial ones. This gave us a set of 38 English prepositions.
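For illustration, an approximation of this extraction step can be written with spaCy in place of the Stanford Parser used above (spaCy labels a preposition "prep" and its nominal object "pobj"; being a different parser and model, its output will not match ours exactly):

```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def candidate_prepositions(descriptions):
    """Count prepositions heading a prepositional phrase with a nominal object."""
    counts = Counter()
    for doc in nlp.pipe(descriptions):
        for tok in doc:
            if tok.dep_ == "prep" and any(c.dep_ == "pobj" for c in tok.children):
                counts[tok.text.lower()] += 1
    return counts

# e.g. candidate_prepositions(["A dog is lying under the table."])
# -> Counter({'under': 1}); non-spatial candidates are then removed by hand
```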
In order to obtain an analogous set of prepositions for French, we first asked two French native speakers to compile a list of possible translations of the English prepositions, and to check these against 200 random example images from our corpus. The full list for French had 21 prepositions; these were reduced to a smaller set, on the basis of an earlier batch of annotations [42], by eliminating (i) prepositions that were used fewer than three times by annotators (en haut de, parmi), and (ii) those which co-occur with another preposition more than 60%⁵ of the times they occur in total (à l'intérieur de, en dessous de), in accordance with the general sense of synonymity defined in Section II. We found this kind of co-occurrence to be highly imbalanced: e.g., the likelihood of seeing à l'intérieur de given dans is 0.43, whereas the likelihood of seeing dans given à l'intérieur de is 0.91. We take this as justification for merging à l'intérieur de into dans, rather than the other way around. The whole process leaves a set of 17 French prepositions:
VF = {à côté de, à l'extérieur de, au dessus de, au niveau de, autour de, contre, dans, derrière, devant, en face de, en travers de, le long de, loin de, par delà, près de, sous, sur}
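The directional co-occurrence criterion behind merges like à l'intérieur de → dans can be made concrete as follows; a minimal sketch, assuming the annotations are available as one set of chosen prepositions per object pair (function names are our own):

```python
from itertools import combinations

def cooccurrence_ratio(p, q, pair_sets):
    """Estimate P(p also chosen | q chosen) over per-pair preposition sets."""
    with_q = [s for s in pair_sets if q in s]
    return sum(p in s for s in with_q) / max(len(with_q), 1)

def merge_candidates(prepositions, pair_sets, threshold=0.6):
    """Propose (rare, kept) merges: q is folded into p when q co-occurs with p
    in more than `threshold` of q's uses, and the reverse ratio is lower."""
    merges = []
    for p, q in combinations(prepositions, 2):
        p_given_q = cooccurrence_ratio(p, q, pair_sets)
        q_given_p = cooccurrence_ratio(q, p, pair_sets)
        if p_given_q > threshold and p_given_q > q_given_p:
            merges.append((q, p))  # q is merged into p
        elif q_given_p > threshold and q_given_p > p_given_q:
            merges.append((p, q))
    return merges
```

With the figures above, P(dans | à l'intérieur de) = 0.91 exceeds both the 0.6 threshold and the reverse ratio of 0.43, so à l'intérieur de is folded into dans and not vice versa.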
As discussed in Section II, we make the domain-specific
assumption that there is a one-to-one correspondence between
prepositions and the SRs they denote. While our machine
learning task is SR detection, we ask annotators to annotate
our data with the corresponding prepositions (a more human-friendly task).
³ Image adapted from: http://lear.inrialpes.fr/RecogWorkshop08/documents/everingham.pdf
⁴ http://nlp.stanford.edu/software/lex-parser.shtml#Download
⁵ This is a very high threshold and far above co-occurrence percentages for any other preposition pairs.
