IEEE Robotics & Automation Magazine - December 2015 - 149

computer vision proposals for detecting and recognizing objects in images obtained from Flickr. Despite the intrinsic relationship between object recognition and robotics, the use of generalist images prevents us from using PASCAL VOC proposals to solve the problem of robot localization. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [12] can be seen as the natural successor of PASCAL VOC, which ended in 2012.

The Reconstruction Meets Recognition Challenge (RMRC) [15] started in 2013 and is similar to the Robot Vision challenge. First, this challenge evaluates proposals for two robotic problems: segmentation and detection. Moreover, these tasks are presented for indoor environments that have been imaged using RGB-D sensors (using images from the New York University [8] data set). The main difference with respect to the Robot Vision task is the annotation scheme: RGB-D images are labeled at pixel/point level with object categories. In addition, the RMRC RGB-D images were not recorded with temporal continuity, which is an important issue to keep in mind when trying to solve a localization task.

We can also find an ongoing challenge proposal with a strong relationship with the Robot Vision task. This is the Large-Scale Scene Understanding Challenge (LSUN) (http://…), which holds a scene classification task. In this task, perspective images should be classified with the scene category using ten different options. Some of the scene categories used in the LSUN challenge have already been used in the Robot Vision task, like the conference room or the kitchen.

Task Evolution
Despite the fact that the Robot Vision challenge was initially planned as a visual place recognition competition, other additional tasks have been included since its inception. Moreover, the information provided to the participants has also changed from one edition to the next. A summary of this evolution can be seen in Table 1.

Table 1. The task evolution.
[Columns: first through fifth edition. Rows: Monocular images; Point clouds; Semantic annotations; Pose annotations; Objective; Unknown classes; Kidnappings; Object detection. Only two cell values are legible: "X" (Semantic annotations) and "Two tasks" (Objective).]

The first edition of the competition [16] included room annotations, but also pose annotations. Concretely, each image was annotated with two different types of information: 1) the label of the room where the image was acquired and 2) the specific <x, y, θ> pose of the robot. Although pose annotations were included in the training data, participants were encouraged not to use this information in their final proposals.

In the second and third editions [17], [18], pose annotations were removed and monocular images were replaced by stereo ones, allowing participants to exploit the three-dimensional (3-D) configuration of the environment. The fourth edition of the challenge [19] included two different cues: visual information and depth information. In this edition, the depth information was provided in the form of depth images. Finally, the fifth edition [20] included unprocessed 3-D information in the form of point cloud files (PCD format [21]).

With respect to the objectives of the competition, two main tasks have been proposed since the first edition. Both tasks focus on visual place recognition, but they differ in the source of information. For the first task, participants have to provide information about the location of the robot separately for each test image. On the other hand, in the second task, the temporal continuity of the sequence can be used to classify images. When presented with a test image, participants can rely on the information obtained using the previous images, making this task closer to real-world robotic localization scenarios. The fifth edition of the challenge also introduced an object recognition task. Visual place recognition and object recognition can be considered as two subproblems of semantic localization, where each location is described in terms of its semantic contents.

The organizers proposed a baseline method for both the feature extraction and the classification steps.

Data Sets
Since 2009, several data sets have been created for the Robot Vision competition. The first data set used in the challenge was the KTH-IDOL2 database [22]. This data set was acquired using a mobile robot platform in the indoor environment of the Computer Vision and Active Perception Laboratory (CVAP) at the Royal Institute of Technology (KTH) in Stockholm, Sweden. Each training image was annotated with the topological location of the robot and its pose <x, y, θ>. As previously mentioned, although the pose information was provided in the training data, participants






