conversations occur and the short interaction duration.
Moreover, only two people interact with the robot, and the
conversation is only between the robot and the humans.
Addressee Recognition in Visual Scenes With Utterances
(ARVSU) [2] is a mock dataset created by further annotating
the preexisting GazeFollow dataset [28], which offers large
variations in visual scenes. The dataset contains the original
images, cropped speaker images with annotated head
locations, gaze information, and additional annotations of the
utterance text and of to whom each utterance is addressed.
The utterances were generated artificially by humans from
the images and the original gaze information.
However, this dataset does not contain natural human-
human-robot interactions and is not suitable for exploring
AD in a realistic situation.
In summary, these datasets have two things in common:
they are not recorded in mixed human-to-human and
human-to-robot settings, and they are not spatiotemporally
annotated. Furthermore, they are recorded with few participants
inside a meeting room. For these reasons, they
are not well suited to studies that aim to endow
robots with the capacity to interact with humans naturally.
This work seeks to overcome these problems by introducing
a new, spatiotemporally annotated dataset recorded in
mixed human-to-human and human-to-robot settings.
AD
One of the earliest approaches to AD in multiparty interaction
was proposed by Traum et al. [11]: a rule-based approach
that leverages the previous and current utterances and the
immediately previous and current speakers to detect the
addressee. The accuracy varies between 65% and 100% on the
Mission Rehearsal dataset, depending upon the DA, but the
algorithm achieves an accuracy
of only 36% on the AMI dataset [26]. Following this, op
den Akker and Traum [13] improved the accuracy to 65% by
incorporating gaze as one of the foundations of the rules
in the algorithm. They also explored gaze as the only feature
and reported an accuracy of 57%. The proposed rule was as
follows: "If the speaker looks at a participant for more than
80% of the utterance duration, the addressee is that participant;
otherwise, the utterance is addressed to the whole group."
However, this rule fails when the speaker addresses another
participant with short utterances without looking at them,
and it is too specific to generalize.
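As an illustration, this gaze rule can be written down in a few lines of Python. The sketch below assumes per-frame gaze-target labels for the speaker over a single utterance; apart from the 0.8 threshold quoted above, all names and data structures are placeholders rather than the implementation of [13].

from collections import Counter

GAZE_THRESHOLD = 0.8  # "more than 80% of the utterance duration" [13]

def detect_addressee(gaze_targets):
    # gaze_targets: one entry per frame of the utterance, naming the participant
    # the speaker looks at (None when the gaze is elsewhere).
    counts = Counter(t for t in gaze_targets if t is not None)
    if not gaze_targets or not counts:
        return "GROUP"
    target, frames = counts.most_common(1)[0]
    if frames / len(gaze_targets) > GAZE_THRESHOLD:
        return target
    return "GROUP"

# Example: the speaker looks at participant "P2" for 9 of 10 frames.
print(detect_addressee(["P2"] * 9 + [None]))  # -> P2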
Other works have tried to address AD with a statistical
approach. Jovanovic et al. [23] proposed a Bayesian network-based
framework trained on the M4 multimodal, multiparty
corpus [23] and achieved an accuracy of 81.05%.
They employed different modalities, such as current utterance,
previous utterance, speaker, topic of discussion,
gaze, and several metafeatures. Adopting the algorithm
presented in [12], op den Akker and Traum [13] reported
an accuracy of 62% on the AMI corpus. op den Akker and
op den Akker [29] presented a logistic regression tree-based
model to classify whether a participant is the addressee of
an utterance, attaining an accuracy of 92% on the AMI corpus.
However, since the classifier depends on the addressee's
position, it is difficult to extend to settings with more participants.
Baba et al. [30] introduced a support vector machine-based
binary classification model that predicts whether an utterance is
directed to an agent or to a human, using head orientation,
acoustic features, and text as input features. For this
purpose, they collected a human-human-agent triadic
conversation corpus through WoZ experiments and
reported an accuracy of 80.28%.
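For intuition, a feature-level (early fusion) support vector machine classifier of this kind can be sketched in a few lines of scikit-learn. The specific features below (head yaw, two acoustic statistics, TF-IDF of the transcript) and the toy data are assumptions for illustration only, not the exact setup of [30].

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy per-utterance inputs: head yaw (degrees), acoustic statistics (e.g., mean
# pitch and energy), transcripts, and labels (1 = addressed to the agent).
head_yaw = np.array([[2.0], [35.0], [-3.0], [40.0]])
acoustic = np.array([[180.0, 0.7], [120.0, 0.4], [200.0, 0.8], [110.0, 0.3]])
texts = ["what can you do", "no I meant the other one",
         "please show me the menu", "did you see that"]
labels = np.array([1, 0, 1, 0])

# Text features: TF-IDF over the transcripts.
text_feats = TfidfVectorizer().fit_transform(texts).toarray()
# Numeric features: standardized head orientation and acoustic statistics.
numeric = StandardScaler().fit_transform(np.hstack([head_yaw, acoustic]))

# Early fusion: concatenate all modalities and train a binary SVM.
features = np.hstack([numeric, text_feats])
clf = SVC(kernel="rbf").fit(features, labels)
print(clf.predict(features[:1]))  # predicted class for the first utterance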
The problem with statistical and rule-based approaches
is that they depend on specific tasks or settings or do
not generalize to other scenarios and situations. To overcome
this, Malik et al. [31] proposed a machine learning model
based on generic features to predict the addressee in
datasets with varying numbers of participants.
In a follow-up work, Malik et al. [32] proposed models
trained using different machine learning and deep learning
algorithms. Their work improves existing baseline
accuracies for addressee prediction on two datasets.
However, these two works did not propose a deep learning
framework to learn complex patterns from the dataset.
Instead, they employed different machine learning
classification algorithms and simple neural network classifiers.
The only approach that employed deep learning for
AD was proposed in [2]. The researchers use a convolutional
neural network [19] to detect the addressee from
visual scenes. They train their network by extending the
GazeFollow dataset [28] with artificially generated utterances.
For utterance understanding, a recurrent neural network
(long short-term memory) is used, attaining 62.5%
accuracy. However, their model performs AD from a
third-party observer's perspective and relies on a dataset
that does not represent a real-world scenario. In this
work, we propose a deep learning framework that leverages
long- and short-term audiovisual features from the
newly built dataset to perform AD.
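As a rough illustration of such a framework, the sketch below shows a generic two-stream audiovisual classifier in PyTorch: a visual stream over face crops (short-term cues such as gaze and head pose), an audio stream over log-mel spectrogram chunks, and a recurrent layer that aggregates the fused features over a longer window. Input shapes, layer sizes, and names are placeholders, not the architecture proposed in this article.

import torch
import torch.nn as nn

class TwoStreamAD(nn.Module):
    # Illustrative two-stream audiovisual classifier for "addressed / not
    # addressed": per-frame visual and audio encoders, feature concatenation,
    # and a GRU for longer-term temporal context.
    def __init__(self, hidden=128):
        super().__init__()
        self.visual = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())             # -> (B*T, 64)
        self.audio = nn.Sequential(
            nn.Conv1d(40, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())             # -> (B*T, 64)
        self.temporal = nn.GRU(128, hidden, batch_first=True)  # long-term context
        self.head = nn.Linear(hidden, 2)

    def forward(self, frames, mels):
        # frames: (B, T, 3, 96, 96) face crops; mels: (B, T, 40, 20) log-mel chunks
        B, T = frames.shape[:2]
        v = self.visual(frames.flatten(0, 1)).view(B, T, -1)
        a = self.audio(mels.flatten(0, 1)).view(B, T, -1)
        fused, _ = self.temporal(torch.cat([v, a], dim=-1))
        return self.head(fused[:, -1])  # logits at the last time step

model = TwoStreamAD()
logits = model(torch.randn(2, 8, 3, 96, 96), torch.randn(2, 8, 40, 20))
print(logits.shape)  # torch.Size([2, 2])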
Audiovisual Fusion Approaches
Recently, several works have successfully modeled audiovisual
features jointly for different problems. Hu et al. [33]
and Ren et al. [34] use audio and visual signals for automatic
speaker naming. Hu et al. [33] assume nonoverlapping
speech, while Ren et al. [34] relax that assumption.
Nock et al. [14] jointly utilized audiovisual information for
lipreading. Roth et al. [16] proposed a spatiotemporally
annotated audiovisual dataset and two-stream baseline
frameworks to detect active speakers in video. Since then,
numerous active speaker detection methods built on this
dataset have been proposed and have shown substantial
progress. However, AD
has not been explored with a joint analysis of audiovisual
features due to the absence of a dataset with dense and
temporal labels. This work presents E-MuMMER, a spatiotemporally annotated dataset of mixed human-to-human and human-to-robot interactions, to fill this gap.