acoustic features and 14 head orientation features). However, humans adopt an almost unlimited range of head orientations when they speak to one another, and the features used in that work capture only a few of the orientations humans use to address one another. Their approach therefore performs poorly when humans address one another or the robot with a different head orientation, or in new settings. Ours overcomes this limitation by extracting dense features from facial regions with a visual encoder across different speech segments.
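As an illustration of this idea (a minimal sketch, not the authors' exact encoder), the following PyTorch snippet turns a sequence of face crops from one speech segment into per-frame dense features; the ResNet-18 backbone, the 128-dimensional output, and the name FaceSegmentEncoder are assumptions made only for this example.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class FaceSegmentEncoder(nn.Module):
    """Encode every face crop in a speech segment into a dense feature vector."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        backbone = resnet18(weights=None)  # assumed backbone; no pretrained weights
        backbone.fc = nn.Linear(backbone.fc.in_features, feat_dim)
        self.backbone = backbone

    def forward(self, faces: torch.Tensor) -> torch.Tensor:
        # faces: (batch, time, 3, H, W) face crops covering one speech segment
        b, t, c, h, w = faces.shape
        feats = self.backbone(faces.reshape(b * t, c, h, w))  # encode each frame
        return feats.reshape(b, t, -1)                        # (batch, time, feat_dim)

# Example: two 25-frame segments of 112 x 112 face crops.
segments = torch.randn(2, 25, 3, 112, 112)
dense_face_features = FaceSegmentEncoder()(segments)          # shape (2, 25, 128)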
In summary, this study has several advantages over all previous works. Regarding the dataset, the proposed dataset has two main advantages over previous ones: it was recorded in mixed settings, which is more realistic, and it carries spatiotemporal multimodal annotation. Regarding the approaches, all previous works followed the same direction, fusing gaze, the current and previous utterance, text, and head orientation.
Our work sets a path for a new research direction by introducing an AD approach that jointly analyzes facial regions and audio features. ADNet employs a two-stream, deep learning-based embedding to extract dense features from the two modalities. As a result, the proposed model can learn complicated patterns, anticipate the addressee in various contexts, and generalize well to new settings. Furthermore, all previous works use a fixed number of features extracted from the gaze or a static image, whereas humans address one another or a robot with short- or long-term utterances. ADNet overcomes this by accepting short-term, long-term, or otherwise variable-length segment inputs.
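The sketch below shows, under stated assumptions, how such a two-stream model can accept segments of any length: each stream is a recurrent encoder over per-frame features, and a frame-wise classifier operates on the fused sequence. The GRU layers, feature sizes, and class name TwoStreamAD are illustrative stand-ins rather than ADNet's actual layers.

import torch
import torch.nn as nn

class TwoStreamAD(nn.Module):
    """Toy two-stream addressee detector over variable-length audiovisual segments."""
    def __init__(self, audio_dim=40, visual_dim=128, hidden=128, num_classes=2):
        super().__init__()
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.visual_rnn = nn.GRU(visual_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)  # frame-wise prediction head

    def forward(self, audio, visual):
        # audio: (batch, T, audio_dim); visual: (batch, T, visual_dim); T may vary per call
        a, _ = self.audio_rnn(audio)
        v, _ = self.visual_rnn(visual)
        return self.classifier(torch.cat([a, v], dim=-1))     # (batch, T, num_classes)

model = TwoStreamAD()
short_out = model(torch.randn(1, 20, 40), torch.randn(1, 20, 128))    # short segment
long_out = model(torch.randn(1, 200, 40), torch.randn(1, 200, 128))   # long segment

The same module handles both calls because nothing in it is tied to a fixed segment length.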
Finally, this work aims to indicate the way forward in the field by introducing a novel approach. However, there is room for improvement. Unlike other speech activity and vision datasets, the E-MuMMER dataset is not large, and deep learning models demand large amounts of data to attain better performance. We therefore strongly recommend that others in the field enlarge the dataset with the introduced technique, either to further improve the proposed framework's performance or to support newly introduced frameworks.
Conclusion
AD plays a significant role in endowing a robot with the ability to identify whether a person is addressing it. However, the area has not been explored as widely as it should be using informative communication cues. Previous works relied heavily on statistical and rule-based approaches, which do not generalize well to new settings. The lack of a
spatiotemporal dataset has hindered the area from being explored with state-of-the-art techniques such as deep learning. This study proposes a new paradigm for AD: the joint utilization of long- and short-term facial and audio features.
To attain this, we introduce the E-MuMMER dataset, with dense, spatiotemporal annotations of spoken activity, by extending an existing HRI dataset, the MuMMER dataset. E-MuMMER is the first publicly available benchmark for the AD task. In addition, the face-associated speaking annotations make the dataset interesting for other problems that require multimodal information, such as speaker identification and active speaker detection. This work proposes ADNet, a two-stream deep learning framework that takes variable-length segments of audiovisual information as input to predict the addressee frame by frame. The ablation experiment shows that the BLF outperforms the deterministic concatenation approach with a 1% gain in accuracy. The experiments also reveal that the joint utilization of the CA and SA modules enhances the prediction performance, with the CA module contributing more than the SA module.
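To make this comparison concrete, the sketch below contrasts plain concatenation with two alternative fusion mechanisms, assuming that BLF denotes a learned bilinear fusion layer and that CA and SA denote cross-attention and self-attention; the dimensions, head counts, and layer choices are assumptions for illustration, not ADNet's reported configuration.

import torch
import torch.nn as nn

d = 128                                   # assumed per-stream feature dimension
audio = torch.randn(2, 50, d)             # (batch, T, d) audio-stream features
visual = torch.randn(2, 50, d)            # (batch, T, d) visual-stream features

# Deterministic concatenation baseline.
concat_fused = torch.cat([audio, visual], dim=-1)            # (2, 50, 2 * d)

# Bilinear fusion: a learned pairwise interaction between the two streams.
blf = nn.Bilinear(d, d, d)
blf_fused = blf(audio, visual)                               # (2, 50, d)

# Cross-attention: the visual stream queries the audio stream...
ca = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
ca_out, _ = ca(query=visual, key=audio, value=audio)

# ...followed by self-attention over the cross-attended sequence.
sa = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
sa_out, _ = sa(ca_out, ca_out, ca_out)                       # (2, 50, d)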
Compared to previous works, ours is the first to introduce a spatiotemporally annotated multimodal dataset and to propose a framework that accepts variable-length multimodal inputs, learns complex speech activities, and generalizes well to new settings. The E-MuMMER dataset is small compared to related computer vision datasets; to enhance the model's prediction performance, one can extend it further following the introduced procedure.
Acknowledgment
We gratefully acknowledge support from the Zhejiang Province Ten Thousand Talents Program under program ID 2019R51010, the Zhejiang Lab Postdoctoral Start-Up Fund ID 115002-UB2107QJ, and the National Natural Science Foundation of China (U21A20488). We thank the Idiap Research Institute for providing the MuMMER dataset. Wei Song is the corresponding author.
About the Authors
Fiseha B. Tesema (fisehab@zhejianglab.com) earned his Ph.D. degree in computer science and technology from the University of Electronic Science and Technology of China in 2020. He is a postdoctoral fellow at Zhejiang Lab, Hangzhou 311121, Zhejiang Province, P.R. China. His research interests include computer vision, human-robot interaction, audio signal processing, cross-modal learning, multimodal
