
The feature functions $f_A : (l, a) \rightarrow \mathbb{R}^{1 \times 128}$ and $f_V : (l, v) \rightarrow \mathbb{R}^{1 \times 128}$ take as input the CA features $a$ and $v$, respectively, and a location $l$, and output a feature of size $1 \times 128$. The feature outputs are combined at each location using the outer matrix product; i.e., the bilinear combination of $f_A$ and $f_V$ at location $l$ is given by

$$\mathrm{bilinear}(l, a, v, f_A, f_V) = f_A(l, a)^{T} f_V(l, v), \tag{8}$$

where $a$ is $A_{CA} \in \mathbb{R}^{1 \times 128}$ and $v$ is $V_{CA} \in \mathbb{R}^{1 \times 128}$. $\mathrm{BLF}(a, v)$ aims to aggregate the bilinear combination of the two cross-modal attention features at each pixel location $l$ along a temporal direction $t$ [45]:

$$\mathrm{BLF}(a, v) = \sum_{l \in L} \mathrm{bilinear}(l, a, v, f_A, f_V) = \sum_{l \in L} f_A(l, a)^{T} f_V(l, v). \tag{9}$$

The resulting feature $\mathrm{BLF} \in \mathbb{R}^{1 \times 256}$ is generated to capture the multiplicative interaction at the corresponding spatial location. The BLF is added after the audiovisual CA layers, fusing the audio and video attention features to generate the fused feature $E_{av}$.
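For illustration, the following is a minimal PyTorch sketch of the bilinear fusion in (8) and (9), assuming the audio and video CA features are given as $L \times 128$ tensors. The final projection from the $128 \times 128$ interaction map down to the stated $1 \times 256$ output is an assumption, since the article does not detail that reduction step.

```python
import torch
import torch.nn as nn

class BilinearFusion(nn.Module):
    """Sketch of the BLF in (8)-(9): per-location outer products of the
    audio/video CA features, summed over all locations. The 256-d projection
    is an assumption; the article only states that BLF lies in R^(1x256)."""
    def __init__(self, dim: int = 128, out_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim * dim, out_dim)  # assumed reduction step

    def forward(self, a_ca: torch.Tensor, v_ca: torch.Tensor) -> torch.Tensor:
        # a_ca, v_ca: (L, 128) CA features at each of L locations
        outer = torch.einsum("ld,le->lde", a_ca, v_ca)   # (L, 128, 128), eq. (8)
        pooled = outer.sum(dim=0)                        # aggregate over l, eq. (9)
        return self.proj(pooled.flatten()).unsqueeze(0)  # (1, 256) fused feature

blf = BilinearFusion()
a_ca, v_ca = torch.randn(49, 128), torch.randn(49, 128)
print(blf(a_ca, v_ca).shape)  # torch.Size([1, 256])
```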
SA

SA, also known as intra-attention, computes the response at a position in a sequence by attending to all positions within the same sequence [44]. We adopt SA to find relevant features and long-term cues. It takes as input the fused features $E_{av}$ from the BLF to model the audiovisual utterance-level temporal information. Apart from the query $Q_{av}$, key $K_{av}$, and value $V_{av}$, which come from the joint audiovisual feature, the SA architecture is identical to the CA network shown in Figure 7(b). This module aims to distinguish between frames addressing the robot and frames addressing other subjects.
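As a rough illustration, the SA step can be sketched with a standard multihead attention layer in which the query, key, and value are all the fused feature $E_{av}$. The 256-d width and batch layout below are assumptions; the single layer with eight heads matches the implementation details given later.

```python
import torch
import torch.nn as nn

# Minimal sketch of the SA step: the fused audiovisual features attend to
# themselves (Q = K = V = E_av). Feature width and batch layout are assumed.
self_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

e_av = torch.randn(1, 100, 256)          # (batch, video frames, fused feature dim)
sa_out, _ = self_attn(e_av, e_av, e_av)  # intra-attention over all frames
print(sa_out.shape)                      # torch.Size([1, 100, 256])
```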
Loss Function

Following the SA layer, a fully connected layer and a softmax operation are attached to project the output features of the SA network to the AD label sequence. In this work, we consider AD as a frame-level classification task. The cross-entropy loss is adopted to compare the predicted sequence with the ground truth sequence. The adopted loss function is given in (10), where $p_j$ and $y_j$, $j \in [1, N]$, are the predicted and ground truth AD labels of the $j$th video frame, respectively, and $N$ refers to the number of video frames:

$$\mathcal{L} = -\frac{1}{N} \sum_{j=1}^{N} \left[ y_j \log(p_j) + (1 - y_j) \log(1 - p_j) \right]. \tag{10}$$
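Equation (10) is the standard per-frame binary cross-entropy; a short sketch, assuming $p_j$ are the predicted probabilities of the positive (robot-addressed) class:

```python
import torch

def ad_loss(p: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Frame-level cross-entropy of (10): p holds predicted probabilities and
    y the ground truth AD labels for N video frames; mean() supplies the 1/N."""
    return -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()

p = torch.tensor([0.9, 0.2, 0.7])   # predicted AD probabilities, one per frame
y = torch.tensor([1.0, 0.0, 1.0])   # ground truth AD labels
print(ad_loss(p, y))                # same value as F.binary_cross_entropy(p, y)
```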
Experiments
Implementation Details, Model Complexity, and Inference Time
The whole network was trained in an end-to-end manner. ADNet was developed with the PyTorch library, and the stochastic gradient descent optimizer was adopted. The initial learning rate is set to $10^{-4}$ and decreases by 5% every epoch. One transformer layer with eight attention heads is used for both the CA and SA networks. We apply random flipping, rotation, and crop operations to augment the images.

The model inference time was tested on an NVIDIA GeForce GPU with 12 GB of memory. The model predicts a single face within 1.45 ms. The complexity analysis [44], [46] of the proposed frameworks is shown in Table 4.
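A minimal sketch of this optimization schedule in PyTorch, using a placeholder module in place of ADNet; only the optimizer choice, the $10^{-4}$ initial learning rate, and the 5% per-epoch decay come from the text:

```python
import torch
import torch.nn as nn

# Sketch of the stated optimization setup: SGD with initial learning rate
# 1e-4, decayed by 5% after every epoch. The tiny model is a stand-in for
# ADNet, just to make the scheduler behavior concrete.
model = nn.Linear(256, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(3):
    # ... a real loop would run optimizer.step() over all batches here ...
    scheduler.step()  # multiply the learning rate by 0.95 after each epoch
    print(epoch, optimizer.param_groups[0]["lr"])  # 9.5e-05, 9.025e-05, ...
```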
Table 4. Per-layer complexity.

Layer Type              Complexity per Layer
Fully connected layer   O(n^2)
SA                      O(n^2 · d)
Convolution             O(k · n · d^2)
BLF                     O(n^2)

n is the sequence length, d is the representation dimension, and k is the kernel size of convolution.

Ablation Experiments
BLF Versus Concatenation
We investigate the performance of the proposed model by comparing the BLF with a deterministic feature-fusion approach to choose the better fusion method. The experiments were conducted under the same settings on the E-MuMMER test set for a fair comparison. The only difference is that we replace the BLF with concatenation in the baseline framework (ADNet). As shown in Table 5, although the concatenation-based approach showed better prediction results in terms of recall, the BLF achieved a 1% accuracy improvement over the concatenation approach as well as better precision. Unlike the concatenation module, which blindly fuses heterogeneous features, the BLF considers the interaction of local features between the $A_{CA}$ and $V_{CA}$ features, capturing pairwise feature relationships and leading to better classification accuracy than the concatenation module.
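To make the contrast concrete, a hypothetical sketch of the concatenation baseline: the two CA features are pooled and stacked into a single 256-d vector, so no pairwise interactions between audio and video features are modeled. The per-location mean pooling used here is an assumption, not a detail stated in the article.

```python
import torch

def concat_fusion(a_ca: torch.Tensor, v_ca: torch.Tensor) -> torch.Tensor:
    """Concatenation baseline for the ablation: pool each modality over its
    locations and stack the results, without any multiplicative interaction."""
    return torch.cat([a_ca.mean(dim=0), v_ca.mean(dim=0)]).unsqueeze(0)  # (1, 256)

a_ca, v_ca = torch.randn(49, 128), torch.randn(49, 128)
print(concat_fusion(a_ca, v_ca).shape)  # torch.Size([1, 256])
```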
Does the Attention Module Help?
We undertake an ablation experiment to observe the contribution of the attention modules, removing the two attention modules one at a time. The results are shown in Table 6 and discussed as follows:
◆ Without CA: When the CA module is removed, the
model does not show a performance gain in terms of
Bal.Accu. We found that SA better distinguishes speech