IEEE Computational Intelligence Magazine - February 2022 - 70

While it is possible to use the notion of contribution
on classification or regression model predictions,
it is also possible to use it on an RL reward r as a
final payout that needs to be explained, or "broken
down".
that, when considered individually, are coherent for the excluded
agent with respect to the others. However, from the perspective
of a single agent, the standard deviation between their respective
values is large. In particular, there is a significant gap
between noop and the other two methods. Figure 7 shows
an average gap of 73.1 reward units between noop and
random_player_action, but an average gap of only 20.5
reward units between random and random_player_action. This can
be explained by the fact that randomly moving agents disturb
the game significantly more than immobilized ones, as they can
get negative rewards by hitting the map borders in Multiagent
Particle or killing trees in Harvest (i.e., harvesting all apples that
are contiguous). Thus, when using the random or replace strategy, a
majority of coalitions are "parasitized" by these negative rewards,
which lower the global reward and lead to
an overall lower Shapley value than with the noop method (as
observed in Figures 2, 5 and 7). Therefore, in that context,
noop action selection seems to be the most faithful method:
it yields Shapley values that closely assess the agents' true
contributions, free from random and unwanted negative rewards.
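The comparison above can be made concrete with a minimal sketch of the three exclusion strategies. It assumes discrete actions, that random draws a uniform action, and that random_player_action copies the current action of a randomly chosen coalition member; the function name `coalition_actions`, the `NOOP` index, and the fallback for an empty coalition are hypothetical choices made for illustration, not the article's implementation.

```python
import random
from typing import Dict, Set

NOOP = 0  # hypothetical index of the no-operation action


def coalition_actions(
    policy_actions: Dict[str, int],
    coalition: Set[str],
    n_actions: int,
    method: str,
    rng: random.Random,
) -> Dict[str, int]:
    """Build the joint action, overriding every agent outside the coalition."""
    members = sorted(coalition)  # sorted for reproducible sampling
    joint: Dict[str, int] = {}
    for agent, action in policy_actions.items():
        if agent in coalition:
            # Coalition members keep the action chosen by their trained policy.
            joint[agent] = action
        elif method == "noop":
            # Excluded agents stay idle and do not disturb the game.
            joint[agent] = NOOP
        elif method == "random":
            # Excluded agents act uniformly at random (assumed meaning).
            joint[agent] = rng.randrange(n_actions)
        elif method == "random_player_action":
            # Excluded agents copy a random coalition member (assumed meaning).
            # Falling back to NOOP for an empty coalition is an assumption.
            joint[agent] = policy_actions[rng.choice(members)] if members else NOOP
        else:
            raise ValueError(f"unknown exclusion method: {method}")
    return joint


rng = random.Random(0)
actions = {"a0": 3, "a1": 2, "a2": 4}
print(coalition_actions(actions, {"a0"}, n_actions=5, method="noop", rng=rng))
```

Only the noop branch leaves excluded agents unable to collect penalties of their own, which is consistent with the higher Shapley values reported for that method.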
VI. Discussion
This article demonstrated the usefulness of Shapley values and
their Monte Carlo approximation for explaining RL models in
cooperative settings. These values provide a form of explanation,
i.e., continuous values that are understandable by researchers and
developers, since they represent a portion of the team's reward,
partitioned according to each agent's contribution.
They could also provide explanations for the general public,
who may perceive them as an intrinsic "value" of each agent,
making them accountable for the effectiveness of the system.
Moreover, Shapley values could be a good way to detect biases in
the training of an RL model, since they require analyzing the
individual behavior of each agent and this could highlight disparities
between their different strategies and abilities.
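For reference, the partition of the team reward described above follows the standard game-theoretic definition of the Shapley value. The notation here is an assumption made for illustration: N is the set of agents, v(C) is the mean global reward obtained when only the agents in coalition C follow their trained policy, and \phi_i is the Shapley value of agent i.

```latex
\phi_i(v) \;=\; \sum_{C \subseteq N \setminus \{i\}}
  \frac{|C|!\,\bigl(|N| - |C| - 1\bigr)!}{|N|!}
  \,\bigl( v(C \cup \{i\}) - v(C) \bigr),
\qquad
\sum_{i \in N} \phi_i(v) \;=\; v(N) - v(\emptyset).
```

Under the usual convention v(\emptyset) = 0, the second identity (the efficiency property) states that the Shapley values sum to the reward of the grand coalition.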
Concerning the player exclusion method used to replace agents
missing from a coalition, the noop (no-operation) action seems to be
the most neutral and interaction-free method when the environment
offers this possibility, since substitution mechanisms
based on random selection of actions are
prone to incur large negative rewards and interfere in the game.
Social interaction between agents was also explored, and this
investigation showed that Shapley values can effectively
capture both efficiency and equality metrics, and that they can
still be computed even when the reward is globally shared
between agents. This is a major advantage when the environment
does not enable fair individual-level credit assignment. In
consequence, it can be asserted that Shapley
values are an effective way to explain the contributions
of RL agents and, to some extent,
the relationships between them.
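As an illustration of the social metrics mentioned above, the sketch below computes them from a list of per-agent values (individual rewards or Shapley values). It assumes that efficiency is the total value and that equality is one minus the Gini coefficient, as is common in the sequential social dilemma literature; the article's exact definitions are not shown in this section, and the function names are hypothetical.

```python
from typing import Sequence


def efficiency(values: Sequence[float]) -> float:
    """Utilitarian metric: total value collected by the team (assumed definition)."""
    return float(sum(values))


def equality(values: Sequence[float]) -> float:
    """One minus the Gini coefficient (assumed definition).

    Meaningful only for non-negative values with a positive total;
    1.0 means a perfectly even split between agents.
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0:
        raise ValueError("equality needs non-negative values with a positive total")
    # Sum of absolute differences over all ordered pairs of agents.
    pairwise = sum(abs(x - y) for x in values for y in values)
    return 1.0 - pairwise / (2.0 * n * total)


per_agent = [13.0, 13.0, 10.0]  # hypothetical per-agent Shapley values
print(efficiency(per_agent), equality(per_agent))
```

When the environment only exposes a shared global reward, individual rewards are unavailable, so per-agent Shapley values are the only input on which such a metric can be computed.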
However, our approach is limited to multi-agent
cooperative RL and, in its current form,
cannot be applied to competitive and single-agent
models. In addition, it cannot be used to
explain an agent's actions, their sustainability in time, nor
explain a specific episode of interest, as it only provides an average
metric for the contribution of each agent in a cooperative
game, with the total of Shapley values corresponding to the
mean global reward of the grand coalition (i.e., the one containing
all agents). Thus, it must be considered as a way to get a
first ranking of contributions of agents in a model. Finally,
while the Monte Carlo method to estimate Shapley values (see
Section IV) is more efficient than computing the exact Shapley
values, it remains time-consuming. Future work should
seek to preserve the accuracy of the estimated Shapley values while
accelerating their approximation.
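The cost trade-off described above can be seen in a minimal sketch of permutation sampling, one common Monte Carlo estimator of Shapley values; the estimator of Section IV may differ in its details. The names `monte_carlo_shapley` and `toy_value` are hypothetical, and the toy coalition value stands in for what would in practice be the mean global reward of evaluation episodes run with the excluded agents replaced.

```python
import random
from typing import Callable, Dict, FrozenSet, List


def monte_carlo_shapley(
    agents: List[str],
    value: Callable[[FrozenSet[str]], float],
    n_samples: int,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled orderings of the agents."""
    rng = random.Random(seed)
    totals = {agent: 0.0 for agent in agents}
    for _ in range(n_samples):
        order = agents[:]
        rng.shuffle(order)
        coalition: FrozenSet[str] = frozenset()
        previous = value(coalition)
        for agent in order:
            # Add the agent and credit it with the change in coalition value.
            coalition = coalition | {agent}
            current = value(coalition)
            totals[agent] += current - previous
            previous = current
    return {agent: total / n_samples for agent, total in totals.items()}


def toy_value(coalition: FrozenSet[str]) -> float:
    """Hypothetical coalition reward: 10 per agent, plus a bonus of 6
    when a0 and a1 are both present (a stand-in for real rollouts)."""
    reward = 10.0 * len(coalition)
    if {"a0", "a1"} <= coalition:
        reward += 6.0
    return reward


estimates = monte_carlo_shapley(["a0", "a1", "a2"], toy_value, n_samples=2000)
print(estimates)
```

Each sampled ordering costs one coalition evaluation per agent, plus one for the empty coalition, instead of the 2^n evaluations of the exact computation. In an RL setting every evaluation is itself a batch of episodes, which is why the estimate remains time-consuming. Because the marginal contributions telescope along each ordering, the estimates sum to the grand-coalition value minus the empty-coalition value regardless of the number of samples.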
VII. Conclusion and Future Work
The three research questions were positively answered, with
experiments conducted in two socially challenging multi-agent
RL environments (Harvest from Sequential Social Dilemmas
[15], [43] and Multiagent Particle [14]) and two different RL
algorithms (MADDPG [14] and A3C [1]). Experiments
showed that the computation of Shapley values could be a
potential breakthrough towards explainable multi-agent RL (XRL).
They can efficiently
assess the contribution of agents to the global reward in cooperative
settings. They also provide insightful information about
the agents' behavior and their social interactions.
Nonetheless, numerous issues remain to be addressed in
future work. Different interpretations of Shapley values must
be explored to further explain deep RL issues and increase the
level of explanation granularity. Robustness and reproducibility
remain critical issues for XRL (and XAI in a more general
sense), and other statistical methods could prove very useful for
that purpose, as presented in [45]-[47] (e.g., Winsorised or
trimmed estimators). Moreover, Shapley values could also be
combined with a robust model selection measure (such as the
Lorenz Zonoids [21]). Besides, Shapley values or other additive
and non-additive methods could be used not only to explain
the roles taken by agents when learning a policy to achieve a
collaborative task, but also to detect defects in agents while
training, or in the fed data. Furthermore, the dynamic nature of
RL (vs. the static settings of most ML models where only a single
data point needs to be explained) could be taken into
account in order to create a novel approach that evaluates the
contributions of agents through time (e.g., during evaluation
time). Here, "temporal" Shapley values could be approximated
with a model as in [24]. However, one of the main advantages
of SHAP being a post-hoc XAI method (i.e., being agnostic to
the RL algorithm) would be lost, as the Shapley prediction