IEEE Computational Intelligence Magazine - August 2019 - 22

The total variance of the gradient estimator can be further reduced by subtracting a state-dependent baseline $V_\phi(s_t)$ and centering the learning signal. Accordingly, the policy gradient estimate becomes

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\frac{\hat{A}(s_t, a_t) - \bar{A}_\omega(s_t, a_t) - \tilde{A}_\mu}{\tilde{A}_\sigma}\right] + \mathbb{E}_{\rho_\pi}\!\left[\nabla_a Q_\omega(s_t, a)\big|_{a = \mu_\theta(s_t)}\,\nabla_\theta \mu_\theta(s_t)\right], \tag{6}
$$

where $\tilde{A}_\mu$ and $\tilde{A}_\sigma$ denote batch statistics of the advantage function (its batch mean and standard deviation), and

$$
\hat{A}(s_t, a_t) = R(s_t, a_t) - V_\phi(s_t),
$$

$$
\bar{A}_\omega(s_t, a_t) = \nabla_a Q_\omega(s_t, a)\big|_{a = \mu_\theta(s_t)}\,(a_t - \mu_\theta(s_t)).
$$

The off-policy critic $Q_\omega$ is trained with mini-batches sampled uniformly from an experience replay buffer, i.e.,

$$
L_{Q_\omega} = \mathbb{E}\!\left[(y_t - Q_\omega(s_t, a_t))^2\right], \qquad y_t = r_t + \gamma\, Q'_{\omega'}(s_{t+1}, \mu'_{\theta'}(s_{t+1})), \tag{7}
$$

where $Q'_{\omega'}$ and $\mu'_{\theta'}$ are target networks, which help to stabilize the learning process. The weights of the target networks are updated by an Exponential Moving Average (EMA) of $Q_\omega$ and $\mu_\theta$, respectively. This strategy prevents $Q_\omega$ from chasing a moving target and effectively turns critic training into a supervised learning problem.

The computational graph of the proposed estimator is shown in Fig. 1. $E$ episode trajectories are collected to train the value function $V_\phi$ and compute the advantage $\hat{A}$, and are then added to an experience replay buffer. Once the replay buffer stores enough data, a mini-batch is uniformly sampled from it to train $Q_\omega$. Subsequently, the target critic and target actor are updated by EMA from their corresponding networks. Finally, the critic gradient is added to the policy gradient to reduce the variance of the gradient estimate. Note that the fusion-based gradient estimate (6) is similar to the low-variance gradient estimation strategy introduced in [33] and [34], but uses the batch mean and standard deviation to trade variance for bias. Moreover, only $Q_\omega$ is learned off-policy, while the policy network still needs to be trained with on-policy trajectories. This limitation is removed by an off-policy correction, to be introduced in the next section.

B. Perturbed Parameter Space for Exploration

Besides estimating the policy gradient properly, an exploration strategy is also very important in an RL framework. An agent needs to explore its action space sufficiently when it interacts with an environment; otherwise, it may become trapped in a poor local optimum. Plappert et al. [35] argue that perturbing the parameter space explores far more effectively than simply adding noise to the action space. Action noise is independent of the state $s_t$ and the action $a_t$, and it varies at every step even when the agent receives the same observation. Moreover, the scale of the added noise is a tricky hyperparameter to tune in practice, and badly chosen initial exploration noise may cause the system to diverge.

In contrast to action space noise, adding noise to the neural network weights keeps the actions consistent throughout an episode: the same observation always yields the same action. The parametric noise is sampled from a Gaussian distribution $\mathcal{N}(\mu, \sigma \odot \varepsilon)$, where $\varepsilon$ is a normalized random matrix. The noisy network parameters are obtained by using the reparameterization trick:

$$
\zeta \triangleq \mu + \sigma \odot \varepsilon. \tag{8}
$$
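The reparameterization in (8) can be illustrated with a short NumPy sketch. It assumes an elementwise noise scale and draws $\varepsilon$ from a standard normal distribution, which is only one plausible reading of "normalized random matrix". The names `sample_noisy_params`, `mu`, `sigma`, and `rng` are illustrative and do not come from the article.

```python
import numpy as np

def sample_noisy_params(mu, sigma, rng):
    """Return (zeta, eps) with zeta = mu + sigma * eps, elementwise.

    Assumption: eps is standard normal and has the same shape as mu.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps, eps

rng = np.random.default_rng(0)
mu = np.zeros((3, 4))          # hypothetical mean weights (q = 3, p = 4)
sigma = np.full((3, 4), 0.1)   # hypothetical elementwise noise scale
zeta, eps = sample_noisy_params(mu, sigma, rng)

# With zero noise scale the perturbed parameters equal the means.
zeta_no_noise, _ = sample_noisy_params(mu, np.zeros_like(mu), rng)
```

Because the randomness enters only through `eps`, the perturbed parameters remain a differentiable function of `mu` and `sigma`, which is the usual motivation for this trick.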

Accordingly, a fully-connected (FC) layer is changed to:

$$
y = f(\zeta^{w} x + \zeta^{b}), \tag{9}
$$

where $f(\cdot)$ denotes an activation function, and $\zeta^{w} \in \mathbb{R}^{q \times p}$ and $\zeta^{b} \in \mathbb{R}^{q}$ represent the noisy weight and noisy bias, respectively, for an FC layer with $p$ inputs and $q$ outputs. The perturbed neural network can be implemented simply by replacing the weights and biases with the noisy parameters. The difference between action space noise and the perturbed neural network structure is shown in Fig. 2. Note that the noisy parameters are sampled after the update operation and are kept fixed during the whole episode. To avoid sampling inappropriate noisy parameters, a dynamic regulation strategy is added that scales them by using the KL divergence.
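A noisy FC layer of the form (9) might look like the following NumPy sketch. The class name `NoisyLinear`, the initialization scheme, the value of `sigma_init`, and the `resample` method are assumptions made for illustration; the KL-based regulation of the noise scale is not shown because the text above does not specify it.

```python
import numpy as np

class NoisyLinear:
    """FC layer y = f(zeta_w @ x + zeta_b) with parameter-space noise."""

    def __init__(self, p, q, sigma_init=0.05, seed=0):
        self.rng = np.random.default_rng(seed)
        # Means and noise scales; this initialization is an assumption.
        self.mu_w = self.rng.uniform(-1.0, 1.0, (q, p)) / np.sqrt(p)
        self.mu_b = np.zeros(q)
        self.sigma_w = np.full((q, p), sigma_init)
        self.sigma_b = np.full(q, sigma_init)
        self.resample()

    def resample(self):
        """Draw new noisy parameters; intended to be called once per episode."""
        eps_w = self.rng.standard_normal(self.mu_w.shape)
        eps_b = self.rng.standard_normal(self.mu_b.shape)
        self.zeta_w = self.mu_w + self.sigma_w * eps_w
        self.zeta_b = self.mu_b + self.sigma_b * eps_b

    def __call__(self, x, f=np.tanh):
        return f(self.zeta_w @ x + self.zeta_b)

layer = NoisyLinear(p=4, q=2)
x = np.ones(4)
y1 = layer(x)      # first step of an episode
y2 = layer(x)      # same observation, same episode: identical output
layer.resample()   # new episode: new noisy parameters
y3 = layer(x)
```

Within one episode the same input always gives the same output, and the output changes only after `resample()` is called. This is the consistency property that distinguishes parameter space noise from per-step action noise.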

[Figure 1 diagram. Recoverable block labels: Policy, Log Probability, Policy Loss, Actor Gradients, Actor, Target Actor, Critic, Target Critic, Critic Loss, Target Value, TD Error, Value, Value Loss, EMA.]

FIGURE 1 Computational graph of a fusion-based policy gradient estimator. The critic $Q_\omega$ is added to the policy loss to reduce variance, and target networks are trained through EMA of the learned ones to maintain stability.
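The critic target of (7) and the EMA update of the target networks named in the caption can be sketched in NumPy as follows. The function names, the dictionary representation of network weights, and the values of `gamma` and `decay` are illustrative assumptions rather than settings from the article. Terminal-state handling is omitted.

```python
import numpy as np

def td_target(r, q_next_target, gamma=0.99):
    """y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})), following (7)."""
    return r + gamma * q_next_target

def critic_loss(y, q):
    """Mean squared TD error over a mini-batch."""
    return np.mean((y - q) ** 2)

def ema_update(target, online, decay=0.995):
    """In place: target <- decay * target + (1 - decay) * online."""
    for name in target:
        target[name] = decay * target[name] + (1.0 - decay) * online[name]
    return target

# Toy mini-batch of two transitions (made-up numbers).
r = np.array([1.0, 0.0])
q_next = np.array([2.0, 4.0])   # target critic evaluated at the target actor's action
y = td_target(r, q_next, gamma=0.5)
loss = critic_loss(y, np.array([1.0, 2.0]))

# Toy weights for an online network and its target copy.
online = {"w": np.ones(2)}
target = {"w": np.zeros(2)}
ema_update(target, online, decay=0.9)
```

With a decay close to 1, the target weights follow the learned weights slowly, so the regression target `y` changes little between consecutive critic updates.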
