IEEE Computational Intelligence Magazine - August 2019 - 22

with on-policy trajectories. This limitation is removed by an
off-policy correction, to be introduced in the next section.

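The total variance of the gradient estimator can be further
reduced by subtracting a state-dependent baseline $V_\phi(s_t)$ and
centering the learning signal. Accordingly, the policy gradient
estimate becomes

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\frac{\hat{A}(s_t, a_t) - \bar{A}_\omega(s_t, a_t) - \tilde{A}_\mu}{\tilde{A}_\sigma}\right] + \mathbb{E}_{\rho_\pi}\left[\nabla_a Q_\omega(s_t, a)\big|_{a=\mu_\theta(s_t)}\,\nabla_\theta \mu_\theta(s_t)\right], \tag{6}
$$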

B. Perturbed Parameter Space for Exploration

Besides a proper estimate of the policy gradient, an exploration
strategy is also very important in an RL framework. An agent
needs to explore its action space sufficiently when it interacts
with an environment. Otherwise, it may become trapped in a
poor local minimum. Plappert et al. [35] state that parameter
space noise is much better than simply adding noise to the action
space. Action noise is independent of the state $s_t$ and the action $a_t$, and
varies at every step even when the agent receives the same observation.
Moreover, the scale of the added noise is a tricky hyperparameter to tune
in practice. Bad initial exploration noise may
cause the system to diverge.

In contrast to action space noise, adding noise to the neural network
weights keeps the actions consistent within the
whole episode. The parametric noise is sampled from a Gaussian
distribution $\mathcal{N}(\mu, \sigma \times \varepsilon)$, where $\varepsilon$ is a normalized random
matrix. The noisy network parameters are obtained by using
the reparameterization trick given in (8).

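In (6), $\tilde{A}_\mu$ and $\tilde{A}_\sigma$ denote batch statistics (mean and standard deviation) of the advantage function,
and

$$
\hat{A}(s_t, a_t) = R(s_t, a_t) - V_\phi(s_t),
$$

$$
\bar{A}_\omega(s_t, a_t) = \nabla_a Q_\omega(s_t, a)\big|_{a=\mu_\theta(s_t)}\,(a_t - \mu_\theta(s_t)).
$$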
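The centered learning signal that multiplies the log-probability gradient in (6) can be illustrated with a minimal sketch. It assumes scalar actions, assumes that the batch statistics are taken over $\hat{A}$, and adds a small constant `eps` to avoid division by zero; the function names and `eps` are hypothetical and do not come from the article.

```python
import statistics


def first_order_critic_term(dq_da, actions, policy_means):
    """A_bar = dQ/da evaluated at a = mu(s_t), times (a_t - mu(s_t)), for scalar actions."""
    return [g * (a - m) for g, a, m in zip(dq_da, actions, policy_means)]


def centered_learning_signal(adv_hat, adv_bar, eps=1e-8):
    """Return (A_hat - A_bar - batch_mean) / batch_std for every sample in the batch.

    adv_hat: Monte Carlo advantages R(s_t, a_t) - V_phi(s_t)
    adv_bar: first-order critic terms from first_order_critic_term
    eps:     assumed numerical safeguard, not part of equation (6)
    """
    mean = statistics.fmean(adv_hat)   # batch mean (assumed to be taken over A_hat)
    std = statistics.pstdev(adv_hat)   # batch standard deviation (population form)
    return [(ah - ab - mean) / (std + eps) for ah, ab in zip(adv_hat, adv_bar)]


if __name__ == "__main__":
    returns = [1.0, 2.0, 3.0, 4.0]
    values = [0.5, 1.5, 2.0, 3.0]
    adv_hat = [r - v for r, v in zip(returns, values)]
    adv_bar = first_order_critic_term(
        dq_da=[0.2, -0.1, 0.4, 0.0],
        actions=[0.5, 1.0, 0.5, 0.3],
        policy_means=[0.0, 0.0, 1.0, 0.0],
    )
    print(centered_learning_signal(adv_hat, adv_bar))
```

In a full implementation this signal would weight $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ for each sample, and the second term of (6) would be added through the critic.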

The off-policy critic $Q_\omega$ is trained with uniformly sampled
mini-batches from an experience replay buffer, i.e.,

$$
L_{Q_\omega} = \mathbb{E}\left[(y_t - Q_\omega(s_t, a_t))^2\right], \qquad
y_t = r_t + \gamma\, Q'_{\omega'}(s_{t+1}, \mu'_{\theta'}(s_{t+1})), \tag{7}
$$
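As a rough illustration of (7), the sketch below draws a uniform mini-batch from a replay buffer, forms the targets $y_t$, and evaluates the mean squared error. The target-network outputs $Q'_{\omega'}(s_{t+1}, \mu'_{\theta'}(s_{t+1}))$ are assumed to be precomputed and passed in as plain numbers, terminal-state handling is left out, and all function names are hypothetical.

```python
import random


def sample_minibatch(replay_buffer, batch_size, rng):
    """Draw a mini-batch uniformly at random, without replacement."""
    return rng.sample(replay_buffer, batch_size)


def td_targets(rewards, next_target_q, gamma):
    """y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})), with Q' values precomputed."""
    return [r + gamma * q for r, q in zip(rewards, next_target_q)]


def critic_loss(targets, q_values):
    """Mean squared error between the targets y_t and Q(s_t, a_t)."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


if __name__ == "__main__":
    rng = random.Random(0)
    # Each entry: (reward, target-critic value at the next state, current critic value).
    buffer = [(1.0, 2.0, 1.5), (0.0, 4.0, 3.0), (0.5, 1.0, 0.5), (2.0, 0.0, 1.0)]
    batch = sample_minibatch(buffer, batch_size=2, rng=rng)
    rewards, next_q, q_now = zip(*batch)
    y = td_targets(rewards, next_q, gamma=0.99)
    print(critic_loss(y, q_now))
```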

where $Q'_{\omega'}$ and $\mu'_{\theta'}$ are target networks, which help to stabilize
the learning process. The weights of the target networks are
updated by an Exponential Moving Average (EMA) of $Q_\omega$ and
$\mu_\theta$, respectively. This strategy prevents $Q_\omega$ from chasing a moving target and transforms RL into a supervised learning problem.
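The EMA update of a target network can be sketched as follows. The article does not state the smoothing rate, so `tau` is an assumed hyperparameter, and the parameters are represented as flat lists of floats purely for illustration.

```python
def ema_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward its online counterpart.

    tau is an assumed smoothing rate; a small value makes the target change slowly.
    """
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, online_params)]


if __name__ == "__main__":
    target = [0.0, 1.0]
    online = [1.0, 1.0]
    for _ in range(3):
        target = ema_update(target, online, tau=0.1)
    print(target)  # the first entry drifts slowly toward 1.0
```

Because the target moves only a little at each step, the regression target in (7) stays nearly fixed between consecutive critic updates.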
The computational graph of the proposed estimator is
shown in Fig. 1. First, $E$ episode trajectories are collected to train
the value function $V_\phi$ and compute the advantage $\hat{A}$, which are then
added to an experience replay buffer. Once the replay buffer
stores enough data, a mini-batch is uniformly sampled from it
to train $Q_\omega$. Subsequently, the target critic and target actor are
updated by an EMA of their corresponding networks.
Finally, the critic gradient is added to the policy
gradient to reduce the variance of its estimate. Note
that the fusion-based gradient estimate (6) is similar to the
low-variance gradient estimation strategies introduced in [33] and
[34], but uses the batch mean and standard deviation to
trade variance for bias. Moreover, only $Q_\omega$ is learned
off-policy, while the policy network still needs to be trained


$$
\zeta \triangleq \mu + \sigma \times \varepsilon. \tag{8}
$$
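A minimal sketch of the reparameterization in (8) is given below. It assumes that the product of $\sigma$ and $\varepsilon$ is taken element by element and that the entries of $\varepsilon$ are independent standard normal draws; the function name and the flat-list representation of the parameters are hypothetical.

```python
import random


def sample_noisy_params(mu, sigma, rng):
    """Return zeta = mu + sigma * eps, element by element, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


if __name__ == "__main__":
    rng = random.Random(42)
    mu = [0.5, -0.3, 0.0]
    sigma = [0.1, 0.1, 0.0]   # a zero scale leaves that parameter unperturbed
    print(sample_noisy_params(mu, sigma, rng))
```

Writing the sample as a deterministic function of $\mu$, $\sigma$, and an independent draw $\varepsilon$ keeps the learnable quantities separate from the randomness.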

Accordingly, a fully-connected (FC) layer is changed to:
$$
y = f(\zeta^{w} x + \zeta^{b}), \tag{9}
$$

where $f(\cdot)$ denotes an activation function, and $\zeta^{w} \in \mathbb{R}^{q \times p}$
and $\zeta^{b} \in \mathbb{R}^{q}$ represent the noisy weight and noisy bias, respectively,
for an FC layer with $p$ inputs and $q$ outputs. The perturbed
neural network can be implemented simply by replacing the
weights and biases with the noisy parameters. The difference
between action space noise and the perturbed neural network
structure is shown in Fig. 2. Note that the noisy parameters are
sampled after the update operation and kept fixed during the
whole episode. To avoid sampling inappropriate noisy parameters, a dynamic regulation strategy based on KL divergence is added to scale them.
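The noisy FC layer of (9) and the rule of holding the noise fixed for a whole episode can be sketched as below. The class name, the initialization, the fixed `sigma_init`, and the `tanh` activation are assumptions made for illustration, and the KL-based regulation of the noise scale is not shown because its details are not given here.

```python
import math
import random


class NoisyLinear:
    """FC layer y = f(zeta_w x + zeta_b) with p inputs and q outputs."""

    def __init__(self, p, q, sigma_init=0.1, seed=0):
        self.rng = random.Random(seed)
        bound = 1.0 / math.sqrt(p)  # assumed initialization range
        self.mu_w = [[self.rng.uniform(-bound, bound) for _ in range(p)] for _ in range(q)]
        self.mu_b = [0.0] * q
        self.sigma_w = [[sigma_init] * p for _ in range(q)]
        self.sigma_b = [sigma_init] * q
        self.resample()

    def resample(self):
        """Draw new noisy parameters; intended to be called once per episode."""
        g = self.rng.gauss
        self.zeta_w = [[m + s * g(0.0, 1.0) for m, s in zip(m_row, s_row)]
                       for m_row, s_row in zip(self.mu_w, self.sigma_w)]
        self.zeta_b = [m + s * g(0.0, 1.0) for m, s in zip(self.mu_b, self.sigma_b)]

    def forward(self, x, f=math.tanh):
        """Apply the layer using the currently held noisy parameters."""
        return [f(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.zeta_w, self.zeta_b)]


if __name__ == "__main__":
    layer = NoisyLinear(p=3, q=2)
    obs = [0.1, 0.2, 0.3]
    print(layer.forward(obs) == layer.forward(obs))  # same observation, same output
    layer.resample()                                 # new episode, new perturbation
    print(layer.forward(obs))
```

Within an episode the same observation always maps to the same output, which is the consistency that action space noise lacks.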

FIGURE 1 Computational graph of a fusion-based policy gradient estimator. The critic $Q_\omega$ is added to the policy loss to reduce variance, and the target
networks are trained through an EMA of the learned ones to maintain stability. (Diagram labels: Policy Loss, Actor Gradients, Policy, Log Probability, Value, Critic, Target Critic, EMA, Value Loss, Actor, Target Actor.)
