IEEE Computational Intelligence Magazine - August 2019 - 24

chief's. Accordingly, an off-policy correction method is introduced next to prevent the learning process from deteriorating.
B. Off-Policy Correction

Using a decoupled distributed AC structure can increase data throughput significantly. However, it introduces a policy lag between the local actors and the central chief: the policy used by a local actor may be several update steps behind the parameters of the central chief. This policy lag can degrade learning performance or even destabilize learning [18]. In this section, truncated Importance Sampling (IS) is used to mitigate this side effect while maintaining high data throughput. Instead of computing the expected return as the full sum of discounted rewards, we use the n-step return to estimate the value,
\[
\hat{R}_t \triangleq r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V_\phi(s_{t+n})
= \sum_{k=t}^{t+n-1} \gamma^{k-t} r_k + \gamma^{n} V_\phi(s_{t+n}). \tag{10}
\]

The n-step return $\hat{R}_t$ is a trade-off between the one-step return and the full sum of discounted rewards. It propagates rewards faster and works well with a recurrent neural network, which is advantageous for policy parameterization in sequential decision making. Via a simple calculation, the n-step return $\hat{R}_t$ can be rewritten as:
\[
\hat{R}_t = \sum_{k=t}^{t+n-1} \gamma^{k-t} \left( r_k + \gamma V_\phi(s_{k+1}) - V_\phi(s_k) \right) + V_\phi(s_t). \tag{11}
\]
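As a sanity check, the two forms of the n-step return in (10) and (11) can be compared numerically. The sketch below is illustrative; the function names and test values are mine, not from the paper:

```python
def n_step_return(rewards, values, gamma, t, n):
    """n-step return as in (10): discounted rewards plus a
    bootstrapped tail value gamma^n * V(s_{t+n})."""
    ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, t + n))
    return ret + gamma ** n * values[t + n]

def n_step_return_td(rewards, values, gamma, t, n):
    """Telescoped form as in (11): discounted sum of one-step
    TD errors delta_k plus the current estimate V(s_t)."""
    ret = values[t]
    for k in range(t, t + n):
        delta_k = rewards[k] + gamma * values[k + 1] - values[k]
        ret += gamma ** (k - t) * delta_k
    return ret

# The two forms agree for any rewards/values: the TD errors telescope,
# cancelling every intermediate V and leaving only the tail value.
r = [1.0, 0.5, -0.2, 2.0, 0.0, 1.5]
v = [0.3, 0.1, 0.7, -0.4, 0.9, 0.2, 0.6]
assert abs(n_step_return(r, v, 0.99, t=1, n=4)
           - n_step_return_td(r, v, 0.99, t=1, n=4)) < 1e-12
```

The telescoped form (11) is what makes the per-step IS correction below possible, since each TD error $\delta_k$ can be reweighted individually.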

Subsequently, the first part of the n-step return $\hat{R}_t$ in (11) is multiplied by a truncated IS factor $\rho_k \triangleq \min\!\left(\bar{\rho},\ \pi_\theta(a_k|s_k)/\pi_{\theta'}(a_k|s_k)\right)$, where $\bar{\rho} \geq 1$ is a hyperparameter that removes the incentive when policy $\pi_\theta$ deviates too much from the old policy $\pi_{\theta'}$. The truncated IS return $\hat{Y}_t$ then becomes,
\[
\hat{Y}_t = \sum_{k=t}^{t+n-1} \gamma^{k-t} \rho_k \delta_k + V_\phi(s_t), \tag{12}
\]

where $\delta_k = r_k + \gamma V_\phi(s_{k+1}) - V_\phi(s_k)$ is the TD error of the value function $V_\phi$ at step $k$. Note that if $\pi_\theta(a_k|s_k) = \pi_{\theta'}(a_k|s_k)$, i.e., there is no policy lag, the truncated IS return $\hat{Y}_t$ reduces to the original n-step return $\hat{R}_t$. Therefore, it can serve as a general Bellman target for both on- and off-policy data. The truncated IS factor $\rho_k$ controls the fixed point of the TD update of $V_\phi$. For example, if the truncation coefficient $\bar{\rho}$ is infinite, then $\rho_k$ reduces to an ordinary IS factor, and the fixed point of the update rule is $V^{\pi_\theta}$. If $\bar{\rho}$ is close to zero, $V_\phi$ converges to the expected value of the old policy $\pi_{\theta'}$. Otherwise, $V_\phi$ converges to an expected value between those of $\pi_\theta$ and $\pi_{\theta'}$.
To further adjust the speed at which the truncated IS return $\hat{Y}_t$ converges to its fixed point, an additional weight factor is interpolated in (12) to measure how much a temporal difference $\rho_k \delta_k$ observed at time $k$ impacts the update of the value function at an earlier time $t$. Accordingly, the truncated IS return becomes,

\[
\hat{Y}_t = \sum_{k=t}^{t+n-1} \gamma^{k-t} \left( \prod_{i=t}^{k-1} c_i \right) \rho_k \delta_k + V_\phi(s_t), \tag{13}
\]

where the weight factor $c_i \triangleq \min\!\left(\bar{c},\ \pi_\theta(a_i|s_i)/\pi_{\theta'}(a_i|s_i)\right)$ can be regarded as a "trace cutting" coefficient as in [38], used to further reduce the variance. The product shrinks as the distance between the current policy $\pi_\theta$ and the previous policy $\pi_{\theta'}$ grows, cutting the trace. These weight factors do not affect the fixed point to which the value function converges; they only influence its convergence speed. The truncated IS factors $\rho_k$ and $c_i$ thus play different roles in $\hat{Y}_t$: the former determines the fixed point of the value function $V_\phi$, while the latter affects the convergence speed. With this correction, the policy gradient estimator can be fully parallelized and learn from off-policy data. DFPS, presented next, integrates fusion-based gradient estimation, parameter-noise exploration, and off-policy correction, aiming to maximize data throughput and achieve better performance than batched updates.
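The full corrected target (13) extends (12) with the running product of trace-cutting coefficients. A minimal sketch (names and test data are mine, not the paper's), showing that shrinking $\bar{c}$ cuts the trace down toward a one-step update:

```python
import math

def corrected_return(rewards, values, logp_cur, logp_old,
                     gamma, rho_bar, c_bar, t, n):
    """Off-policy-corrected target as in (13): the TD error at step k is
    scaled by rho_k and by the running product of trace-cutting
    coefficients c_i for i = t .. k-1 (the empty product is 1 at k = t)."""
    y = values[t]
    trace = 1.0
    for k in range(t, t + n):
        ratio = math.exp(logp_cur[k] - logp_old[k])
        delta_k = rewards[k] + gamma * values[k + 1] - values[k]
        y += gamma ** (k - t) * trace * min(rho_bar, ratio) * delta_k
        trace *= min(c_bar, ratio)   # extend the product with c_k
    return y

# With c_bar = 0 the trace is cut after the first step, leaving the
# one-step target V(s_t) + rho_t * delta_t.
r = [1.0, -0.5, 0.2]
v = [0.4, 0.1, -0.3, 0.6]
lp_cur = [-0.2, -0.9, -0.4]
lp_old = [-0.3, -0.6, -0.4]
one_step = corrected_return(r, v, lp_cur, lp_old, 0.9, 1.0, 0.0, t=0, n=3)
rho0 = min(1.0, math.exp(lp_cur[0] - lp_old[0]))
assert abs(one_step - (v[0] + rho0 * (r[0] + 0.9 * v[1] - v[0]))) < 1e-12
```

On-policy data (all ratios equal to one, with $\bar{\rho}, \bar{c} \geq 1$) recovers the plain n-step return, consistent with the reduction noted after (12).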
C. Distributed Fusion-based Policy Search Algorithm

Distributed fusion-based policy search consists of a policy $\pi_\theta$, a value function $V_\phi$, and a Q function $Q_\omega$, all parameterized by deep neural networks. The policy learning scheme is illustrated in Fig. 4. At the beginning of an episode, the central chief perturbs its weights with parameter-space noise and pushes them to the $m$ local actors. These actors interact with their environments and collect trajectories independently. The trajectories are then sent to the central chief through a queue and stored in an experience replay buffer. At training time, the critic $Q_\omega$ is learned by using (7) and sampling mini-batches from the replay buffer. The target networks are slowly updated by EMA with respect to their corresponding online networks, which stabilizes the learning process.
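The EMA (Polyak) target-network update mentioned above can be sketched as follows; the function name and the value of `tau` are illustrative assumptions, not taken from the paper:

```python
def ema_update(target, online, tau):
    """Polyak/EMA target-network update:
    w_target <- (1 - tau) * w_target + tau * w_online.
    A small tau (e.g. 0.005, an assumed typical value) makes the
    Bellman targets drift slowly, stabilizing critic learning."""
    return [(1.0 - tau) * wt + tau * wo for wt, wo in zip(target, online)]

target = [0.0, 0.0]
online = [1.0, 2.0]
for _ in range(3):
    target = ema_update(target, online, tau=0.5)
# The target weights approach the online weights geometrically.
assert abs(target[0] - 0.875) < 1e-12
assert abs(target[1] - 1.75) < 1e-12
```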
Subsequently, the learned critic $Q_\omega$ is interpolated into a gradient estimator to reduce the variance. The central chief updates its policy $\pi_\theta$ in the direction of the distributed fusion-based policy gradient estimate,
\[
\rho_t \nabla_\theta \log \pi_\theta(a_t|s_t) \left( r_t + \gamma \hat{Y}_{t+1} - V_\phi(s_t) - \bar{A}_\omega(s_t, a_t) \right)
+ \nabla_a Q_\omega(s_t, a)\big|_{a = \mu_\theta(s_t)}\, \nabla_\theta \mu_\theta(s_t). \tag{14}
\]

The parameter noise is periodically resampled from a Gaussian distribution and added to the neural network weights. The updated weights of policy $\pi_\theta$ are then pushed to the local actors. Finally, $V_\phi$ is updated with the $\ell_2$ loss of the truncated IS return $\hat{Y}_t$, using mini-batches sampled from the replay buffer,
\[
\nabla_\phi V_\phi(s_t) \left( \hat{Y}_t - V_\phi(s_t) \right). \tag{15}
\]
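A minimal sketch of the value update (15), using a linear value function $V_\phi(s) = \phi^\top s$ for illustration (the linear form, function name, and learning rate are my assumptions; the paper uses a deep network):

```python
import numpy as np

def value_step(phi, s_t, y_t, lr=0.1):
    """One gradient step on the l2 value loss 0.5 * (Y_t - V_phi(s_t))^2.
    For linear V, grad_phi V_phi(s) = s, so the step follows (15):
    phi moves along grad_phi V_phi(s_t) * (Y_t - V_phi(s_t))."""
    td = y_t - phi @ s_t           # target minus current estimate
    return phi + lr * td * s_t

phi = np.zeros(2)
s_t = np.array([1.0, 2.0])
for _ in range(200):
    phi = value_step(phi, s_t, y_t=3.0)
assert abs(phi @ s_t - 3.0) < 1e-6   # V_phi(s_t) converges to the target
```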

Notice that the value function $V_\phi(s_t)$ should be updated after the policy $\pi_\theta$; otherwise, additional bias can be introduced into the gradient estimate. In practice, we use IS in place of the log probability and add two regularization terms, adjusted by $\beta$ and $\eta$, to ensure that the policy is updated within a trust region. Algorithm 1 shows the pseudo-code of the proposed method.


