
to verify the theoretical results, which show that the proposed algorithm achieves, and sometimes surpasses, state-of-the-art performance.


I. Introduction

Animals have diverse body structures that enable agile locomotion suited to their habitats. Researchers transfer animal behaviors to robots through biomimetics, which allows robots to move rapidly through precarious and irregular surroundings where people cannot go [1]. Accordingly, such robots are deployed in search and rescue, medical endoscopy, and industrial surveillance domains to exploit their respective advantages [2]-[4]. Compared with controlling conventional mobile or airborne robots, motion control of these complex dynamic (and possibly underactuated) systems is exceedingly challenging [5]-[7].
Recently, deep reinforcement learning (RL) has been demonstrated to be capable of solving complex control problems directly from high-dimensional sensory inputs without any hand-crafted feature engineering [8]-[11]. Deep RL has achieved remarkable progress in the Atari domain with an algorithm named Deep Q Network (DQN) [12], which can play Atari games at or even beyond human level. The success of DQN has fostered many researchers' interest in deep RL. As a result, many DQN variants have been proposed to improve its performance, such as the Double DQN architecture [13], prioritized experience replay [14], and the dueling DQN architecture [15]. Bellemare et al. [16] extend the Bellman equation to a distributional perspective, which outperforms previously introduced methods across most Atari games. An algorithm combining the aforementioned improvements, called Rainbow, has been studied empirically to show which components are crucial for overall performance in the Atari domain [17]. A scalable distributed RL architecture can master sophisticated behaviors with raw image inputs in less than an hour and obtains new state-of-the-art results on discrete control tasks [18]. Another astounding breakthrough in sequential decision-making problems has been made by Silver et al. [19]. Their algorithm combines deep learning, RL, and Monte Carlo tree search, and has beaten professional human Go players. Moreover, the algorithm has recently been improved to master chess, shogi, and Go entirely from scratch without any human knowledge [20]-[22].
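For concreteness, the shift from expected-value learning in DQN [12] to the distributional view of [16] can be summarized by their standard forms (generic notation from the cited works, not this paper's):

$$Q^*(s,a) = \mathbb{E}\big[r + \gamma \max_{a'} Q^*(s',a') \mid s,a\big], \qquad Z(s,a) \overset{D}{=} R(s,a) + \gamma Z(S',A'),$$

where $Q^*$ is the optimal state-action value and $Z$ is the random return whose expectation recovers $Q$.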
To deal with continuous control tasks, many appealing
methods have been proposed. In a linear optimal control
domain, sliding mode controllers and fuzzy control schemes are
presented for reverse osmosis desalination [23]. In a nonlinear
optimal control domain, a model-free optimal control scheme for cruise control, which considers both safety and comfort performance, is proposed and tested on a hardware-in-the-loop simulator [24]. A second-order data-driven fuzzy controller is developed for twin rotor aerodynamic systems and fine-tuned
by a Grey Wolf optimizer [25]. Based on an actor-critic
(AC) architecture, a Deep Deterministic Policy Gradient
(DDPG) algorithm is presented by Lillicrap et al. [26]. DDPG
presents good sample-efficiency, but it can be sensitive to hyper-parameters and hard to tune. In contrast to DDPG, an on-policy optimization algorithm called Trust Region Policy Optimization (TRPO) is proposed [27]. Its core idea is to keep the updated policy $\pi_\theta(a|s)$ close to the latest policy $\pi_{\theta_{\mathrm{old}}}(a|s)$, which ensures learning stability. The distance between $\pi_\theta(a|s)$ and $\pi_{\theta_{\mathrm{old}}}(a|s)$ is measured by the Kullback-Leibler (KL) divergence. A Proximal Policy Optimization (PPO) algorithm, which only requires first-order derivatives, is proposed in
[28]. In order to reduce the wall-clock learning time, several
distributed architectures are proposed to gather experiences in
parallel. Mnih et al. [29] propose an asynchronous RL algorithm. Horgan et al. [30] extend prioritized experience replay
to a distributed framework, which is referred to as Ape-X. A
distributed distributional deterministic policy gradient algorithm obtains state-of-the-art results in continuous control
tasks by combining the Ape-X framework and distributional
value estimation [31].
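For reference, the trust-region idea and its first-order relaxation can be written in their standard forms from [27] and [28] (generic notation, not this paper's):

$$\max_\theta \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t|s_t)}\hat{A}_t\right] \quad \text{s.t.} \quad \mathbb{E}_t\!\left[D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot|s_t)\,\|\,\pi_\theta(\cdot|s_t)\big)\right] \le \delta \quad \text{(TRPO)},$$

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\big(r_t(\theta)\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right], \quad r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t|s_t)} \quad \text{(PPO)},$$

where $\hat{A}_t$ is an advantage estimate, $\delta$ bounds the KL step, and $\epsilon$ is the clipping range.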
The aforementioned deep RL frameworks suffer from poor sample-efficiency and/or instability issues [8]. Sample-efficiency refers to the number of experiences an environment needs to provide during training (e.g., the number of actions an agent takes and the number of resulting states and rewards it observes) in order to obtain an agent's optimal behavior. Note that poor sample-efficiency is one of the main impediments to training an agent in the real world. Intuitively, an algorithm is called sample-efficient if it can make good use of every single piece of experience to generate and rapidly improve its policy. Conventional deep RL algorithms add noise to actions or use an entropy bonus for exploration. However, inappropriate state-independent noise may cause a system to deteriorate or even diverge. These issues remain a challenge and need to be addressed urgently. Our previous work presents a robust neuro-optimal control scheme based on adaptive dynamic programming, which provides an optimal path-following policy for snake robots [32]. However, it requires an accurate dynamic model of the robot, which is sometimes difficult to obtain. Motivated by distributed RL frameworks [29]-[31], state-action value function variance reduction approaches [33], [34], and a parameter space exploration technique [35], we focus on a model-free policy gradient method and introduce a Distributed Fusion-based Policy Search (DFPS) algorithm, which is able to learn optimal locomotion from scratch efficiently. The main contributions of this paper are as follows:
1) A perturbed fusion-based policy search algorithm, which uses a state-action value function and perturbed neural network architectures, is presented for variance reduction and diversified exploration (the two exploration styles are sketched after this list).
2) The proposed policy search algorithm is fully parallelized in a decoupled RL architecture to maximize data throughput and accelerate the learning process. An additional off-policy correction is added to mitigate the potential policy lag between local actors and a central chief.
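As a concrete illustration of the difference between conventional action-space noise and the perturbed-network exploration mentioned in contribution 1, the following is a minimal sketch in the spirit of parameter space exploration [35]; it is not the DFPS implementation, and the linear policy, dimensions, and noise scale are illustrative assumptions.

import numpy as np

# Minimal sketch (not the DFPS implementation) contrasting two exploration styles:
# state-independent action noise versus perturbing the policy parameters themselves.

def policy(weights, state):
    """Toy linear policy: action = W @ state."""
    return weights @ state

def act_with_action_noise(weights, state, sigma=0.1):
    # Conventional exploration: add Gaussian noise directly to the action.
    action = policy(weights, state)
    return action + sigma * np.random.randn(*action.shape)

def act_with_parameter_noise(weights, state, sigma=0.1):
    # Parameter-space exploration: perturb a copy of the weights (e.g., once per
    # episode) and act greedily with the perturbed policy; the resulting
    # exploration is state-dependent and temporally consistent.
    perturbed_weights = weights + sigma * np.random.randn(*weights.shape)
    return policy(perturbed_weights, state)

# Example usage with made-up dimensions (4-D state, 2-D action).
state = np.random.randn(4)
weights = np.random.randn(2, 4)
print(act_with_action_noise(weights, state))
print(act_with_parameter_noise(weights, state))

Because the parameter perturbation is held fixed over many steps, it yields consistent, state-dependent exploratory behavior rather than the uncorrelated per-step jitter produced by action noise.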
The rest of this paper is organized as follows. In Section II, the
RL problem is stated. A fusion-based policy gradient estimator


