Signal Processing - November 2017 - 34

as optimal control). GPS learns from them by using supervised
learning in combination with importance sampling, which corrects for off-policy samples [40]. This approach effectively biases
the search toward a good (local) optimum. GPS works in a loop,
by optimizing policies to match sampled trajectories and optimizing trajectory distributions to match the policy and minimize
costs. Levine et al. [41] showed that it was possible to train visuomotor policies for a robot "end to end," straight from the RGB
pixels of the camera to motor torques, and, hence, provide one of
the seminal works in DRL.
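The importance-sampling correction mentioned above can be illustrated with a minimal sketch. It is not GPS itself: the three-action setup, the probabilities, the costs, and the function name `importance_sampled_cost` are all assumptions chosen for illustration. Actions are drawn from a behavior distribution, and each sampled cost is reweighted by the ratio of target to behavior probability, so the average estimates the expected cost under the target policy.

```python
import random


def importance_sampled_cost(target_probs, behavior_probs, costs, n_samples, seed=0):
    """Estimate the expected cost under target_probs using actions drawn from behavior_probs."""
    rng = random.Random(seed)
    actions = rng.choices(range(len(costs)), weights=behavior_probs, k=n_samples)
    total = 0.0
    for a in actions:
        weight = target_probs[a] / behavior_probs[a]  # importance weight
        total += weight * costs[a]
    return total / n_samples


# Hypothetical toy problem: three discrete actions with fixed costs.
behavior = [0.5, 0.3, 0.2]  # distribution that generated the samples
target = [0.1, 0.2, 0.7]    # distribution we want to evaluate
costs = [1.0, 2.0, 4.0]

estimate = importance_sampled_cost(target, behavior, costs, n_samples=20000)
exact = sum(p * c for p, c in zip(target, costs))
print(f"importance-sampled estimate: {estimate:.3f}, exact value: {exact:.3f}")
```

The estimate is unbiased, but its variance grows when the two distributions differ strongly, which is one reason such corrections are usually combined with other variance reduction techniques.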
A more commonly used method is to use a trust region, in
which optimization steps are restricted to lie within a region
where the approximation of the true cost function still holds.
By preventing updated policies from deviating too wildly from
previous policies, the chance of a catastrophically bad update is
lessened, and many algorithms that use trust regions guarantee
or practically result in monotonic improvement in policy performance. The idea of constraining each policy gradient update, as
measured by the Kullback-Leibler (KL) divergence between the
current and proposed policy, has a long history in RL [28]. One
of the newer algorithms in this line of work, TRPO, has been
shown to be relatively robust and applicable to domains with
high-dimensional inputs [70]. To achieve this, TRPO optimizes a surrogate objective function; specifically, it optimizes an
(importance sampled) advantage estimate, constrained using a
quadratic approximation of the KL divergence. While TRPO can
be used as a pure policy gradient method with a simple baseline,
later work by Schulman et al. [71] introduced generalized advantage estimation (GAE), which proposed several more advanced
variance reduction baselines. The combination of TRPO and
GAE remains one of the state-of-the-art RL techniques in continuous control.
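The quantities named in this paragraph can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the TRPO or GAE implementation: the function names, the default values of `gamma` and `lam`, and the use of small categorical distributions are all hypothetical. In particular, the sketch checks the exact KL divergence for a single state, whereas TRPO as described above works with a quadratic approximation of the KL divergence.

```python
import math


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    values must have len(rewards) + 1 entries; the last entry is the
    bootstrap value of the state reached after the final reward
    (0.0 if that state is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running  # discounted sum of TD errors
        advantages[t] = running
    return advantages


def surrogate_objective(new_probs, old_probs, advantages):
    """Importance-sampled advantage, averaged over sampled (state, action) pairs.

    new_probs[i] and old_probs[i] are the probabilities that the proposed and
    current policies assign to the i-th sampled action.
    """
    ratios = [n / o for n, o in zip(new_probs, old_probs)]
    return sum(r * a for r, a in zip(ratios, advantages)) / len(advantages)


def kl_categorical(p, q):
    """KL(p || q) for two categorical distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def within_trust_region(old_dist, new_dist, max_kl):
    """Accept a proposed policy only if it stays close to the current one."""
    return kl_categorical(old_dist, new_dist) <= max_kl


advantages = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
print(advantages)
print(within_trust_region([0.5, 0.5], [0.55, 0.45], max_kl=0.01))
```

Setting `lam=0.0` reduces the estimate to the one-step TD error, while `lam=1.0` recovers the discounted return minus the value baseline; intermediate values trade bias against variance.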

Actor-critic methods
Actor-critic approaches have grown in popularity as an
effective means of combining the benefits of policy search
methods with learned value functions, which are able to
learn from full returns and/or TD errors. They can benefit
from improvements in both policy gradient methods, such as
GAE [71], and value function methods, such as target networks [47]. In the last few years, DRL actor-critic methods
have been scaled up from learning simulated physics tasks
[22], [44] to real robotic visual navigation tasks [100], directly
from image pixels.
One recent development in the context of actor-critic algorithms is deterministic policy gradients (DPGs) [72], which
extend the standard policy gradient theorems for stochastic policies [97] to deterministic policies. One of the major advantages of
DPGs is that, while stochastic policy gradients integrate over both
state and action spaces, DPGs only integrate over the state space,
requiring fewer samples in problems with large action spaces. In
the initial work on DPGs, Silver et al. [72] introduced and demonstrated an off-policy actor-critic algorithm that vastly improved
upon a stochastic policy gradient equivalent in high-dimensional
continuous control problems. Later work introduced deep DPG,
which utilized neural networks to operate on high-dimensional,
visual state spaces [44]. In the same vein as DPGs, Heess et al.
[22] devised a method for calculating gradients to optimize stochastic policies by "reparameterizing" [30], [60] the stochasticity away from the network, thereby allowing standard gradients
to be used (instead of the high-variance REINFORCE estimator [97]). The resulting stochastic value gradient (SVG) methods
are flexible and can be used both with (SVG(0) and SVG(1))
and without (SVG(∞)) value function critics, and with (SVG(∞)
and SVG(1)) and without (SVG(0)) models. Later work
proceeded to integrate DPGs and SVGs with RNNs, allowing
them to solve continuous control problems in POMDPs, learning
directly from pixels [21]. Together, DPGs and SVGs can be considered algorithmic approaches for improving learning efficiency
in DRL.
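The difference between a reparameterized gradient and the REINFORCE estimator can be seen on a one-dimensional toy problem. The sketch below is an assumption-laden illustration rather than SVG itself: the Gaussian "policy" with mean `mu`, the objective `action ** 2`, and the function name `gradient_samples` are all hypothetical. Both estimators target the derivative of the expected objective with respect to `mu` (here exactly `2 * mu`), but the reparameterized one differentiates through the sampled action and typically has much lower variance.

```python
import random
import statistics


def gradient_samples(mu, sigma, n_samples, seed=0):
    """Per-sample estimates of d/d(mu) E[action^2], with action ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    pathwise, score_function = [], []
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)           # noise drawn independently of mu
        action = mu + sigma * eps           # reparameterized sample
        pathwise.append(2.0 * action)       # chain rule: d(action^2)/d(action) * d(action)/d(mu)
        score = (action - mu) / sigma ** 2  # d/d(mu) of log N(action; mu, sigma^2)
        score_function.append(action ** 2 * score)  # REINFORCE-style estimate
    return pathwise, score_function


pathwise, score_function = gradient_samples(mu=1.0, sigma=1.0, n_samples=5000)
print("pathwise mean and variance:", statistics.mean(pathwise), statistics.variance(pathwise))
print("score-function mean and variance:", statistics.mean(score_function), statistics.variance(score_function))
```

Both sample means should land near the true gradient of 2.0, while the sample variance of the score-function estimates is noticeably larger for this toy objective.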
An orthogonal approach to speeding up learning is to exploit
parallel computation. By keeping a canonical set of parameters
that are read by and updated in an asynchronous fashion by multiple copies of a single network, computation can be efficiently
distributed over both processing cores in a single central processing unit (CPU), and across CPUs in a cluster of machines. Using
a distributed system, Nair et al. [51] developed a framework for
training multiple DQNs in parallel, achieving both better performance and a reduction in training time. However, the simpler asynchronous advantage actor-critic (A3C) algorithm [48],
developed for both single and distributed machine settings, has
become one of the most popular DRL techniques in recent times.
A3C combines advantage updates with the actor-critic formulation and relies on asynchronously updated policy and value function networks trained in parallel over several processing threads.
The use of multiple agents, situated in their own, independent
environments, not only stabilizes improvements in the parameters, but conveys an additional benefit in allowing for more
exploration to occur. A3C has been used as a standard starting point in many subsequent works, including the work of Zhu
et al. [100], who applied it to robotic navigation in the real world
through visual inputs.
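The idea of several workers asynchronously reading and updating one canonical set of parameters can be sketched with plain threads. This is a toy illustration, not A3C: there is no policy, value function, or environment, and the quadratic loss, the names `worker` and `train_async`, and all hyperparameters are assumptions. Each worker draws its own noisy data, computes a gradient from a possibly stale copy of the shared parameter, and applies the update without locking.

```python
import random
import threading


def worker(shared, optimum, steps, lr, seed):
    """Run lock-free gradient steps on the shared parameter."""
    rng = random.Random(seed)  # each worker has its own independent noise source
    for _ in range(steps):
        theta = shared[0]                            # read the current shared parameter
        noisy_target = optimum + rng.gauss(0.0, 0.1)
        grad = 2.0 * (theta - noisy_target)          # gradient of (theta - noisy_target)^2
        shared[0] -= lr * grad                       # asynchronous update, no lock taken


def train_async(n_workers=4, steps=500, lr=0.05, optimum=3.0):
    """Start several workers that all update one shared parameter."""
    shared = [0.0]  # canonical parameter, shared by every thread
    threads = [
        threading.Thread(target=worker, args=(shared, optimum, steps, lr, seed))
        for seed in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared[0]


final_theta = train_async()
print(f"shared parameter after training: {final_theta:.3f}")
```

Because updates are not synchronized, some may overwrite others, and the exact result varies between runs; the shared parameter should nevertheless settle close to the optimum of 3.0.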
There have been several major advancements on the original
A3C algorithm that reflect various motivations in the field of
DRL. The first is actor-critic with experience replay [93], which
adds off-policy bias correction to A3C, allowing it to use experience replay to improve sample complexity. Others have attempted
to bridge the gap between value-based and policy-based RL, utilizing
theoretical advancements to improve upon the original A3C [50],
[54]. Finally, there is a growing trend toward exploiting auxiliary
tasks to improve the representations learned by DRL agents and,
hence, improve both the learning speed and final performance of
these agents [26], [46].

Current research and challenges
To conclude, we will highlight some current areas of research
in DRL and the challenges that still remain. Previously, we have
focused mainly on model-free methods, but we will now examine a few model-based DRL algorithms in more detail. Model-based RL algorithms play an important role in making RL data
efficient and in trading off exploration and exploitation. After
tackling exploration strategies, we shall then address hierarchical

IEEE SIGNAL PROCESSING MAGAZINE
