
RL (HRL), which imposes an inductive bias on the final policy
by explicitly factorizing it into several levels. When available,
trajectories from other controllers can be used to bootstrap the
learning process, leading us to imitation learning and inverse RL
(IRL). For the final topic, we will look at multiagent systems,
which have their own special considerations.

Model-based RL
The key idea behind model-based RL is to learn a transition
model that allows for simulation of the environment without
interacting with the environment directly. Model-based RL
does not assume specific prior knowledge. However, in practice, we can incorporate prior knowledge (e.g., physics-based
models [29]) to speed up learning. Model learning plays an
important role in reducing the number of required interactions with the (real) environment, which may be limited in
practice. For example, it is unrealistic to perform millions of
experiments with a robot in a reasonable amount of time and
without significant hardware wear and tear. There are various
approaches to learn predictive models of dynamical systems
using pixel information. Based on the deep dynamical model
[90], where high-dimensional observations are embedded into
a lower-dimensional space using autoencoders, several model-based DRL algorithms have been proposed for learning models
and policies from pixel information [55], [91], [95]. If a sufficiently accurate model of the environment can be learned, then
even simple controllers can be used to control a robot directly
from camera images [14]. Learned models can also be used to
guide exploration purely based on simulation of the environment, with deep models allowing these techniques to be scaled
up to high-dimensional visual domains [75].
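
To make the control flow concrete, the sketch below (in Python with NumPy; the linear model, random-shooting planner, dimensions, and reward function are illustrative assumptions, not any specific method from the works cited above) fits a simple transition model to logged transitions and then plans by generating imagined rollouts in that model instead of in the real environment.

    import numpy as np

    def fit_transition_model(states, actions, next_states):
        # Least-squares fit of s' ~ [s, a, 1] @ A; a stand-in for the deep
        # dynamics models learned from pixels in the works cited above.
        X = np.hstack([states, actions, np.ones((len(states), 1))])
        A, *_ = np.linalg.lstsq(X, next_states, rcond=None)
        return A

    def imagined_rollout(A, s0, action_seq):
        # Simulate forward using only the learned model: no real interaction.
        s, traj = s0, []
        for a in action_seq:
            s = np.hstack([s, a, 1.0]) @ A
            traj.append(s)
        return np.array(traj)

    def plan_first_action(A, s0, reward_fn, horizon=10, n_candidates=256, dim_a=1):
        # Random-shooting planner: score candidate action sequences under the
        # learned model and execute the first action of the best one.
        best_return, best_a0 = -np.inf, None
        for _ in range(n_candidates):
            seq = np.random.uniform(-1.0, 1.0, size=(horizon, dim_a))
            ret = sum(reward_fn(s) for s in imagined_rollout(A, s0, seq))
            if ret > best_return:
                best_return, best_a0 = ret, seq[0]
        return best_a0

A real system would replace the linear model with a learned deep network and the random search with a more sample-efficient planner, but the loop (fit a model, imagine, act) is the same.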
Although deep neural networks can make reasonable predictions in simulated environments over hundreds of time steps [10],
they typically require many samples to tune the large number
of parameters they contain. Training these models often requires
more samples (interaction with the environment) than simpler
models. For this reason, Gu et al. [19] train locally linear models for use with the NAF algorithm (the continuous equivalent of the DQN [47]) to improve the algorithm's sample complexity in the robotic domain, where samples are expensive. It seems
likely that the usage of deep models in model-based DRL could
be massively spurred by general advances in improving the data
efficiency of neural networks.
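
The structural trick in NAF is to restrict the advantage to a quadratic function of the action, so that the greedy continuous action is available in closed form. The snippet below shows only that parameterization, with hand-picked placeholder values standing in for the outputs of the network heads, which are omitted here.

    import numpy as np

    def naf_q_value(a, mu, L, V):
        # Q(s, a) = V(s) - 0.5 * (a - mu(s))^T P(s) (a - mu(s)), where
        # P(s) = L(s) L(s)^T is positive semidefinite, so argmax_a Q(s, a) = mu(s).
        P = L @ L.T
        d = a - mu
        return V - 0.5 * d @ P @ d

    # Placeholder outputs that a NAF network would predict for one state:
    mu = np.array([0.3, -0.1])                # greedy-action head
    L = np.array([[1.0, 0.0], [0.2, 0.8]])    # lower-triangular head
    V = 1.5                                   # state-value head

    assert naf_q_value(mu, mu, L, V) >= naf_q_value(mu + 0.1, mu, L, V)

Because the maximizing action is simply mu(s), Q-learning-style targets can be computed without an inner optimization over continuous actions.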

Exploration versus exploitation
One of the greatest difficulties in RL is the fundamental dilemma
of exploration versus exploitation: When should the agent try out
(perceived) nonoptimal actions to explore the environment (and
potentially improve the model), and when should it exploit the
optimal action to make useful progress? Off-policy algorithms,
such as the DQN [47], typically use the simple ε-greedy exploration policy, which chooses a random action with probability ε ∈ [0, 1], and the optimal action otherwise. By decreasing ε over time, the agent progresses toward exploitation. Although adding independent noise for exploration is usable in continuous control problems, more sophisticated strategies inject noise that is correlated over time (e.g., from stochastic processes) to better preserve momentum [44].
The observation that temporal correlation is important led
Osband et al. [56] to propose the bootstrapped DQN, which
maintains several Q-value "heads" that learn different values
through a combination of different weight initializations and
bootstrapped sampling from experience replay memory. At
the beginning of each training episode, a different head is chosen, leading to temporally extended exploration. Usunier et al.
[85] later proposed a similar method that performed exploration in policy space by adding noise to a single output head,
using zero-order gradient estimates to allow backpropagation
through the policy.
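
A tabular caricature of this head-per-episode scheme is sketched below (Python/NumPy; the separate Q-tables stand in for the shared-network, multihead architecture of the bootstrapped DQN, and the Bernoulli mask stands in for bootstrapped sampling from replay memory).

    import numpy as np

    class BootstrappedQ:
        def __init__(self, n_states, n_actions, n_heads=10, lr=0.1, gamma=0.99):
            # One Q-table per "head"; a DQN would share a body network instead.
            self.q = 0.01 * np.random.randn(n_heads, n_states, n_actions)
            self.lr, self.gamma = lr, gamma
            self.active = 0

        def start_episode(self):
            # Commit to one head for the whole episode, giving temporally
            # extended exploration rather than per-step dithering.
            self.active = np.random.randint(len(self.q))

        def act(self, state):
            return int(np.argmax(self.q[self.active, state]))

        def update(self, s, a, r, s_next, head_mask):
            # head_mask decides which heads learn from this transition
            # (the bootstrap part of the method).
            for k in np.flatnonzero(head_mask):
                target = r + self.gamma * self.q[k, s_next].max()
                self.q[k, s, a] += self.lr * (target - self.q[k, s, a])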
One of the main principled exploration strategies is the
upper confidence bound (UCB) algorithm, based on the principle of "optimism in the face of uncertainty" [36]. The idea
behind UCB is to pick actions that maximize E[R] + κσ[R], where σ[R] is the standard deviation of the return and κ > 0.
UCB therefore encourages exploration in regions with high
uncertainty and moderate expected return. While this is easily achieved in small tabular cases, the use of powerful density models has allowed this approach to scale to high-dimensional visual
domains with DRL [4].
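
In its simplest bandit form (sketched below in Python/NumPy; using the empirical standard error as the uncertainty term and an infinite bonus for untried actions are illustrative choices), UCB scores each action by its mean observed return plus κ times an uncertainty estimate and acts greedily on that score.

    import numpy as np

    def ucb_action(returns_per_action, kappa=2.0):
        # Score each action by E[R] + kappa * (uncertainty in E[R]); untried
        # actions get an infinite bonus so every action is sampled at least once.
        scores = []
        for rs in returns_per_action:
            if len(rs) == 0:
                scores.append(np.inf)
            else:
                scores.append(np.mean(rs) + kappa * np.std(rs) / np.sqrt(len(rs)))
        return int(np.argmax(scores))

    # Example: the second action has a lower mean return but far higher
    # uncertainty, so UCB still prefers it.
    print(ucb_action([[1.0, 1.1, 0.9, 1.0, 1.05], [0.5, 1.4]]))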
UCB can also be considered one way of implementing
intrinsic motivation, which is a general concept that advocates
decreasing uncertainty/making progress in learning about the
environment [68]. There have been several DRL algorithms that
try to implement intrinsic motivation via minimizing model prediction error [57], [75] or maximizing information gain [25], [49].
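
A minimal version of the prediction-error variant is sketched below (Python/NumPy; the linear forward model and the bonus scale β are illustrative stand-ins for the deep models and tuning used in the cited work): the agent receives an extra reward wherever its learned forward model is still inaccurate, pushing it toward poorly understood parts of the environment.

    import numpy as np

    class LinearForwardModel:
        # Toy stand-in for a learned (deep) forward model of the dynamics.
        def __init__(self, dim_s, dim_a, lr=0.01):
            self.W = np.zeros((dim_s + dim_a, dim_s))
            self.lr = lr

        def predict(self, s, a):
            return np.concatenate([s, a]) @ self.W

        def update(self, s, a, s_next):
            # One gradient step on the squared prediction error.
            x = np.concatenate([s, a])
            err = self.predict(s, a) - s_next
            self.W -= self.lr * np.outer(x, err)

    def intrinsic_reward(model, s, a, s_next, beta=0.1):
        # Curiosity bonus: large where the model is wrong, shrinking as it learns.
        err = model.predict(s, a) - s_next
        return beta * float(np.sum(err ** 2))

    # The agent then optimizes r_total = r_extrinsic + intrinsic_reward(...).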

Hierarchical RL
In the same way that deep learning relies on hierarchies of features, HRL relies on hierarchies of policies. Early work in this
area introduced options, in which, apart from primitive actions
(single time-step actions), policies could also run other policies (multi-time-step "actions") [79]. This approach allows top-level policies to focus on higher-level goals, while subpolicies
are responsible for fine control. Several works in DRL have
attempted HRL by using one top-level policy that chooses
between subpolicies, where the division of states or goals into subpolicies is achieved either manually [1], [34], [82] or automatically [2], [88], [89]. One way to help construct subpolicies is to focus on discovering and reaching goals, which are specific states in the environment; these may often be locations to
which an agent should navigate. Whether utilized with HRL or
not, the discovery and generalization of goals is also an important area of ongoing research [35], [66], [89].
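
The control flow of such an options-style hierarchy can be sketched as follows (Python; the Gym-style environment interface, the meta_policy callable, and the termination conditions are all assumed for illustration): the top level picks a subpolicy, the subpolicy issues primitive actions until its termination condition fires, and control then returns to the top level.

    class Option:
        # A subpolicy paired with its own termination condition.
        def __init__(self, policy_fn, termination_fn):
            self.policy_fn = policy_fn
            self.termination_fn = termination_fn

    def run_hierarchy(env, meta_policy, options, max_steps=1000):
        # The top level acts on a slower timescale than the primitive actions.
        s, total_return, t, done = env.reset(), 0.0, 0, False
        while t < max_steps and not done:
            option = options[meta_policy(s)]          # high-level decision
            terminated = False
            while not terminated and t < max_steps and not done:
                a = option.policy_fn(s)               # low-level primitive action
                s, r, done, _ = env.step(a)
                total_return += r
                terminated = option.termination_fn(s)
                t += 1
        return total_return

In goal-directed variants, meta_policy would instead emit a goal state and a single goal-reaching controller would play the role of the subpolicy.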

Imitation learning and inverse RL
One may ask why, if given a sequence of "optimal" actions
from expert demonstrations, it is not possible to use supervised
learning in a straightforward manner, a case of "learning
from demonstration." This is indeed possible and is known as
behavioral cloning in traditional RL literature. Taking advantage of the stronger signals available in supervised learning problems, behavioral cloning enjoyed success in earlier
