
reward signal is more adaptive and widely used in system stabilization. Recently, a goal representation heuristic dynamic programming (GrHDP) method [23], [24] has been proposed in the literature. By introducing an internal reward signal, this method can automatically and adaptively provide information to the agent instead of relying on hand-crafted signals. Generally, all these existing approaches require the environment to define the reward signal with a priori knowledge or domain expertise. This can be seen as an external teacher or trainer who provides the reward value or function for each action performed by the agent. However, what if such a teacher is unavailable or cannot provide direct feedback to the agent for some reason? Can the agent still learn by itself under the RL framework?
In this article, by considering the relationship between the goal and the system states and actions, we propose a method that enables the agent to learn with only the ultimate goal and no explicit external reward signal. To this end, the key contribution of this work is the development of a self-learning approach based on the ultimate goal, rather than on external supervision (reward signals) obtained directly from the environment. We further develop a computational formulation of this self-learning idea based on a specific ADP design, and validate its performance and effectiveness on a triple-link inverted pendulum case study. We would like to note that this article focuses on the self-learning of an agent's own goals, in contrast to observational learning or apprenticeship learning about others' goals, such as in inverse RL [25].
II. The Key Idea: Self-Learning Design

In contrast to the traditional RL/ADP design, in which a specific reward signal passes from the environment to the agent to indicate the effect of an action ("good" or "bad"), our proposed self-learning ADP method enables the agent to establish the reward signal itself, which is called the internal reward signal in this article. The agent-environment interaction in the traditional and the self-learning ADP designs is compared in Fig. 1. We can observe that, instead of receiving an immediate reward signal from the environment (Fig. 1(a)), the agent in the self-learning ADP method estimates an internal reward s(t) to help achieve the goal (Fig. 1(b)). Hence, the communication between the environment and the agent at each time step consists only of states and actions, which is fundamentally different from existing RL and ADP methods.
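To make this concrete, the following is a minimal sketch of the interaction loop in Fig. 1(b). The interface names (act, step, internal_reward, learn) are hypothetical and only meant to show what information crosses the agent-environment boundary; they are not the exact implementation described in this article.

```python
# Minimal sketch of the self-learning interaction loop in Fig. 1(b).
# Hypothetical interfaces: env.step() returns only the next state and a
# termination flag, never a reward; the agent estimates s(t) by itself.

def run_episode(agent, env, max_steps=1000):
    x = env.reset()                       # initial state x(0)
    for t in range(max_steps):
        a = agent.act(x)                  # action a(t) chosen from the state x(t)
        x_next, done = env.step(a)        # the environment replies with states only
        s = agent.internal_reward(x, a)   # internal reward s(t) estimated by the agent
        agent.learn(x, a, s, x_next)      # adaptation driven by the self-estimated signal
        x = x_next
        if done:
            break
```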
Note that, in the traditional RL/ADP design, the reward signal is used to define the agent's goal for a task. In the self-learning ADP method, however, no reward signal is available in the interaction. Since the reward signal reflects the evaluation of an action's effect, which is always paired with the goal, the agent in this learning process needs to learn what the reward signal should be according to the ultimate goal. The effect of the action is compared to the goal within a common reference frame in order to assess the achievement [26]. The agent thus learns by itself how "good" or "bad" the action is at each time step, guided by the ultimate goal. Then, based on the estimated internal reward signal and the system state, the agent generates the control action. That is to say, in order to achieve the ultimate goal, instead of learning to make the decision directly, the agent first needs to learn what the best reward signal is to represent the information upon which to base its action.
More specifically, the interaction between the agent and the environment happens sequentially, in discrete time steps. At each time step t, the agent selects an action a(t) according to the representation of the environment x(t). In consequence, the agent finds itself in a new state x(t+1) and then estimates the corresponding internal reward signal s(t+1) at the next time step. For example, when we train a battery-powered robot to collect trash, the robot decides at each time step whether it should move forward to search for more trash or find its way back to its battery charger. Its decision is based on its position and speed. As a consequence of its state and action, the robot estimates the reward signal s(t) = f(x(t), a(t)). Initially, the robot assigns a random value s(0), since no prior knowledge about what to do is available. After trial-and-error learning, however, we want the robot to learn how to represent the effect of an action in a given state.
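As a purely illustrative form of such an estimator, f could be a small, randomly initialized network over the state-action pair, so that its first output s(0) is essentially an arbitrary guess which trial-and-error learning must later shape toward the ultimate goal (the architecture and sizes below are assumptions for illustration only):

```python
import numpy as np

class InternalRewardEstimator:
    """Toy form of s(t) = f(x(t), a(t)): a one-hidden-layer network over the
    (state, action) pair. Sizes and architecture are illustrative only."""

    def __init__(self, state_dim, action_dim, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        # Random initial weights: before any learning, s(0) is an arbitrary value.
        self.W1 = rng.normal(scale=0.1, size=(hidden, state_dim + action_dim))
        self.W2 = rng.normal(scale=0.1, size=(1, hidden))

    def __call__(self, x, a):
        z = np.concatenate([np.atleast_1d(x), np.atleast_1d(a)])
        h = np.tanh(self.W1 @ z)      # hidden features of the state-action pair
        return (self.W2 @ h).item()   # scalar internal reward s(t)
```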
Let us stipulate that the agent's goal is to maximize the reward it estimates in the long term. If the reward sequence after time step t is s(t+1), s(t+2), s(t+3), ..., then the value function can be described as
V(t) = s(t) + γs(t+1) + γ²s(t+2) + γ³s(t+3) + ⋯
     = s(t) + γV(t+1),                                  (1)

where 0 ≤ γ ≤ 1 is the discount factor. The discount factor determines whether an immediate reward is more valuable than rewards received in the far future. If γ = 0, the value function is equal to the immediate internal reward s(t).
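A quick numerical illustration of Eq. (1), with made-up internal reward values: accumulating the rewards backward reproduces the recursive form V(t) = s(t) + γV(t+1), and setting γ = 0 leaves only the immediate reward s(t).

```python
def discounted_value(rewards, gamma):
    """Evaluate V(t) = s(t) + gamma*s(t+1) + gamma^2*s(t+2) + ... by running
    the recursion V(t) = s(t) + gamma*V(t+1) backward over a finite sequence."""
    V = 0.0
    for s in reversed(rewards):
        V = s + gamma * V
    return V

s_seq = [1.0, 0.5, -0.2, 0.3]                # hypothetical internal rewards s(t), s(t+1), ...
print(discounted_value(s_seq, gamma=0.9))    # full discounted sum
print(discounted_value(s_seq, gamma=0.0))    # gamma = 0: only the immediate reward s(t) = 1.0
```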

Figure 1 The conceptual diagram of the agent-environment interaction: (a) Traditional RL/ADP design: the external reinforcement signal r(t) passes from the environment to the agent; (b) Self-learning RL/ADP design: no external reinforcement signal is needed during this process, and the agent estimates an internal reward signal s(t) itself.
