IEEE Computational Intelligence Magazine - February 2023 - 87

direction in the gradient descent algorithm
by using a recurrent neural network
called long short-term memory
(LSTM) [41]. Wang et al. [42] used
LSTM to learn the hyper-parameters of
the commonly used training algorithm
Adam [43], and Chen et al. [44] used
an RNN to decide promising iterates for
derivative-free problems. Li et al. [15]
used RL to learn how to solve continuous
optimization problems.
The application of learning-to-optimize
is still in its infancy. Sharma et al.
proposed using deep Q-learning to
adaptively select operators from a pool
of mutation operators in a hybrid DE
[45]. In preliminary work for this paper
[8], Q-learning was used to tune the
switching time of HSES. This, along
with existing studies, shows that learning-to-optimize
can be advantageous
for the successful application of EAs in
the following ways. First, it can automate
the tuning/control of structural
parameters, which not only significantly
reduces the computational resources
required for tuning but also improves
the efficiency of tuning/control.
Second, it can inspire the
learning of new EAs.
C. Markov Decision Process and
Reinforcement Learning
Reinforcement learning (RL) is a key
technology in Artificial Intelligence. It
has been applied to solve different control
tasks, such as the game of Go [46], Atari
games [47], and robot control [48]. It can
be modeled as a Markov decision process
(MDP) [49], which is defined by the
tuple $(S, A, \mu_0, p, r, \pi, T)$, where $S \subseteq \mathbb{R}^D$
denotes the state space, $A \subseteq \mathbb{R}^d$ the
action space, $\mu_0$ the initial distribution of
the state, $r \in \mathbb{R}$ the reward, and $T$ the
time horizon. $D$ (resp., $d$) is the number
of dimensions of the state space (resp.,
action space), and is problem dependent.
At each time step $t$, $s_t \in S$ and $a_t \in A$ are
the current state and the action,
respectively. The policy is then defined
as $\pi : S \times A \to \mathbb{R}$, where $\pi(a_t \mid s_t; \theta)$ is
the probability of choosing action $a_t$
when observing $s_t$ with $\theta$ as the parameter.
$p(s_{t+1} \mid a_t, s_t)$ is the transition
probability.

HSES [17] is the winner algorithm in the CEC 2018
competition, in which univariate sampling and CMA-ES
are applied sequentially. ... In the following
discussion, how to control the switching time by
applying the proposed framework is presented.

Figure 1 shows a finite-horizon MDP.
Starting from an initial state $s_0$, an action
$a_0$ is taken based on the policy $\pi(a_0 \mid s_0; \theta)$.
The next state $s_1$ is observed according to the transition
probability $p(s_1 \mid a_0, s_0)$, and a reward $r_1$ is
obtained. This procedure is repeated until
the horizon limit $T$ is reached. The set
$\{s_0, a_0, r_1, \ldots, a_{T-1}, r_T, s_T\}$ is called a
trajectory.
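
As a minimal sketch of the procedure just described, the following Python function samples one trajectory from a finite-horizon MDP. The function names (`rollout`, `toy_policy`, `toy_transition`, `toy_reward`) and the random-walk environment are hypothetical illustrations and do not come from the paper; only the order of operations (act, observe the next state, receive a reward, repeat up to the horizon) follows the text.

```python
import random

def rollout(policy, transition, reward, s0, horizon, rng):
    """Sample one trajectory (s_0, a_0, r_1, ..., a_{T-1}, r_T, s_T)."""
    states, actions, rewards = [s0], [], []
    s = s0
    for _ in range(horizon):
        a = policy(s, rng)              # a_t drawn from pi(. | s_t)
        s_next = transition(s, a, rng)  # s_{t+1} drawn from p(. | a_t, s_t)
        r = reward(s, a, s_next)        # r_{t+1}
        actions.append(a)
        rewards.append(r)
        states.append(s_next)
        s = s_next
    return states, actions, rewards

# Toy environment (an assumption for illustration): a random walk on the integers.
def toy_policy(s, rng):
    return rng.choice([-1, 1])

def toy_transition(s, a, rng):
    return s + a

def toy_reward(s, a, s_next):
    return -abs(s_next)  # staying close to the origin is rewarded

rng = random.Random(0)
states, actions, rewards = rollout(toy_policy, toy_transition, toy_reward, 0, 5, rng)
```

With a horizon of $T = 5$, the sketch returns six states, five actions, and five rewards, matching the trajectory layout given above.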
The aim of RL is to find an optimal
policy $\pi$ such that the expectation of
the cumulative reward, i.e.,
$R(\tau) = \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^t r_{t+1}\right]$,
is maximized, where $\gamma$ is
a constant that controls the time decay.
Many RL methods have been developed
to handle different environments, such
as Q-learning for discrete action and
state space, deep Q-Learning (DQL) for
discrete action and continuous state
space, and the policy gradient for continuous
action and state space [16]. Note
that tabular Q-learning is applicable only to discrete
state spaces. Deep Q-learning
(DQL) was proposed to deal with
continuous state spaces [47]: a deep neural
network approximates the Q-function in
place of the Q-table, so that continuous
states can be handled. The
details of Q-learning and DQL are
provided in the Supplementary Materials.
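
The discounted cumulative reward and the tabular Q-learning step mentioned above can be sketched in a few lines of Python. This is the textbook form of the update, not necessarily the exact variant given in the Supplementary Materials; the learning rate `alpha`, the `terminal` flag, and the dictionary-based table are assumptions made for illustration.

```python
from collections import defaultdict

def discounted_return(rewards, gamma):
    """Compute sum_{t=0}^{T-1} gamma^t * r_{t+1} for rewards = [r_1, ..., r_T]."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def q_learning_update(q_table, s, a, r, s_next, actions, alpha, gamma, terminal=False):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_b Q(s', b) - Q(s, a)).
    """
    # No bootstrapping from the next state once the horizon has been reached.
    best_next = 0.0 if terminal else max(q_table[(s_next, b)] for b in actions)
    td_target = r + gamma * best_next
    q_table[(s, a)] += alpha * (td_target - q_table[(s, a)])
    return q_table[(s, a)]

# Q-table keyed by (state, action); unseen entries default to 0 (an assumption).
q = defaultdict(float)
first = q_learning_update(q, s=0, a=1, r=1.0, s_next=1, actions=(0, 1), alpha=0.5, gamma=0.9)
second = q_learning_update(q, s=0, a=1, r=1.0, s_next=1, actions=(0, 1), alpha=0.5, gamma=0.9)
```

A table indexed by `(state, action)` only works when both are discrete and hashable, which is the limitation that motivates replacing the table with a neural network in DQL.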
III. THE FRAMEWORK
A sequential hybrid EA is composed of
various EA phases. Each EA phase is
allocated a computational
budget. The timing of the switch from one
EA phase to another is important to a
hybrid EA's performance. The switching
time can be considered as a structural
parameter. This paper focuses on hybrid
EAs with two EA phases and presents a
general framework in which an intelligent
agent is employed to control the
switching time.
The proposed framework is summarized
in Algorithm 1. First, new solutions
are generated by using the first EA (i.e.,
$\mathrm{EA}_1(\cdot)$), where $Q$ is the number of fitness
evaluations used for implementing
$\mathrm{EA}_1(\cdot)$. That is, the algorithm judges
whether to switch after every $Q$ evaluations.
At each time $t$, $\mathrm{EA}_1(\cdot)$ takes the current
population and the algorithmic parameters
$\Gamma_{t-1}$ as input.¹ It outputs the new population $X_t$
and its function values $F_t$. Second, the
information collected so far by function
Collect() (line 7) is summarized by function
Represent() (line 8) to obtain the
current state $s_t$. Finally, the action $a_t$ is
taken based on the learned network
$Q(s_t, a; w)$ in line 9 (or Q-table $Q(s_t, a)$),
where $a$ represents the action, which takes
its value from the domain $A$. In this paper,
$A = \{0, 1\}$. Depending on $a_t$, the search
process decides whether or not to switch from
$\mathrm{EA}_1$ to $\mathrm{EA}_2$ (lines 11 to 14). If
switching happens, $\mathrm{EA}_2$ is run
with the remaining computing resources
$\text{maxNFEs} - tQ$, where maxNFEs is the
maximum number of fitness evaluations
for this hybrid algorithm. Thus $T$ is
always set smaller than
$\lfloor \text{maxNFEs}/Q \rfloor$, which
guarantees that some computing resources
are left for $\mathrm{EA}_2$. Notice that a DQN
$Q(s_t, a_t; w)$ is used in Algorithm 1 to
represent knowledge. Substituting
$Q(s_t, a_t; w)$ with a Q-table $Q(s_t, a_t)$
yields a framework based on Q-learning.
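
The control loop described in this section can be sketched as follows. This is an illustrative reading of the text, not a reproduction of Algorithm 1: every function name (`run_hybrid`, `toy_ea1_step`, `toy_ea2_run`, `toy_represent`, `toy_q_values`) is hypothetical, the two toy EAs are simple stand-ins, the encoding "action 1 means switch" is an assumption (the text only states that the action domain is {0, 1}), and the sketch assumes that the second EA starts at the horizon if no switch was chosen earlier.

```python
import random

def run_hybrid(ea1_step, ea2_run, represent, q_values, population,
               max_nfes, evals_per_step, horizon):
    """Run EA1 in chunks of `evals_per_step` evaluations; after each chunk,
    a greedy policy over q_values(state) decides whether to switch to EA2."""
    # The horizon must stay below floor(maxNFEs / Q) so EA2 keeps some budget.
    assert 1 <= horizon < max_nfes // evals_per_step
    history = []
    for t in range(1, horizon + 1):
        population, fitness = ea1_step(population, evals_per_step)
        history.append((population, fitness))   # plays the role of Collect()
        state = represent(history)              # plays the role of Represent()
        q_stay, q_switch = q_values(state)      # Q(s_t, 0) and Q(s_t, 1)
        if q_switch > q_stay:                   # greedy action a_t = 1
            break
    remaining = max_nfes - t * evals_per_step   # budget left for EA2
    return ea2_run(population, remaining), t, remaining

def sphere(x):
    return sum(v * v for v in x)

rng = random.Random(1)

def toy_ea1_step(pop, n_new):
    # Stand-in for EA1: sample n_new uniform points, keep the best len(pop).
    cand = pop + [[rng.uniform(-5, 5) for _ in pop[0]] for _ in range(n_new)]
    cand.sort(key=sphere)
    new_pop = cand[:len(pop)]
    return new_pop, [sphere(x) for x in new_pop]

def toy_ea2_run(pop, budget):
    # Stand-in for EA2: greedy Gaussian perturbation of the best point.
    best = min(pop, key=sphere)
    for _ in range(budget):
        trial = [v + rng.gauss(0, 0.1) for v in best]
        if sphere(trial) < sphere(best):
            best = trial
    return best

def toy_represent(history):
    return len(history)  # state = number of EA1 chunks run so far

def toy_q_values(state):
    # Stand-in for a learned Q-table: prefer switching from the third chunk on.
    return (1.0, 0.0) if state < 3 else (0.0, 1.0)

pop0 = [[rng.uniform(-5, 5) for _ in range(2)] for _ in range(4)]
best, t_switch, remaining = run_hybrid(toy_ea1_step, toy_ea2_run, toy_represent,
                                       toy_q_values, pop0, 1000, 100, 8)
```

In this toy run the stand-in Q-values trigger the switch after the third chunk, so the second phase receives 1000 - 3 * 100 = 700 of the budget. Replacing `toy_q_values` with a trained network or a learned table is where the actual learning would enter.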
FIGURE 1 Illustration of a finite-horizon Markov decision process.
¹ $\Gamma_t$ can be either time-invariant or time-variant. Time-variant
parameters might be updated by some
adaptive schemes, but not by Q-learning.