many RL systems, LCS were also originally applied to sequential learning
problems, in which an agent interacts
with its environment and learns through
trial and error and through receiving
feedback from the environment. Second,
the rule evaluation in LCS borrows
heavily from the RL literature. For
instance, Holland used the bucket brigade, an epochal RL algorithm, to measure the goodness of classifiers in the early implementations of classifier systems
[35]. Later incarnations of LCS leveraged contemporary developments in RL research and borrowed more advanced techniques, such as temporal-difference learning [36] and
Q-learning [37], to evaluate rule quality.
The goal of RL algorithms, in general, is to find an optimal policy, and its valuation, to guide an agent's behavior in an environment based on the feedback or reward signal received from the environment. Often, neural networks are
used to approximate value functions
when an exhaustive policy search is not
practical, as in the case of high-dimensional or continuous problems. From
that perspective, LCS provide an alternative RL mechanism that represents both the policy, in an interpretable rule format, and a rule-based approximation to the value function. Finally, the
LCS framework allows for learning the
policy in an online fashion, making it
especially suitable for real-time applications and many computer games. See
[38], [39] for a detailed comparison of
the two approaches.
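To make the borrowed mechanics concrete, the following is a minimal, illustrative sketch, in Python, of how a single rule's payoff prediction can be refined with a Q-learning-style update; the Rule class, the function name, and the parameter values are hypothetical and not drawn from any specific system in the survey.

# Hypothetical sketch: Q-learning-style evaluation of a rule's payoff
# prediction, in the spirit of the techniques LCS borrow from RL.

class Rule:
    def __init__(self, condition, action, prediction=10.0):
        self.condition = condition    # e.g., a string such as "01#2#####"
        self.action = action          # e.g., a board position to play
        self.prediction = prediction  # estimated payoff for this action

def update_prediction(rule, reward, max_next_prediction, beta=0.2, gamma=0.71):
    # Move the prediction toward the discounted Q-learning-style target.
    target = reward + gamma * max_next_prediction
    rule.prediction += beta * (target - rule.prediction)  # Widrow-Hoff step
    return rule.prediction

Here beta plays the role of a learning rate and gamma the discount factor, exactly as in standard Q-learning.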
Traditionally, LCS models have been
classified into two broad categories: the
systems that were developed following
Holland's work, known as Michigan-style LCS; and the approach popularized by De Jong and Smith [40], [41], known
as Pittsburgh-style LCS. A major difference between the two approaches
emerged based on their solution representation. In Michigan-style LCS, a classifier consists of a single condition-action rule and a set of parameters that keep track of the rule's quality and other statistics. Consequently, a classifier in these systems contributes only partially to the entire solution, or knowledge base.
A complete solution is represented by
the set of all classifiers or the entire rule
population. Michigan-style LCS are
incremental learners; that is, they build
their models online during interaction
with the environment. Consequently, the rule discovery algorithm, generally a GA, works in a non-generational fashion and often within the niches defined by the rules whose conditions match an environmental state or input.
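For illustration, a classifier in an XCS-like Michigan-style system is often implemented along the following lines; the attribute names below are typical, but the exact bookkeeping varies between implementations, so this should be read as a sketch rather than a specification.

from dataclasses import dataclass

@dataclass
class Classifier:
    condition: str        # e.g., "0#1######" over the alphabet {0, 1, 2, #}
    action: int           # advocated action, e.g., a board position 1-9
    prediction: float     # expected payoff when the action is taken
    error: float          # running estimate of the prediction error
    fitness: float        # relative rule quality used by the GA
    experience: int = 0   # number of times the rule has been updated
    numerosity: int = 1   # copies subsumed into this (macro)classifier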
A major distinction between different types of Michigan-style LCS is
made based on how the fitness of classifiers is calculated in the system. The
strength-based LCS relate a classifier's
fitness directly to the accumulated payoff or reward it receives from the environment for its advocated action. In
contrast, the accuracy-based LCS compute a classifier's fitness based on the
accuracy of reward prediction, thereby
allowing the system to evolve a complete action map of the problem. This
subtle difference in fitness calculation has been shown to have a significant effect on system performance and the evolutionary dynamics [42]. Holland's original implementation [35], known as
CS-1, was based on this latter approach.
However, in his revised implementation
[43], which became known as the payoff-based LCS, he replaced the accuracy-based fitness with strength-based fitness, among other changes. Several
LCS variants were later introduced,
building on Holland's work (see [20]
for a recent historical review). However,
after the initial appreciation of Holland's work, LCS remained largely overlooked until Stewart Wilson's work in the 1990s revived the field.
Wilson, leveraging the advancements
in the RL literature, introduced two different LCS variants: ZCS [23] and XCS [44]. XCS, in particular, incorporated a purely accuracy-based fitness computation and a Q-learning-like mechanism for updating reward predictions, among other features. These changes helped overcome some key performance bottlenecks in the payoff-based LCS, such as the proliferation of
over-general classifiers in the population.
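To make the distinction concrete, the sketch below contrasts a strength-based and an accuracy-based fitness update for a single classifier; the parameter names (beta, epsilon_0, alpha, nu) follow common XCS conventions, but the equations are simplified and the final normalization of accuracy across the action set is omitted.

# Simplified contrast between strength-based and accuracy-based fitness.

def strength_update(fitness, reward, beta=0.2):
    # Strength-based LCS: fitness tracks the payoff the classifier
    # accumulates for its advocated action.
    return fitness + beta * (reward - fitness)

def accuracy_update(prediction, error, reward,
                    beta=0.2, epsilon_0=10.0, alpha=0.1, nu=5.0):
    # Accuracy-based LCS (XCS-like): fitness derives from how accurately
    # the classifier predicts its payoff, not from the payoff itself.
    prediction += beta * (reward - prediction)           # payoff prediction
    error += beta * (abs(reward - prediction) - error)   # prediction error
    if error < epsilon_0:
        accuracy = 1.0                                   # "accurate enough"
    else:
        accuracy = alpha * (error / epsilon_0) ** (-nu)  # scaled accuracy
    # In XCS, fitness is then moved toward this accuracy relative to the
    # other classifiers in the action set (omitted here for brevity).
    return prediction, error, accuracy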
XCS rejuvenated the field of LCS
research and became the most popular
and successful LCS. The current survey corroborates this observation and identifies XCS as the most widely adopted LCS in the games literature, followed by
Holland's LCS, ZCS, and a small number of uses of Pittsburgh-style LCS.
Below, we provide a brief explanation of
XCS, focusing on its main algorithmic
components and its operation, using a simple game example.
Figure 2 depicts a working cycle of
an XCS. In a Tic-tac-toe game, for
instance, an input to the system might
consist of the current board configuration. A ternary vector of length nine
could be used to model this environment. Here, the vector length corresponds to the number of boxes or
positions on the board, each of which can either be empty or contain one of the two relevant symbols (i.e., a naught or a cross). A ternary alphabet (e.g., {0, 1, 2}) could
be used to map the three possible values
for each position on the board. Accordingly, a quaternary vector of length nine could be used to represent the condition part of a classifier, where the extra literal allows generalization using a don't-care symbol (e.g., #). The action part of the classifier could be represented using a single integer specifying the next move ([1-9]), that is, placing a naught or a cross (depending upon which symbol is allocated to each competitor) on one of the available board slots. The population of classifiers ([P]) could be initialized randomly to a fixed size or incrementally, using the covering operator. Upon receiving an
input, a match set ([M]) is formed by all
classifiers in the population that match
the current input. A matching function
can be defined based on an exact match
or using some distance criterion [45].
The covering operator is activated if no
matching classifiers are found in the
existing population.
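As an illustration of these steps, the sketch below implements exact matching, match-set formation, and covering under the ternary/quaternary Tic-tac-toe encoding assumed above; the function names, the don't-care probability p_hash, and the initial parameter values are illustrative rather than part of any standard API.

import random

DONT_CARE = "#"
ACTIONS = range(1, 10)   # board positions 1-9

def matches(condition, state):
    # Exact matching: each condition symbol must equal the corresponding
    # state symbol or be the don't-care symbol.
    return all(c == DONT_CARE or c == s for c, s in zip(condition, state))

def form_match_set(population, state):
    # [M]: all classifiers whose condition matches the current board state.
    return [cl for cl in population if matches(cl["condition"], state)]

def cover(state, used_actions, p_hash=0.33):
    # Covering: build a classifier that matches the current state, with each
    # position generalized to '#' with probability p_hash, advocating an
    # action not already represented in [M].
    condition = "".join(DONT_CARE if random.random() < p_hash else s
                        for s in state)
    action = random.choice([a for a in ACTIONS if a not in used_actions])
    return {"condition": condition, "action": action,
            "prediction": 10.0, "error": 0.0, "fitness": 0.01}

# Example: "010200000" encodes one cross (1), one naught (2), and empties (0).
state = "010200000"
population = [cover(state, set()) for _ in range(3)]
match_set = form_match_set(population, state)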
The covering operator creates one or more classifiers matching the current input, with a given probability of introducing don't-cares in the condition. Next, an action to be executed is selected from all the actions advocated by the classifiers in [M]. A commonly used scheme is to