IEEE Computational Intelligence Magazine - August 2019 - 11

and fully connected (FC) layers perform computations that depend on the activations in the input volume as well as the parameters (neuron weights and biases). These parameters are trained with gradient descent such that the class scores output by the CNN are consistent with the labels in the training data. Other layers, such as (Leaky) Rectified Linear Unit ((L)ReLU) layers and pooling (POOL) layers, implement fixed functions and have no trainable parameters. Figure 1 shows a simple CNN architecture for analyzing small RTS game maps, annotated with short descriptions for each layer. The layers we use in this paper are:

Input: the game state for 8 × 8 to 128 × 128 maps in several channels, or planes, similar to the 3 RGB channels of an image. Each of these planes corresponds to a different game feature, such as unit type, health, or terrain.

Convolution: composed of a set of learnable filters, each connecting only to a local (across width and height) neighborhood that extends through the full depth dimension.

LReLU: leaky rectified linear units compute the piece-wise linear function f(x) = ax if x < 0, and x otherwise.

Fully Connected Layer: every input node is connected to every output node.

Global Average: the average value across a plane; no learned weights.

Softmax: non-linearly normalizes the output vector so that each value lies between 0 and 1 and all values add up to 1.

D. Neural Network Training

In the original 2D image recognition tasks to which CNNs were first applied, supervised learning was used on labeled images. Training was accomplished by stochastic gradient descent to minimize a loss function. In multi-step tasks, in which labeling is harder, RL [24] can be used. In RL, an agent iteratively observes the environment, takes an action, and observes a reward for that action. RTS game domains are hard RL problems because the environment is partially observable, action results can be stochastic, and rewards are delayed.

To train our networks we use supervised learning and standard RL algorithms such as Deep Q-Network (DQN) learning [22] and the Asynchronous Advantage Actor-Critic (A3C) algorithm [25]. DQN is a sample-efficient algorithm that uses a replay buffer to store past experiences (tuples containing a state, the action chosen, the reward received, and the next state). Batches of random, and hopefully uncorrelated, experiences are drawn from the buffer and used for updates, forcing the network to generalize beyond what it is currently doing in the environment. In A3C, multiple agents interact with the environment independently and simultaneously, and their diverse experiences are used to update a global, shared neural network. The main advantage of A3C over DQN is its capacity for massive parallelization, which reduces training time nearly linearly with respect to the number of agents working in parallel [25]. The agents' uncorrelated experiences serve a similar function as the DQN experience replay and improve learning stability, although at the cost of being less sample efficient. For more details we refer interested readers to surveys on deep reinforcement learning such as [26].

III. Learning State Evaluation and Strategy Selection in μRTS

In this section, we demonstrate how supervised machine learning techniques can be used to estimate state values and learn whole-game playing strategies. We also show how strategy learning and tactical search can be combined to achieve greater playing performance. Finally, we describe an attempt to lift the dependency on labelled data by means of RL, which, albeit eventually failing, inspired us to investigate learning in RTS games without forward models and at a smaller problem scale (combat), which will be reported in Section IV.

A. μRTS

In the experiments reported later, we use μRTS [3], a simple RTS game designed for testing AI techniques. μRTS contains most features of a standard RTS game while keeping things simple: it supports four unit and two building types, all of them occupying one tile each, and there is only one resource type. μRTS supports configurable map sizes, commonly ranging from 8 × 8 to 128 × 128 tiles, and full observability is an option we have chosen for our experiments. The μRTS software repository features a few basic scripted players, as well as search-based players implementing several state-of-the-art RTS search techniques [8], [12], [27], making it a useful tool for benchmarking new AI algorithms.

B. Supervised State Value Learning

Evaluation (or value) functions are commonly used in game-playing programs to estimate the value of a position for a given player. They map a state to a single number, usually representing either the probability of winning or the expected payoff difference between the players. The purpose of the neural network we describe here is to approximate the value function v(S), which represents the win-draw-loss outcome of the game starting in state S.

1) Architecture

A common network input format, inspired by AlphaGo's design [21], is a set of 2-dimensional binary planes that represent the state. Table I shows a μRTS example. The first six planes register the positions of all units: each plane contains a 1 where there is a unit of a particular type, and 0s everywhere else. The next five layers contain a 1-hot encoding of the unit's
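A binary-plane input of this kind can be sketched in a few lines of NumPy. This is a minimal illustration only; the unit-type names and plane ordering below are assumptions for the example, not the exact μRTS layout from Table I:

```python
import numpy as np

# Hypothetical unit/building types; the actual plane order is illustrative.
UNIT_TYPES = ["base", "barracks", "worker", "light", "heavy", "ranged"]

def encode_units(units, size=8):
    """Encode unit positions as one binary plane per unit type.

    `units` is a list of (unit_type, x, y) tuples; returns an array of
    shape (len(UNIT_TYPES), size, size) with a 1 at each occupied tile
    and 0s everywhere else.
    """
    planes = np.zeros((len(UNIT_TYPES), size, size), dtype=np.float32)
    for unit_type, x, y in units:
        planes[UNIT_TYPES.index(unit_type), y, x] = 1.0
    return planes

# A worker at (1, 2) and a base at (0, 0) on an 8 x 8 map.
state = encode_units([("worker", 1, 2), ("base", 0, 0)])
```

Additional planes for health, resources, or terrain would be stacked along the first axis in the same way before being fed to the network input layer.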

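The fixed-function layers listed earlier (LReLU, global average, softmax) have simple closed forms. A small NumPy sketch, with the leak slope a = 0.01 chosen as an illustrative default rather than a value from the paper:

```python
import numpy as np

def leaky_relu(x, a=0.01):
    # f(x) = a*x for x < 0, and x otherwise; the slope a is a hyperparameter.
    return np.where(x < 0, a * x, x)

def global_average(planes):
    # One output value per plane: the mean over width and height. No weights.
    return planes.mean(axis=(-2, -1))

def softmax(z):
    z = z - z.max()     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()  # each output lies in (0, 1); all outputs sum to 1
```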
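The DQN replay buffer described in Section D stores (state, action, reward, next state) tuples and samples uncorrelated batches for updates. A minimal sketch, where the capacity and batch size are illustrative, not the values used in the experiments:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity=10000):
        # deque with maxlen evicts the oldest experience once full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive experiences.
        return random.sample(self.buffer, batch_size)
```

During training, the agent would call `add` after every environment step and periodically draw a batch via `sample` to compute gradient updates.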


