

We will define a script as a hand-coded function that takes a game
state and returns a set of actions to be performed. This script
could be built with any of the standard commercial game AI
techniques, such as Finite-State Machines (FSMs), Behavior Trees
(BTs), or Utility Systems.
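To make this interface concrete, the following minimal sketch shows a
toy script in Python; the class and function names (GameState, Action,
worker_rush) are illustrative placeholders rather than the actual
μRTS API.

from dataclasses import dataclass, field
from typing import List, Optional

# All names below are illustrative placeholders, not a real game API.

@dataclass
class Unit:
    uid: int
    x: int
    y: int

@dataclass
class GameState:
    my_units: List[Unit] = field(default_factory=list)
    enemy_units: List[Unit] = field(default_factory=list)

@dataclass
class Action:
    unit_id: int
    command: str                 # e.g. "move", "attack", "harvest"
    target: Optional[int] = None

def worker_rush(state: GameState) -> List[Action]:
    """Toy script: every friendly unit attacks its nearest enemy.
    A real script could be an FSM, a behavior tree, or a utility
    system; the only contract is: game state in, set of actions out."""
    actions = []
    for unit in state.my_units:
        if state.enemy_units:
            target = min(state.enemy_units,
                         key=lambda e: abs(e.x - unit.x) + abs(e.y - unit.y))
            actions.append(Action(unit.uid, "attack", target.uid))
    return actions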
with Puppet Search applied to μRTS. The strength of the AI player
using the network is close to that of the player using search, while
using only a fraction of the time and not requiring a forward model
that allows us to accurately simulate action effects.
D. Combining Learned Strategy Selection with Tactical Search

A proposed technique for reducing search complexity in RTS games is
to use action and/or state abstractions [8] that capture essential
game properties. High-level abstractions, such as build orders, can
often lead to good strategic decision making, but tactical decision
quality (i.e., which concrete action to take at a specific point in
time) may suffer due to lost details. A competing method is to sample
the search space [12], which often leads to good tactical performance
in simple scenarios but poor strategic planning. So, why not combine
both ideas to generate tactically sound strategic actions?
For this purpose, we conducted experiments in μRTS
(which allows look-ahead search) comparing the playing
strength of a fully search-based agent with a hybrid player
based on a policy network and tactical search [27]. The search-based agent chooses among high-level scripts and evaluates
script choice sequences tactically using an MCTS variant. The
hybrid player uses a policy network to select a script to execute
next and utilizes tactical MCTS to execute the selected script.
The policy network was trained using the search-based agent as
described above. It performs slightly worse than the search
algorithm but selects scripts much faster. The gained time can
then be used to refine script actions tactically. In a round-robin
tournament between these two algorithms, with 60 different
starting positions per match-up, the hybrid player won 56.7%
of the games against the search-based one. The overall winning
rates of the hybrid and search-based players were 88.3% and
84.0%, respectively. The full experimental details and results are
reported in [27].
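The following sketch illustrates one plausible decision cycle of such
a hybrid player; the callables policy_network and tactical_mcts and
the list of scripts are hypothetical stand-ins and do not reflect the
exact implementation evaluated in [27].

def choose_script(policy_network, state, scripts):
    """Pick the next high-level script greedily from the policy
    network's script probabilities."""
    probs = policy_network(state)      # one probability per script
    best = max(range(len(scripts)), key=lambda i: probs[i])
    return scripts[best]

def hybrid_step(state, policy_network, tactical_mcts, scripts):
    """One decision cycle of the hybrid player:
    1) the policy network selects a script (fast, no forward model),
    2) the selected script proposes actions for our units,
    3) tactical MCTS refines those actions in the local combat
       situation."""
    script = choose_script(policy_network, state, scripts)
    proposed_actions = script(state)
    refined_actions = tactical_mcts(state, proposed_actions)
    return refined_actions

In a game loop, hybrid_step would be called whenever new unit commands
are needed, with the time saved by the fast script selection spent
inside the tactical search.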
E. Reinforcement Learning Attempt

In an effort to eliminate our need for a forward model to label data
based on search, we implemented double DQN [29] with experience
replay [22] and used our value network to provide a virtual reward in
addition to the final game result. However, in all of our experiments
the network converged to always selecting the script with the best
average performance, regardless of the game state. Apart from
possible implementation issues and parameter updates getting stuck in
local extrema, we believe there are two issues that complicate
scripted strategy learning in full RTS games. First, the rewards are
very sparse, as they are only generated when a match ends, often
after thousands of simulation frames. Second, the choice of actions
near the end of the game is mostly inconsequential, because any
action (script choice) will be able to successfully finish a game
that is very nearly won, and conversely, if the position is very
disadvantageous, no script choice will be able to reverse it.
Moreover, when action choices matter most, at the beginning of the
game, the virtual rewards generated by the value network are very
small and noisy. We attempted to overcome this complication by
starting the learning process with endgame scenarios and slowly
adding earlier states until full games were played, but without
success. We intend to revisit this problem in future work.
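As an illustration, one way the sparse terminal reward could be
combined with a virtual reward from the value network inside a
double-DQN target computation is sketched below; the interfaces
(value_net, q_net, target_net), the potential-difference shaping form,
and the mixing weight beta are assumptions made for exposition, not
our exact setup.

import numpy as np

def shaped_reward(value_net, state, next_state, final_result, done,
                  beta=0.1):
    """Sparse terminal reward plus a virtual reward derived from the
    value network's evaluation of the state transition (assumed form)."""
    if done:
        return final_result            # e.g. +1 win, -1 loss, 0 draw
    return beta * (value_net(next_state) - value_net(state))

def double_dqn_target(q_net, target_net, reward, next_state, done,
                      gamma=0.99):
    """Double DQN target: the online network selects the best next
    script, the target network evaluates that selection."""
    if done:
        return reward
    best_script = int(np.argmax(q_net(next_state)))
    return reward + gamma * target_net(next_state)[best_script]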

IV. Case Study: Using Deep RL in Total War: Warhammer

The methods presented in the previous section are based on search
algorithms and thus have the drawback of requiring forward models,
which are rarely readily available or easy to design and adjust. In
this section, we present a case study that uses RL as an end-to-end
approach that does not rely on a forward model and apply it to a
modern video game, Total War: Warhammer (TW). Compared to μRTS and
even StarCraft,
battles in TW pose more difficult challenges because complex
movement and targeting are affected by additional game
mechanics: units can get tired, a defender's orientation with
respect to its attacker is crucial, there are effects that reward
charging an enemy but discourage prolonged melee (cavalry)
or the other way around (pike infantry), and morale is often
more important for achieving victory than troop health. As a
consequence, strategies such as pinning, refusing a flank while
trying to overpower the other flank, and rear attacks on the
enemy are required. Cooperation between units is much more
important than in games such as StarCraft, and more complex
behaviors have to be learned to defeat even the weak built-in
game AI.
A screenshot of a typical TW battle can be seen in Figure 4.
The goal of this internship project at Creative Assembly was to
learn control policies for agents in cooperative/competitive
environments such as TW battles. A diverse set of behaviors is
required, such as deciding when to pull out from a melee fight
and when to switch targets, positioning the ranged units,
avoiding crossing the paths of allied units, maintaining a coordinated
unit front when approaching the enemy, or pinning enemies
while creating superiority elsewhere. RL, in particular, is a natural fit for learning adaptive, autonomous, and self-improving behavior in a multi-agent setting. CNNs have made
it possible to extract high-level features from raw data, which
enabled RL algorithms such as Q-learning to master difficult
control policies without the help of hand-crafted features and
with no tuning of the architecture or hyper-parameters for
specific games [30].
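The sketch below shows a minimal convolutional Q-network of the kind
referenced above, mapping a grid-shaped observation to one Q-value per
discrete action; the input encoding, layer sizes, and action count are
illustrative assumptions rather than an architecture used for TW.

import torch
import torch.nn as nn

class ConvQNet(nn.Module):
    """Small CNN that turns a (channels x height x width) observation
    into one Q-value per discrete action."""
    def __init__(self, in_channels: int, n_actions: int, grid: int = 16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * grid * grid, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(obs))

# Example: q_values = ConvQNet(in_channels=8, n_actions=10)(torch.zeros(1, 8, 16, 16))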


