IEEE Computational Intelligence Magazine - November 2019 - 23

2.1.1. Single-Agent Reinforcement Learning
In [8], Boyan et al. proposed the Q-routing algorithm for
optimizing packet routing control. In the Q-routing algorithm, each router updates its policy according to its Q-function based on local information and communication. The
experiments showed that Q-routing offered more efficient
performance than the nonadaptive shortest path algorithm,
especially under a high workload. In [9], Choi et al. proposed
a memory-based Q-learning algorithm called predictive
Q-routing to increase the learning rate and convergence
speed by retaining past experiences. In addition, in [10],
Kumar et al. proposed dual reinforcement Q-routing (DRQ-routing), which uses information gained through backward
and forward exploration to accelerate the convergence speed.
In [11], [12], Reinforcement Learning (RL) was successfully
applied in wireless sensor network routing, where the sensors
and sink nodes could self-adapt to the network environment.
However, in a multiagent system, single-agent RL suffers
from severe non-convergence. Instead, applying multiagent
RL to improve the cooperation among network nodes is
more feasible, and there have been a series of works on ML-driven routing based on multiagent RL.
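The Q-routing update described for [8] can be illustrated with a minimal sketch. It assumes the commonly cited form of the rule, in which a node's estimate of the delivery time to a destination via a neighbor is moved toward the sum of the local queueing delay, the link delay, and the neighbor's own best estimate. The class name QRouter, its method names, and the default learning rate are hypothetical choices made for illustration and are not taken from [8].

```python
from collections import defaultdict


class QRouter:
    """Hypothetical per-node Q-table: estimated delivery time per (dest, neighbor)."""

    def __init__(self, neighbors, learning_rate=0.5):
        self.neighbors = list(neighbors)
        self.learning_rate = learning_rate  # assumed value, not from the source
        # q[dest][neighbor] -> estimated delivery time; unseen entries start at 0.0
        self.q = defaultdict(lambda: {n: 0.0 for n in self.neighbors})

    def best_estimate(self, dest):
        """Smallest estimated delivery time to dest over all neighbors."""
        return min(self.q[dest].values())

    def choose_next_hop(self, dest):
        """Greedy policy: forward to the neighbor with the lowest estimate."""
        row = self.q[dest]
        return min(row, key=row.get)

    def update(self, dest, neighbor, queue_delay, link_delay, neighbor_estimate):
        """Move the estimate toward the delay observed via this neighbor.

        neighbor_estimate is the neighbor's own best_estimate(dest), which it
        reports back after receiving the packet (local communication only).
        """
        target = queue_delay + link_delay + neighbor_estimate
        old = self.q[dest][neighbor]
        self.q[dest][neighbor] = old + self.learning_rate * (target - old)
        return self.q[dest][neighbor]
```

In this sketch, a node calls update each time a neighbor acknowledges a packet, so a congested neighbor gradually looks more expensive and choose_next_hop shifts traffic elsewhere, which is the adaptive behavior the text contrasts with a nonadaptive shortest path algorithm.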
2.1.2. Multiagent Reinforcement Learning
In [13], [14], Stone et al. proposed the Team-Partitioned
Opaque-Transition RL (TPOT-RL) routing algorithm, which
allows a team of network nodes working together toward a
global goal to learn how to perform a collaborative task. In
[15], Wolpert et al. designed a sparse reinforcement learning
algorithm named the Collective Intelligence (COIN) algorithm, in which a global function is applied to modify the
behavior of each network agent. In contrast, the author of [16]
proposed a Collaborative RL (CRL)-based routing algorithm
with no single global state. The CRL approach was also successfully applied for delay-tolerant network routing in [17].
However, in an inherently distributed system, state synchronization among all routers is extremely difficult, especially with
increasing network size, speed, and load. With the development
of SDN technology, centralized AI-driven routing strategies
have received considerable attention.

For each flow, the controller updated the optimal routing strategy based on the QoS requirements and issued the forwarding
table to each node along the forwarding path. In [20], Wang et
al. proposed an RL-based routing algorithm for Wireless Sensor
Networks (WSNs) named AdaR. In AdaR, Least-Squares Policy Iteration (LSPI) is implemented to achieve the correct tradeoff among multiple optimization goals, such as the routing
path length, load balance, and retransmission rate. However, the
overhead incurred for centralized AI control is high.
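The SARSA algorithm that [19] applies to QoS-aware routing is an on-policy temporal-difference method. The sketch below shows only the generic SARSA update with an epsilon-greedy policy. The mapping to routing (state as the current node, action as the next hop, reward as a negative QoS cost) is an assumption made for illustration, and the function names and default parameters are hypothetical rather than taken from [19].

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """On-policy TD update: bootstrap from the action actually taken next."""
    td_target = reward + gamma * q[(next_state, next_action)]
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Usage under the assumed mapping: nodes are states, next hops are actions,
# and the reward is the negative of a measured QoS cost such as delay.
q_table = defaultdict(float)
rng = random.Random(0)
hop = epsilon_greedy(q_table, "n1", ["n2", "n3"], epsilon=0.1, rng=rng)
next_hop = epsilon_greedy(q_table, hop, ["n4"], epsilon=0.1, rng=rng)
sarsa_update(q_table, "n1", hop, reward=-2.0,
             next_state=hop, next_action=next_hop)
```

Unlike Q-learning, the target here uses the value of the next action chosen by the current policy rather than the maximum over actions, so exploration choices feed back into the learned estimates.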
3. AI-Driven Network Routing

In this section, we first propose a three-layer logical functionality
architecture for AI-driven networking. Then, we discuss the problem of how far away the intelligent control plane can be located
from the forwarding plane ("centralized" or "distributed").
3.1. Closed-Loop Control Paradigm

In a traditional network, the network layer functionality can
be divided into the forwarding plane and the control plane.
However, with the introduction of AI&ML, this two-layer
architecture cannot effectively describe the logic of intelligent system operation. In this paper, inspired by the closed-loop mechanism of the learning process of the human brain
("observation - judgment - action - learning"), we split the
functionality of AI-based networking into three layers to


2.1. Decentralized Routing

In recent years, with the great success of machine
learning, applications of Artificial Intelligence and
Machine Learning (AI&ML) in networking have
received considerable attention.


can be traced back to the 1990s. In this section,
we review the related work on AI-driven network routing algorithms.


2.2. Centralized Routing
In [18], Stampa et al. proposed a deep RL (DRL) algorithm for
optimizing routing in a centralized knowledge plane. Benefiting from the global control perspective, the algorithm
showed very promising performance in experiments. In [19], Lin et al.
applied the SARSA algorithm to achieve QoS-aware adaptive
routing in multilayer hierarchical software-defined networks.

FIGURE 1 The closed-loop control paradigm.
