IEEE Computational Intelligence Magazine - November 2022 - 49

E2EBP. Therefore, we do not consider it a training
method that works in the general set-up, as the rest of the
methods in this paper do; it is included here only for
completeness.
b) Optimality Guarantees: Optimality can be trivially guaranteed
assuming that the synthetic gradient models perfectly
perform their regression task, i.e., that they perfectly approximate
local gradients using module activations. However, this
assumption is almost never satisfied in practice.
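As a minimal sketch of what "perfectly performing the regression task" means, the following Python snippet fits a synthetic gradient model to the true local gradient of a toy loss. Everything in it is an illustrative assumption rather than the set-up of any cited work: the loss L = 0.5 * ||a - target||^2, the affine model g(a) = M a + c, and the names `true_gradient`, `LinearSyntheticGradient`, `regress`, and `fit` are all hypothetical. The toy loss is chosen so that the true gradient is affine in the activation, which is what lets the regression become essentially exact here.

```python
import random


def true_gradient(a, target):
    """dL/da for the assumed toy loss L = 0.5 * ||a - target||^2."""
    return [ai - ti for ai, ti in zip(a, target)]


class LinearSyntheticGradient:
    """Hypothetical affine model g(a) = M a + c that predicts dL/da from a."""

    def __init__(self, dim):
        self.M = [[0.0] * dim for _ in range(dim)]
        self.c = [0.0] * dim

    def predict(self, a):
        return [sum(m * x for m, x in zip(row, a)) + ci
                for row, ci in zip(self.M, self.c)]

    def regress(self, a, grad, lr=0.1):
        """One SGD step on 0.5 * ||g(a) - grad||^2; returns the pre-update squared error."""
        err = [p - g for p, g in zip(self.predict(a), grad)]
        for i, e in enumerate(err):
            for j, x in enumerate(a):
                self.M[i][j] -= lr * e * x
            self.c[i] -= lr * e
        return sum(e * e for e in err)


def fit(model, target, steps, rng):
    """Regress the model onto true gradients at randomly sampled activations."""
    for _ in range(steps):
        a = [rng.uniform(-1.0, 1.0) for _ in target]
        model.regress(a, true_gradient(a, target))
    return model


target = [0.5, -0.3, 0.8]
model = fit(LinearSyntheticGradient(len(target)), target, steps=2000,
            rng=random.Random(0))
a = [0.2, -0.6, 0.1]
residual = max(abs(p - g)
               for p, g in zip(model.predict(a), true_gradient(a, target)))
print(f"max |synthetic - true| gradient entry: {residual:.2e}")
```

With a realistic loss and a deep upstream network, the true gradient is not an affine function of the activation, so a residual of this kind would generally stay well above zero.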
c) Advantages:
❏ While Proxy Objective and Target Propagation methods
all help reduce computational load and memory usage,
Synthetic Gradients methods can enable parallelized training
of network modules. This can further reduce training
cost, since the update of a given module does not need to
wait for those of other modules except when the synthetic
input or synthetic gradient models are being updated. This
advantage is shared by a variant of Proxy Objective [23],
and it was shown in [23] that this Proxy Objective variant
performs much better than Synthetic Gradients.
❏ Synthetic Gradients can be used to approximate true backpropagation
through time (unrolled for an unlimited number
of steps) when training recurrent networks. It was shown
in [16] that this makes learning long-term dependencies
much more effective than the usual
truncated backpropagation through time.
d) Current Limitations and Future Work:
❏ As with Target Propagation, the auxiliary models require
extra human and machine resources. Also as with Target Propagation,
there is no empirical evidence that Synthetic Gradients
can scale to challenging benchmark datasets or
competitive models.
❏ Synthetic Gradients methods do not enable truly weakly
modular training in general.
IV. Other Non-E2E Training Approaches
A. Methods Motivated Purely by Biological Plausibility
Arguably the most notable set of works left out of this survey
consists of those studying alternatives to E2EBP purely from the perspective
of biological plausibility (for a few examples, see [25],
[26], [33], [35], [67]). These training methods were generally
motivated purely by our understanding of how the human brain
works, and were therefore claimed to be more "biologically
plausible" than E2EBP. However, biological plausibility in
itself does not lead to provable optimality. Although
these methods are of great value from a biology standpoint,
they have been significantly outperformed by E2EBP on
meaningful benchmark datasets [63]. Below, arguably the
most popular family of biologically plausible alternatives to
E2EBP, Feedback Alignment, is discussed.
One major argument criticizing E2EBP's lack of biological
plausibility states that E2EBP requires each neuron
to have precise knowledge of all of the downstream neurons
(end-to-end backward pass), whereas the human brain is
not believed to demonstrate such a precise pattern of reciprocal
connectivity [26]. This issue is known as the "weight
transport" problem [26]. To solve the weight transport
problem, [26] proposed to use fixed, random weights in
place of the actual network weights during the backward pass,
breaking the symmetry between the weights used during the forward
and backward passes and thus solving the problem.
This eliminates the need for a true end-to-end backward
pass. This family of methods is called Feedback Alignment.
[67] proposed to use fixed, random weights that share signs
with the actual network weights. [25] proposed two more
revisions of the original Feedback Alignment instantiation.
During the backward pass, instead of using the backpropagated
supervision (with random weights instead of real
weights) to provide the gradient as the original instantiation
does, these alternative versions directly use the error at the
output (with potential modulation by random matrices to
make the dimensionality match for each layer).
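Suppose the model is written as $f(x, W_1, W_2, W_3) = \sigma_3(W_3 \sigma_2(W_2 \sigma_1(W_1 x))) = f_3(f_2(f_1(x, W_1), W_2), W_3)$, where
$W_1, W_2, W_3$ are trainable weight matrices and $\sigma_1, \sigma_2, \sigma_3$
are activation functions. During any step in gradient descent,
suppose the forward pass has been done and let $a_3 = f(x, W_1, W_2, W_3)$,
$a_2 = f_2(f_1(x, W_1), W_2)$, $a_1 = f_1(x, W_1)$, and
$b_2 = W_2 a_1$, where $W_1, W_2$ are the current network weights.
E2EBP computes the gradient in $f_1$ as
$$\frac{\partial L}{\partial a_1} = \frac{\partial L}{\partial a_3} \frac{\partial a_3}{\partial a_2} \frac{\partial \sigma_2}{\partial b_2} W_2. \tag{22}$$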
The original instantiation of Feedback Alignment in [26] computes
this gradient with $W_2$ substituted by some fixed, random
matrix $B_2$. The variant proposed in [67] substitutes $W_2$ with a
fixed, random matrix $B_2$, with the only constraint being that
each element of $B_2$ shares the same sign as the corresponding
element in $W_2$. The study [25] proposed to compute this
gradient as
$$\frac{\partial L}{\partial a_1} = \frac{\partial L}{\partial a_3} \frac{\partial \sigma_2}{\partial b_2} C_2, \tag{23}$$
where $C_2$ is a fixed, random matrix with appropriate
dimensionality. The error at the output ($\partial L / \partial a_3$), after modulation
by some random matrix $C_2$, is used for training in place
of backpropagated supervision.
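The difference between Eqs. (22) and (23) and the random-weight substitution can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation of any cited work: the activations are all taken to be tanh, the loss is assumed to be L = 0.5 * ||a3 - y||^2, gradients are treated as row vectors to match the layout of the equations, and the output a3 is given the same width as a2 so that Eq. (23) is dimensionally valid exactly as written. The helper names (`matvec`, `vecmat`, `rand_matrix`, `sign_matched`, `tanh_prime`, `feedback_signals`) are hypothetical.

```python
import math
import random


def matvec(M, v):
    """Column-vector product M @ v, with M stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def vecmat(v, M):
    """Row-vector product v @ M, matching the layout of Eqs. (22)-(23)."""
    return [sum(v[i] * M[i][j] for i in range(len(v)))
            for j in range(len(M[0]))]


def rand_matrix(rows, cols, rng):
    """Fixed random matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def sign_matched(B, W):
    """Give each entry of B the sign of the matching entry of W (variant in [67])."""
    return [[math.copysign(abs(b), w) for b, w in zip(b_row, w_row)]
            for b_row, w_row in zip(B, W)]


def tanh_prime(b):
    """Diagonal of d sigma / d b for the assumed activation sigma = tanh."""
    return [1.0 - math.tanh(z) ** 2 for z in b]


def feedback_signals(x, y, W1, W2, W3, B2, C2):
    """Return dL/da1 computed three ways: Eq. (22), W2 -> B2, and Eq. (23)."""
    # Forward pass.
    a1 = [math.tanh(z) for z in matvec(W1, x)]
    b2 = matvec(W2, a1)
    a2 = [math.tanh(z) for z in b2]
    b3 = matvec(W3, a2)
    a3 = [math.tanh(z) for z in b3]

    # Assumed loss L = 0.5 * ||a3 - y||^2, so dL/da3 = a3 - y.
    dL_da3 = [o - t for o, t in zip(a3, y)]
    # (dL/da3)(da3/da2) = (dL/da3 * sigma3'(b3)) W3.
    dL_da2 = vecmat([g * d for g, d in zip(dL_da3, tanh_prime(b3))], W3)
    ds2 = tanh_prime(b2)

    # Eq. (22): true backpropagated signal, routed through W2.
    local = [g * d for g, d in zip(dL_da2, ds2)]
    bp = vecmat(local, W2)
    # Feedback Alignment: the same expression with W2 replaced by B2.
    fa = vecmat(local, B2)
    # Eq. (23): output error scaled by d sigma2 / d b2, then mapped by C2.
    direct = vecmat([g * d for g, d in zip(dL_da3, ds2)], C2)
    return bp, fa, direct


rng = random.Random(0)
n0, n1, n2, n3 = 4, 3, 2, 2  # n3 == n2 so Eq. (23) type-checks as written
W1 = rand_matrix(n1, n0, rng)
W2 = rand_matrix(n2, n1, rng)
W3 = rand_matrix(n3, n2, rng)
B2 = rand_matrix(n2, n1, rng)  # fixed random feedback weights
C2 = rand_matrix(n2, n1, rng)  # fixed random map for the output error
x = [rng.uniform(-1.0, 1.0) for _ in range(n0)]
y = [0.0] * n3
bp, fa, direct = feedback_signals(x, y, W1, W2, W3, B2, C2)
```

Passing `sign_matched(B2, W2)` in place of `B2` gives the sign-sharing variant. Only the first returned signal uses the forward weights in the backward pass; the other two never read $W_2$ after the forward pass.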
B. Auxiliary Variables
Another important family of E2EBP-free training methods is
the Auxiliary Variables methods [34], [40], [53], [54], [55],
[56], [58], [59], [60], [61], [62]. These methods introduce auxiliary
trainable variables that approximate the hidden activations
in order to achieve parallelized training. Despite the
strong theoretical guarantees, the introduced auxiliary variables
may pose scalability issues and, more importantly, these methods
require special, often tailor-made alternating solvers. And