IEEE Solid-States Circuits Magazine - Fall 2022 - 19

The Quest for the Most Efficient
Processor Core
Whether ML workloads run at
the extreme edge, at the edge, or in the
cloud, the network's execution efficiency
is often benchmarked in
terms of energy (typically expressed
in joules per inference) and latency
(expressed in μs per inference
or clock cycles per inference). Hardware
designers also translate these
characteristics into more dedicated
hardware metrics, such as an ML processor's
throughput efficiency (often
expressed in TOPS/mm²) or compute
energy efficiency (often expressed
in TOPS/W) [2]. Custom
ML processor architects strive
to maximize these metrics
within a given area or cost
budget. In the early days of
custom ML processor development,
these optimizations
focused on single
ML processor cores.
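
As a rough illustration of how the workload-level and hardware-level metrics above relate, the following minimal sketch converts a TOPS/W figure into joules per inference and an effective TOPS figure into microseconds per inference. The function names and all numbers are hypothetical, and the sketch assumes the quoted efficiency and throughput are actually sustained for the whole inference, which is an idealization.

```python
OPS_PER_TERA = 1e12

def energy_per_inference_j(ops_per_inference, tops_per_watt):
    """Compute energy in joules for one inference.

    TOPS/W is tera-operations per second per watt,
    which is the same as tera-operations per joule.
    """
    return ops_per_inference / (tops_per_watt * OPS_PER_TERA)

def latency_per_inference_us(ops_per_inference, effective_tops):
    """Latency in microseconds, given the sustained (not peak) throughput."""
    seconds = ops_per_inference / (effective_tops * OPS_PER_TERA)
    return seconds * 1e6

# Hypothetical workload and processor figures, for illustration only.
ops = 1e9  # operations per inference
energy = energy_per_inference_j(ops, tops_per_watt=10.0)
latency = latency_per_inference_us(ops, effective_tops=2.0)
print(f"{energy * 1e6:.1f} uJ per inference")
print(f"{latency:.1f} us per inference")
```

The result covers compute energy only; data movement and underutilization are not modeled here.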
Looking at the architecture
template of a typical
ML processor (Figure 2),
the first requirement is
the realization of compact
and energy-efficient
data processing elements.
To this end, researchers
actively explored multiply-accumulate
(MAC) arrays with
reduced or variable precision
compute elements in the digital or
even analog domain [3], [12], [13],
[16], [24]. The exploitation of structured
or unstructured sparsity in the
weight kernels and/or activations
allowed researchers to further improve
MAC efficiency by skipping redundant
computations [21], [30].
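
The idea of skipping redundant computations can be sketched in software as a dot product that bypasses every multiplication with a zero operand. This is only an analogy for unstructured zero-skipping, not a description of how any of the cited processors implement it; the function name and the data are hypothetical.

```python
def zero_skipping_dot(weights, activations):
    """Dot product that skips MACs whose weight or activation is zero.

    Returns the accumulated sum and the number of MACs actually executed.
    """
    acc = 0
    executed_macs = 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:
            continue  # the product is zero, so the MAC is redundant
        acc += w * a
        executed_macs += 1
    return acc, executed_macs

# Hypothetical sparse operands: only the last pair has two nonzero values.
result, macs = zero_skipping_dot([0, 2, 0, 3], [5, 0, 7, 4])
print(result, macs)
```

Here a single MAC is executed instead of four, while the result is unchanged.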
While this led to excellent compute
efficiency of the datapath, it
quickly became apparent that the required
movement of data to and from
these compute elements can easily
deteriorate the complete system efficiency.
Every ML processor developed
in the last decade, therefore, has also
been strongly optimized to limit the
cost to access all data necessary to
keep the compute elements busy. As
Figure 2 (left) shows, a
convolution (CONV) operation, typical
for deep NNs, can be represented
as a set of nested for-loops in which
the same input and weight values are
shared across many inner-loop operations.
This compute pattern allows
one to share input data or aggregate
output data across multiple operators
to reduce the required number of
memory fetches and stores. Different
accelerator implementations exploit
this through a combination of spatial
data reuse and temporal data reuse.
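
The nested for-loop view of a CONV layer can be sketched as follows. This is a minimal reference loop nest, assuming stride 1, no padding, and a batch size of 1; the loop names follow the dimensions used in the text (K, C, OY, OX, Fy, Fx), while the function name and data layout are hypothetical.

```python
def conv_layer(inputs, weights, K, C, OY, OX, FY, FX):
    """Reference CONV loop nest (stride 1, no padding).

    inputs[c][iy][ix], weights[k][c][fy][fx] -> outputs[k][oy][ox]
    """
    outputs = [[[0] * OX for _ in range(OY)] for _ in range(K)]
    for k in range(K):                      # output channels
        for oy in range(OY):                # output rows
            for ox in range(OX):            # output columns
                for c in range(C):          # input channels
                    for fy in range(FY):    # filter rows
                        for fx in range(FX):  # filter columns
                            # The weight index does not depend on oy or ox, and
                            # the input index does not depend on k: the same
                            # values recur across many iterations.
                            outputs[k][oy][ox] += (
                                inputs[c][oy + fy][ox + fx] * weights[k][c][fy][fx]
                            )
    return outputs

# Tiny hypothetical layer: one 3x3 input channel, two 2x2 filters.
x = [[[1, 1, 1], [1, 1, 1], [1, 1, 1]]]
w = [[[[1, 1], [1, 1]]], [[[2, 2], [2, 2]]]]
y = conv_layer(x, w, K=2, C=1, OY=2, OX=2, FY=2, FX=2)
print(y)
```

Each weight is read OY × OX times and each input value is read once per output channel, which is the redundancy that spatial and temporal reuse exploit.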
Spatial data reuse denotes the reuse
of data across different compute
elements within the same clock cycle.
The multidimensional compute fabric
can, e.g., share input operands
among the MAC elements of one or
more dimensions of the array. Such
multicast hardware connectivity will
allow the scheduler to spatially unroll
for-loops along that dimension
of the MAC array to reduce the necessary
memory bandwidth of the MAC
array. It is important to note
that multicast connectivity on
the input (respectively weight) lines
allows one to spatially unroll only
those for-loops that do not influence
the corresponding input (respectively
weight) indices.
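
The multicast rule can be sketched as a small lookup: for each operand, list the loops that influence its indices, and treat every other loop as a candidate for spatial unrolling along a dimension where that operand is multicast. The sets below follow from the usual CONV index expressions (stride 1, batch dimension omitted), and the names are hypothetical.

```python
ALL_LOOPS = {"K", "C", "OY", "OX", "FY", "FX"}

# Loops that influence each operand's indices in a standard CONV layer.
RELEVANT_LOOPS = {
    "W": {"K", "C", "FY", "FX"},         # weights[k][c][fy][fx]
    "I": {"C", "OY", "OX", "FY", "FX"},  # inputs[c][oy + fy][ox + fx]
}

def multicast_unrollable(operand):
    """Loops that may be unrolled along a dimension that multicasts `operand`."""
    return ALL_LOOPS - RELEVANT_LOOPS[operand]

print(sorted(multicast_unrollable("I")))
print(sorted(multicast_unrollable("W")))
```

Under these assumptions, input multicast pairs with unrolling K, and weight multicast pairs with unrolling OY or OX.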
In the example of Figure 2 (right),
input data are reused along the horizontal
dimension of the MAC array,
making the input memory bandwidth
proportional to the number of MAC
rows instead of to the number of total
MAC elements in the array. This allows
one, for example, to spatially unroll the
output channel dimension (K) along the different
columns of the MAC array, as the input
indices are not influenced by K. Likewise,
spatial output accumulation can
take place in adder trees before storing
the layer results back into memory,
hence reducing output memory
bandwidth. In the example of Figure 2
(right), this is done along the vertical
array dimension, allowing one to unroll
the input channel dimension (C)
along the MAC array rows. Such
within-clock-cycle spatial reuse is indicated
by the parfor-loops in Figure 2 (left).
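
The bandwidth effect of this spatial reuse can be made concrete with a small calculation. The sketch below assumes a hypothetical 16 × 16 MAC array (the text does not state the array size of Figure 2), inputs multicast along each row, and partial sums reduced by an adder tree along each column.

```python
def words_per_cycle(rows, cols):
    """Operand traffic per clock cycle for a rows x cols MAC array."""
    total_macs = rows * cols
    return {
        # Every MAC fetches its own input vs. one input shared per row.
        "input_no_reuse": total_macs,
        "input_multicast": rows,
        # Every MAC emits a product vs. one adder-tree sum per column.
        "output_no_reuse": total_macs,
        "output_adder_tree": cols,
    }

traffic = words_per_cycle(rows=16, cols=16)  # hypothetical array size
for name, words in traffic.items():
    print(f"{name}: {words} words/cycle")
```

For this array size, input and output traffic each drop by a factor of 16 relative to an array without spatial reuse.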
A second type of data reuse, known
as temporal data reuse, exploits data
reuse across multiple clock cycles.
Here, the compute is temporally
scheduled in such a way that either
the output (output stationary) or one
of the two inputs (input or weight
stationary) can stay in place toward
future computations. The example of
Figure 2 (right) realizes output stationarity
at the level of the MAC array as
the lowest three temporal loops (Fy,
Fx, and C) are all irrelevant to the output
indices. The outputs can hence be
accumulated locally across 3 × 3 × 16
clock cycles before being sent back to
the SRAM. Stationarity not only
provides benefits at the lowest MAC/register
level but can also be exploited hierarchically
at each memory level.
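
The 3 × 3 × 16 figure is simply the product of the three innermost temporal loop bounds named in the text (Fy = 3, Fx = 3, C = 16). A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def cycles_per_output_writeback(fy, fx, c):
    """Clock cycles a partial sum stays in its local accumulator
    before the finished output is sent back to the SRAM."""
    return fy * fx * c

# Loop bounds taken from the example in the text.
cycles = cycles_per_output_writeback(fy=3, fx=3, c=16)
print(cycles)
```

Compared with a schedule that writes every partial sum back to the SRAM each cycle, the output traffic of each accumulator shrinks by this same factor.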
The temporal loop representation
of Figure 2 (left) indicates at which
memory level the weight/input/output
data tiles corresponding to each
of the for-loops are stored. From such
mapping, stationarity can be derived
for each memory level by analyzing
the inner temporal loop(s) of the
level above the memory boundary.
For example, in the case of Figure 2,
the lowest-level for-loops above the
SRAM are the OY and OX loops, which are
irrelevant to the weight indices. This schedule
can hence exploit temporal weight
reuse (stationarity) at the SRAM level
while the loop execution iterates
through the off-chip storage loops
(OX and OY).
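
This derivation can be sketched as a short routine: walk the temporal loops just above a memory boundary from the innermost outward, and multiply their bounds for as long as they are irrelevant to a given operand. The relevance sets assume a standard CONV layer with stride 1; the function name and the OX/OY bounds of 14 are hypothetical, while the Fy, Fx, and C bounds are those of the example in the text.

```python
# Loops that influence each operand's indices in a standard CONV layer.
RELEVANT = {
    "W": {"K", "C", "FY", "FX"},
    "I": {"C", "OY", "OX", "FY", "FX"},
    "O": {"K", "OY", "OX"},
}

def stationary_operands(loops_above_boundary):
    """Map each stationary operand to its temporal reuse factor.

    loops_above_boundary: list of (loop_name, bound), innermost loop first.
    """
    result = {}
    for operand, relevant in RELEVANT.items():
        reuse = 1
        for name, bound in loops_above_boundary:
            if name in relevant:
                break  # this loop changes the operand's indices
            reuse *= bound
        if reuse > 1:
            result[operand] = reuse
    return result

# Loops above the SRAM in the example (bounds are hypothetical).
sram_level = stationary_operands([("OX", 14), ("OY", 14)])
# Loops above the MAC-level registers in the example.
mac_level = stationary_operands([("FX", 3), ("FY", 3), ("C", 16)])
print(sram_level, mac_level)
```

With these inputs the routine reports weight stationarity at the SRAM level and output stationarity at the MAC level, matching the reading of Figure 2 given in the text.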
All these techniques limit the
number of data fetches required to
execute a specific NN layer. Such data
reuse drastically lowers bandwidth
needs and hence improves energy
efficiency and reduces latency losses
from memory-induced stalls.
Peak Performance Is
Not What Matters
The previous techniques help to
push processor peak performance
under realistic memory bandwidth
constraints. Yet the latency and energy
efficiency of running a complete
ML workload on a processor
core are not solely a function of the
core architecture itself. Effective
performance also strongly depends
on the deployed dataflow, i.e., on
