IEEE Circuits and Systems Magazine - Q3 2023 - 16

by STMicroelectronics, enables the automatic conversion
of pre-trained deep learning models to run on STM
MCUs with optimized kernel libraries [42]. TVM [93] and
AutoTVM [99] also support microcontrollers (referred
to as μTVM/microTVM [43]). Compilation techniques
can also be employed to reduce memory requirements.
For instance, Stoutchinin et al. [100] propose to improve
deep learning performance on MCUs by optimizing the
convolution loop nest. Liberis and Lane [44] and Ahn
et al. [101] propose reordering operator execution
to minimize peak memory, whereas Miao and Lin [102]
seek to achieve better memory utilization by temporarily
swapping data off SRAM. With a similar goal of reducing
peak memory, other researchers further propose
computing partial spatial regions across multiple layers
[103], [104], [105]. Additionally, CMix-NN supports
mixed-precision kernel libraries of quantized activations
and weights on MCUs to reduce the memory footprint [46].
TinyEngine, as part of MCUNet, is proposed as a memory-efficient
inference engine for expanding the search
space and fitting a larger model [8]. TinyEngine moves
the majority of operations from runtime to compile time
and generates only the code that will be executed
by the TinyNAS model. In addition, TinyEngine adapts
memory scheduling to the overall network topology as
opposed to layer-by-layer optimization. TensorFlow-Lite
Micro (TF-Lite Micro) is among the first deep-learning
frameworks to support bare-metal microcontrollers in
order to enable deep-learning inference on MCUs with
tight memory constraints [47]. However, the aforementioned
frameworks only support per-layer inference,
which limits the model capacity that can be executed
with only a small amount of memory and makes higher-resolution
input impossible. Hence, MCUNetV2 proposes
a generic patch-by-patch inference scheduling, which
operates on a small spatial region of the feature map
and drastically reduces peak memory usage, and thus
makes the inference with high-resolution input on MCUs
feasible [9]. TinyOps combines fast internal memory
with an additional slow external memory through a direct
memory access (DMA) peripheral to enlarge memory
size and speed up inference [49]. TinyMaix, similar to
CMSIS-NN, is an optimized inference kernel library, but
it eschews new but rare features and seeks to preserve
the readability and simplicity of the codebase [50].
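
The operator-reordering idea mentioned above [44], [101] can be illustrated with a toy example. The sketch below is not the algorithm of either paper: it assumes a hypothetical six-operator graph with made-up tensor sizes, assumes a tensor is freed as soon as its last consumer has run, and simply tries every valid execution order to find the one with the lowest peak memory.

```python
from itertools import permutations

# Hypothetical operator graph: op -> list of ops whose outputs it consumes.
INPUTS = {"a": [], "b": ["a"], "c": ["a"], "d": ["b"], "e": ["c"], "f": ["d", "e"]}
# Hypothetical output tensor sizes in KB.
SIZE_KB = {"a": 4, "b": 16, "c": 16, "d": 2, "e": 2, "f": 2}


def peak_memory(order, inputs, size):
    """Peak KB of live tensors when operators run in the given order."""
    # Number of consumers that still need each tensor.
    pending = {op: sum(op in inputs[c] for c in order) for op in order}
    live = peak = 0
    for op in order:
        live += size[op]            # output is allocated while inputs are still live
        peak = max(peak, live)
        for src in inputs[op]:      # free inputs whose last consumer just ran
            pending[src] -= 1
            if pending[src] == 0:
                live -= size[src]
    return peak


def is_valid(order, inputs):
    """True if every operator runs after all of its producers."""
    pos = {op: i for i, op in enumerate(order)}
    return all(pos[src] < pos[op] for op in order for src in inputs[op])


def best_order(inputs, size):
    """Exhaustively search all valid orders (only feasible for tiny graphs)."""
    valid = (o for o in permutations(inputs) if is_valid(o, inputs))
    return min(valid, key=lambda o: peak_memory(o, inputs, size))


default = ("a", "b", "c", "d", "e", "f")
print(peak_memory(default, INPUTS, SIZE_KB))
print(peak_memory(best_order(INPUTS, SIZE_KB), INPUTS, SIZE_KB))
```

In this toy graph, running both large branches side by side peaks at 36 KB, whereas finishing one branch before starting the other peaks at 22 KB. Exhaustive search is used only for clarity; it does not scale beyond a handful of operators.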
B. Recent Progress on TinyML Training
On-device training on small devices is gaining popularity,
as it enables machine learning models to be trained
and refined directly on small and low-power devices.
On-device training offers several benefits, including
the provision of personalized services and the protection
of user privacy, as user data is never transmitted
to the cloud. However, on-device training presents additional
challenges compared to on-device inference, due
to larger memory footprints and increased computing
operations needed to store intermediate activations and
gradients.
Researchers have been investigating ways to reduce
the memory footprint of training deep learning models.
One kind of approach is to design lightweight network
structures manually or by utilizing NAS [85], [106], [107].
Another common approach is to trade computation for
memory efficiency, such as freeing up activations during
the forward pass and recomputing the discarded activations during
backward propagation [108], [109]. However, such
an approach comes at the expense of increased computation
time, which is not affordable for tiny devices with
limited computation resources. Another approach is
layer-wise training, which can also reduce the memory
footprint compared to end-to-end training. However, it
is not as effective at achieving high levels of accuracy
[110]. Another approach reduces the memory footprint
by building a dynamic and sparse computation graph
for training by activation pruning [111]. Some researchers
propose different optimizers [112]. Quantization is
also a common approach that reduces the size of activation
during training by reducing the bitwidth of training
activation [113], [114].
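
The computation-for-memory trade-off of recomputation [108], [109] can be made concrete with a rough counting model. The sketch below assumes a plain chain of layers whose activations all have the same size, stores only the input of each segment of `segment` layers, and rebuilds the other activations of one segment at a time during the backward pass. The function name and the counting model are illustrative assumptions, not taken from the cited works.

```python
import math


def checkpoint_cost(n_layers, segment):
    """Return (peak stored activations, extra forward layer evaluations)."""
    n_ckpt = math.ceil(n_layers / segment)     # one stored input per segment
    recomputed = min(segment, n_layers) - 1    # rebuilt inside the active segment
    peak = n_ckpt + recomputed
    extra_forward = n_layers - n_ckpt          # each non-checkpoint input is rebuilt once
    return peak, extra_forward


# Sweep the segment length for a 16-layer chain.
for s in (1, 2, 4, 8, 16):
    print(s, checkpoint_cost(16, s))
```

Under these assumptions, a 16-layer chain with a segment length of 4 holds 7 activations at peak instead of 16, at the price of 12 additional layer evaluations, which is the extra computation the text identifies as costly for tiny devices.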
Due to limited data and computational resources, on-device
training usually focuses on transfer learning. In
transfer learning, a neural network is first pre-trained on
a large-scale dataset, such as ImageNet [115], and used
as a feature extractor [116], [117], [118]. Then, only the
last layer needs to be fine-tuned on a smaller, task-specific
dataset [119], [120], [121], [122]. This approach reduces
the memory footprint by eliminating the need to
store intermediate activations during training, but due
to the limited capacity, the accuracy can be poor when
the domain shift is large [52]. Fine-tuning all layers can
achieve better accuracy but requires a large amount of memory to
store activations, which is not affordable for tiny devices
[116], [117]. Recently, several memory-friendly on-device
training frameworks were proposed [123], [124], [125],
but these frameworks targeted larger edge devices (i.e.,
mobile devices) and cannot be adopted on MCUs. An alternative
approach is only updating the parameters of
batch normalization layers [126], [127]. This reduces the
number of trainable parameters, which, however, does
not translate into memory savings [52] because the intermediate
activations of the batch normalization layers still
need to be stored in memory.
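
The difference between these fine-tuning strategies can be sketched by counting which activations must be kept for the backward pass. The example below uses a hypothetical five-layer network with made-up activation sizes and one simplifying assumption: a layer's input activation must be stored only if that layer's own parameters are updated. Activations needed to propagate gradients through nonlinearities are ignored.

```python
# Hypothetical network: (layer name, kind, size of the layer's input activation in KB).
LAYERS = [
    ("conv1", "conv", 48),
    ("bn1", "bn", 64),
    ("conv2", "conv", 64),
    ("bn2", "bn", 32),
    ("fc", "linear", 1),
]


def stored_activation_kb(layers, trainable):
    """Sum the input activations that must be kept to compute parameter gradients."""
    return sum(size for name, kind, size in layers if trainable(name, kind))


full = stored_activation_kb(LAYERS, lambda name, kind: True)
bn_only = stored_activation_kb(LAYERS, lambda name, kind: kind == "bn")
last_only = stored_activation_kb(LAYERS, lambda name, kind: name == "fc")
print(full, bn_only, last_only)
```

In this toy setting, full fine-tuning keeps 209 KB of activations, updating only the batch normalization layers still keeps 96 KB even though those layers hold few parameters, and updating only the last layer keeps 1 KB.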
It has been shown that the activation of a neural network
is the main factor limiting the ability to train on
small devices. Tiny-transfer-learning (TinyTL) addresses
this issue by freezing the weights of the network and