IEEE Computational Intelligence Magazine - August 2021 - 11

DNNs that are suitable for a broad range of different memory
usage and energy consumption requirements. Under negligible
accuracy loss, EMOMC improves the energy efficiency and
model compression rate of VGG-16 on CIFAR-10 by a factor of
more than 8.9× and 2.4×, respectively.
I. Introduction
Deep neural networks (DNNs) are artificial neural networks
with more than three layers (i.e., more than one
hidden layer), which progressively extract higher-level
features from the raw input in the learning process.
They have delivered the state-of-the-art accuracy on various
real-world problems, such as image classification, face recognition,
and language translation [1]. The superior accuracy of
DNNs, however, comes at the cost of high computational and
space complexity. For example, the VGG-16 model [2] has about
138 million parameters, which requires over 500 MB memory
for storage and 15.5G multiply-and-accumulates (MACs) to
process an input image with 224 × 224 pixels. In myriad application
scenarios, it is desirable to make the inference on edge
devices rather than on cloud, for reducing the latency and
dependency on connectivity and improving privacy and security.
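As a back-of-envelope check on the figures quoted above (the arithmetic below is ours, not the paper's), 138 million parameters stored as 32-bit floats already account for the bulk of the stated storage cost:

```python
# Rough storage footprint of VGG-16, using the figures quoted in the text:
# ~138 million parameters, each stored as a 32-bit (4-byte) float.
params = 138_000_000
bytes_per_param = 4                      # float32
size_mb = params * bytes_per_param / (1024 ** 2)
print(f"{size_mb:.0f} MB")               # ≈ 526 MB, consistent with "over 500 MB"
```

This is weights alone; activations and bookkeeping push the real memory usage higher, which is why such models strain edge devices.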
Many of the edge devices that perform DNN inference have
stringent limitations on energy consumption, memory capacity,
etc. Large-scale DNNs [3], [4] are therefore difficult to deploy
on edge devices, which hinders their wide application.
Efficient processing of DNNs for inference has become increasingly
important for deployment on edge devices. For generating
efficient DNNs, many neural architecture search (NAS)
approaches have been developed in recent years [5]-[7]. One way
of carrying out NAS is to search from scratch [8], [9]. In contrast,
model compression1 [10] searches for the optimal networks starting
from a well-trained network. For instance, to reduce the storage
requirement of DNNs, Han et al. proposed a three-stage pipeline
(i.e., pruning, trained quantization, and Huffman coding) to compress
redundant weights [10]. Wang et al. suggested removing
redundant convolution filters to reduce the model size [11]. Rather
than reducing the model size, a few attempts [12], [13] have been
made to compress DNNs directly by taking the energy consumption
as the feedback signal. They have achieved promising results in
reducing the size of the weight parameters (or the energy consumption).
However, these approaches require the compressed model to suffer
almost no loss of accuracy, which makes the resulting solutions less flexible.
In practice, different users often have distinct preferences on
desirable objectives, e.g., accuracy, model size, energy efficiency, and
latency, when they select the optimal DNN model for their applications.
In this paper, a novel approach, called Evolutionary Multi-Objective
Model Compression (EMOMC), is proposed to
optimize energy efficiency/model size and accuracy simultaneously.
By considering network pruning and quantization, the model
compression is formulated as a multi-objective problem under different
dataflow designs and parameter coding schemes. Each candidate
architecture can be regarded as an individual in the
1The technique aims to shrink the size of the neural network model without a significant
drop of accuracy.
evolutionary population. Owing to the cooperation and interplay
between different architectures in the evolving population,
a set of compact DNNs that offer trade-offs on different objectives
(e.g., accuracy, energy efficiency, and model size) can be obtained in
a single run. Unlike most existing approaches which aim to reduce
the size of weight parameters or the energy consumption with no
significant loss of accuracy, the proposed approach attempts to
achieve a good balance between desired objectives, for meeting the
requirements of different edge devices. Experimental results demonstrate
that the proposed approach can obtain a diverse population
of compact DNNs for customized requirements of accuracy,
memory capacity, and energy consumption.
The novelty and main contributions of this work can be
summarized as follows:
❏ The model compression problem is formulated as a multi-objective
problem. The optimal solutions are searched for in the
network pruning and quantization space using a population-based
algorithm.
❏ To speed up the population evolution, a two-stage pruning/
quantization co-optimization strategy is developed based on
the orthogonality between pruning and quantization.
❏ The trade-offs between accuracy, energy efficiency, and
model size in model compression are explored by considering
different dataflow designs and parameter coding schemes.
The experimental results demonstrate that the proposed
method can obtain a set of diverse Pareto optimal solutions in
a single run. Also, it achieves a considerably higher energy
efficiency than current state-of-the-art methods.
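The Pareto notion behind "a set of diverse Pareto optimal solutions" can be made concrete with a short sketch. With all objectives cast as minimization (e.g., error, model size, energy), one candidate dominates another if it is no worse in every objective and strictly better in at least one; the Pareto front is the non-dominated subset. The function and variable names below are illustrative, not from the paper:

```python
# Minimal sketch of Pareto dominance and the non-dominated (Pareto) front,
# with every objective minimized. Candidate data below is hypothetical.

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the non-dominated subset of a list of objective vectors."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other != c)]

# (error rate, model size in MB) for four hypothetical compressed networks
nets = [(0.06, 40.0), (0.08, 25.0), (0.05, 60.0), (0.09, 45.0)]
print(pareto_front(nets))  # (0.09, 45.0) drops out: (0.08, 25.0) dominates it
```

A population-based search maintains exactly such a front, which is why a single run can hand different users different accuracy/size/energy trade-offs.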
II. Preliminaries
Network pruning and quantization are two commonly used
model compression techniques to improve the energy efficiency
in model inference and/or to shrink the size of the model.
Moreover, the dataflow design employed by edge devices and the
coding scheme applied to store the weight matrix both have a
significant impact on the performance of model compression.
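To make the quantization half of the pair concrete, here is a minimal sketch of uniform symmetric weight quantization; the specific scheme and the helper names are our illustrative assumptions, not the paper's exact method:

```python
# Sketch of uniform symmetric b-bit weight quantization: weights are snapped
# to equally spaced levels between -max|w| and +max|w|, then mapped back to
# floats. Illustrative only; not the paper's exact quantization scheme.

def quantize(weights, bits):
    """Round float weights to 2**(bits-1) - 1 symmetric levels per sign."""
    max_abs = max(abs(w) for w in weights)
    levels = 2 ** (bits - 1) - 1            # e.g., 7 levels per sign for 4 bits
    scale = max_abs / levels if levels else 1.0
    return [round(w / scale) * scale for w in weights]

w = [0.81, -0.42, 0.05, -0.93]
print(quantize(w, 4))  # each weight snapped to one of the 4-bit grid points
```

Fewer bits per weight directly shrinks storage and, on suitable hardware, cuts the energy per MAC, which is the link to the dataflow and coding-scheme discussion below.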
A. Network Pruning and Quantization
To make training easier, networks are usually over-parameterized
[14]. Pruning is a widely-used model compression
technique that can effectively reduce the energy consumption of
edge devices and shrink the model size [10]. Network pruning
removes some of the redundant parameters in the network by
setting their values as zeros. A well-trained neural network usually
contains a large number of weights whose values are relatively
small compared to other parameters. In most cases, these parameters
are not particularly important when performing model inference.
Hence, one can sort all the parameters in the model and
replace those with the smallest absolute values by zeros,
while the accuracy of the model can still be maintained. For
instance, if the pruning amount is set to 33%, then one-third of the
parameters in the model will be replaced by zeros. In the inference
process, if the processing elements (PEs)2 whose input
2The PE is a basic unit to conduct computation in processors.
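The magnitude-based pruning described above can be sketched in a few lines; the function and variable names are illustrative, and real implementations prune per layer over weight tensors rather than a flat list:

```python
# Sketch of magnitude-based pruning: sort parameters by absolute value and
# zero out the smallest fraction (33% here, matching the example in the text).
# Illustrative names; real pruning operates on per-layer weight tensors.

def prune(weights, amount):
    """Zero out the `amount` fraction of weights with smallest magnitude."""
    k = int(len(weights) * amount)                 # number of weights to zero
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else -1.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

w = [0.9, -0.05, 0.4, 0.02, -0.7, 0.1]
print(prune(w, 1 / 3))  # the two smallest-magnitude weights become zero
```

The zeroed positions are what lets PEs skip computation at inference time, since multiplications by zero contribute nothing to the accumulated sum.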