IEEE Circuits and Systems Magazine - Q4 2021 - 25

I. Introduction
The Mel-scale Frequency Cepstral Coefficients (MFCC)
algorithm is an effective feature extraction method that
converts the input speech signal into a series of acoustic
feature vectors. It is a fundamental computation in
deep neural network (DNN) based
Automatic Speech Recognition (ASR) systems [1]. In Veton's
work [2], through experimental comparison and
analysis, different feature extraction algorithms were
evaluated, including the MFCC, the linear predictive coefficients
[3], the perceptual linear prediction [4] and the
relative spectral analysis perceptual linear prediction [5].
The comparison results show that MFCC can achieve a
better trade-off between hardware overhead and recognition
accuracy under various background noises and
signal-to-noise ratios (SNRs). There are many works on
the design and optimization of MFCC, especially on how
to improve the accuracy of speech recognition in various
applications [6]. However, all these works require a large
amount of computation, which significantly increases
the hardware resources and power consumption of the
speech recognition system. Speech keywords recognition
is a widely used ASR task that provides the wake-up
mechanism for human-machine interaction interfaces
in many battery-powered devices, such as wearable devices,
mobile devices and Internet of Things (IoT) devices [7].
In these battery-powered devices, the speech keywords
recognition system is usually required to be always-on,
so the key requirement is low power consumption
while ensuring high recognition accuracy.
A typical keyword spotting (KWS) hardware processor is usually
composed of a feature extraction module using
MFCC and a classification module using various DNNs.
In such a processor, the power consumption of the feature
extraction module accounts for about 50% of the total system
power consumption [8]. However, most research works
focus on methods to improve the energy efficiency
of the DNN accelerator, and research on reducing
the power consumption of the MFCC module is still
lacking [9], [10]. In Liu's work [11], an energy-efficient
MFCC module for speech recognition was proposed with an
optimized Fast Fourier Transform (FFT) and reduced
bit-width, which reduces the hardware
power consumption by up to 27.0% with an accuracy loss
of less than 2%.
Previous works focus on either coarse-grained
or fine-grained optimizations of word length
and circuit structures, which affect the power consumption
of each computation and memory access. In this
paper, we focus on the low-power design and optimization
of precision-adaptive MFCC feature extraction.
In summary, we make the following key contributions:
■ To the knowledge of the authors, this is the first
approximate computing inspired MFCC architecture
for keywords recognition that can adapt
to different input speech frame lengths and maintain high
recognition accuracy under various background
noise scenarios. Compared to state-of-the-art designs,
the proposed MFCC consumes up to 76.3%
less power, while the accuracy
increases by 0.8%.
■ An 8-Stage Radix-2 Single-path Delay Feedback
FFT (R2SDF-FFT) structure with bit-width stepwise
quantization is utilized for the proposed power-constrained
MFCC, which reduces the memory size by 35.7%,
the number of transistors by 36.7%
and the power consumption by 37.4%. We
also propose an approximate multiplication and
addition architecture with Dual-Vdd to improve
the computing energy efficiency of the R2SDF-FFT,
which further reduces the energy cost of the
R2SDF-FFT by 47%.
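To make the idea of stage-wise quantization in a radix-2 FFT more concrete, the following is a minimal behavioural sketch in Python. It is not the proposed R2SDF-FFT hardware: it assumes a 256-point transform (2^8, one butterfly stage per entry), models only the rounding of fractional bits after each stage (no overflow, saturation, or delay-feedback pipelining), and the names `quantize`, `fft_stepwise`, and the bit-width schedule `bits` are hypothetical choices made for illustration.

```python
import numpy as np

def quantize(x, frac_bits):
    """Round real and imaginary parts to a grid with `frac_bits` fractional bits."""
    scale = 2.0 ** frac_bits
    return (np.round(x.real * scale) + 1j * np.round(x.imag * scale)) / scale

def fft_stepwise(x, frac_bits_per_stage):
    """Radix-2 decimation-in-frequency FFT, re-quantized after every stage.

    The input length must equal 2**len(frac_bits_per_stage).
    """
    x = np.asarray(x, dtype=complex)
    n = x.size
    stages = len(frac_bits_per_stage)
    assert n == 1 << stages, "input length must be 2**(number of stages)"
    half = n // 2
    for bits in frac_bits_per_stage:
        # Split every block into its upper and lower half and apply butterflies.
        blocks = x.reshape(-1, 2 * half)
        top, bottom = blocks[:, :half], blocks[:, half:]
        twiddle = np.exp(-2j * np.pi * np.arange(half) / (2 * half))
        out = np.concatenate([top + bottom, (top - bottom) * twiddle], axis=1)
        # Stage-specific precision: this is the "stepwise" part of the sketch.
        x = quantize(out.reshape(-1), bits)
        half //= 2
    # DIF produces bit-reversed output order; undo that permutation.
    idx = np.arange(n)
    rev = np.zeros(n, dtype=idx.dtype)
    for b in range(stages):
        rev |= ((idx >> b) & 1) << (stages - 1 - b)
    return x[rev]

# Hypothetical bit-width schedule for 8 stages (assumed, not from the paper).
bits = [14, 13, 12, 11, 10, 9, 8, 8]
frame = np.random.default_rng(0).uniform(-1.0, 1.0, 256)
spectrum = fft_stepwise(frame, bits)
max_error = np.max(np.abs(spectrum - np.fft.fft(frame)))
print("max deviation from float FFT:", max_error)
```

Comparing `spectrum` against `np.fft.fft(frame)` for different schedules gives a rough feel for how much numerical error a given per-stage precision introduces, which is the kind of trade-off a bit-width search would explore.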
The rest of this paper is organized as follows. Section II
presents the principle of the MFCC for a typical DNN
based speech keywords recognition system. Section III
presents the details of the proposed approximate computing
techniques that reduce the power consumption
of the MFCC while maintaining a high recognition accuracy.
The proposed approximate MFCC designs are then
evaluated and verified in a TSMC 22 nm ultra-low-leakage
(ULL) process technology. The implementation
results and analysis are given in Section IV. Section V
concludes the paper.
II. Principle and Analysis of MFCC Feature
Extraction for DNN Based Keywords Recognition
A typical MFCC extraction structure is mainly composed
of six fundamental sub-modules. The first is the pre-emphasis
module, which passes the input speech data
through a high-pass filter to amplify high-frequency
components. The second is the framing module, in which the
pre-emphasized output of the previous module is blocked
into frames of N samples. Each frame of the input speech
is 32 ms long, with a step of 16 ms. The third is
(Corresponding authors: Hao Cai (e-mail: hao.cai@seu.edu.cn), Weiqiang Liu (e-mail: liuweiqiang@nuaa.edu.cn), and Jun Yang (e-mail: dragon@seu.edu.cn).)
B. Liu, X. Ding, H. Cai, W. Zhu, J. Yang are with the National ASIC System Engineering Center, Southeast University, Nanjing, 210096, China. Z. Wang
is with Nanjing Prochip Electronic Technology Co. Ltd, Nanjing 210001, China. W. Liu is with the College of Electronic and Information Engineering,
Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China.
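As an illustration of the pre-emphasis and framing sub-modules described in Section II, the following Python sketch applies a first-order high-pass filter and then cuts the signal into 32 ms frames with a 16 ms step. The filter coefficient `alpha = 0.97` and the 16 kHz sample rate are assumptions made for this example (the text does not state them), and the function names `pre_emphasis` and `frame_signal` are hypothetical.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].

    alpha = 0.97 is a commonly used value, assumed here for illustration.
    """
    signal = np.asarray(signal, dtype=float)
    return np.concatenate(([signal[0]], signal[1:] - alpha * signal[:-1]))

def frame_signal(signal, sample_rate, frame_ms=32, step_ms=16):
    """Block a 1-D signal into overlapping frames of N samples each."""
    signal = np.asarray(signal, dtype=float)
    frame_len = sample_rate * frame_ms // 1000   # N samples per frame
    step = sample_rate * step_ms // 1000         # hop between frame starts
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // step
    starts = step * np.arange(n_frames)
    return np.stack([signal[s:s + frame_len] for s in starts])

# One second of synthetic audio at an assumed 16 kHz sample rate.
sample_rate = 16000
speech = np.random.default_rng(0).standard_normal(sample_rate)
frames = frame_signal(pre_emphasis(speech), sample_rate)
print(frames.shape)  # (number of frames, N)
```

With a 16 ms step and 32 ms frames, consecutive frames overlap by half, so each input sample (apart from those at the edges) appears in two frames.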