Signal Processing - May 2017 - 37

An overview of spatial-audio techniques
based on psychoacoustics

Hüseyin Hacıhabiboğlu, Enzo De Sena, Zoran Cvetković,
James Johnston, and Julius O. Smith III

past nine decades of spatial audio reproduction and synthesis
have seen innovations and developments in all these directions.
Generating the experience of a spatial sound scene can be
achieved in a number of ways. At one extreme are binaural techniques [1], which provide a convincing experience over two channels by presenting
binaural cues, i.e., interaural time, level, and spectral
differences, in the signals delivered to the two ears, known as ear signals. Binaural presentations work best over headphones. However, with crosstalk cancelation [2], they can also be successfully used with a pair of
loudspeakers, although the effect is confined to a very narrow
listening area. For a listener who is not static, the auditory illusion can be maintained via head-tracking mechanisms combined
with the real-time adaptation of the binaural signals. The advent
of virtual and augmented reality systems has recently revived
interest in binaural systems. However, some inherent problems
of binaural audio, such as individualization, remain [3], limiting
the spatial quality of the auditory experience they provide.
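As a rough illustration of the size of one of the cues mentioned above, the interaural time difference can be estimated with the classic spherical-head (Woodworth) approximation. This is a minimal sketch and is not taken from the article: the function name, the head radius of 8.75 cm, and the speed of sound of 343 m/s are assumptions chosen for illustration, and the formula holds only for a distant source at azimuths between 0 and 90 degrees.

```python
import math


def woodworth_itd(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    """Approximate interaural time difference (seconds) for a far-field source.

    Uses the spherical-head (Woodworth) formula ITD = (a / c) * (theta + sin(theta)),
    where theta is the source azimuth in radians (0 = straight ahead).
    The default head radius and speed of sound are illustrative assumptions.
    """
    if not 0.0 <= azimuth_deg <= 90.0:
        raise ValueError("formula is only valid for azimuths in [0, 90] degrees")
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound) * (theta + math.sin(theta))


if __name__ == "__main__":
    # Print the estimated ITD, in microseconds, for a few azimuths.
    for azimuth in (0, 30, 60, 90):
        print(azimuth, round(woodworth_itd(azimuth) * 1e6, 1))
```

Under these assumptions the estimate grows from zero for a frontal source to roughly two thirds of a millisecond for a source directly to one side, which gives a sense of the timing precision a binaural renderer has to preserve.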
At the other extreme, there are systems that aim to reconstruct an accurate physical approximation of a sound field.
Notable examples include wave field synthesis (WFS) [4]
and higher-order Ambisonics (HOA) [5]. WFS is based on
the Huygens principle and Kirchhoff-Helmholtz integral,
which together state that the sound field due to a primary
source can be exactly synthesized by infinitely many secondary sources on the surface enclosing a reproduction volume.
Such a system can achieve a spatially extensive listening area
and can be used in large auditoriums, such as film theaters.
Ambisonics approximates the sound field by its
spherical-harmonic expansion about the center of the listening area.
HOA is capable of achieving results comparable to WFS
close to the center of the reproduction rig. While both WFS
and HOA provide elegant solutions to the spatial recording
and reproduction problem, they have high equipment load
requirements, which can reach several hundreds of carefully
positioned loudspeakers. For this reason, their application
domain has so far been confined to specialist high-end systems. WFS and HOA can also run on systems with a more
practical equipment load by including perception-inspired
corrections. Comprehensive reviews of WFS and HOA have
recently been published [6], [7].
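To make the spherical-harmonic idea concrete, the lowest (first) order can be sketched in a few lines. The example below is an assumption-laden illustration, not the article's method: it encodes one sample of a plane-wave source into the four traditional B-format channels, using the convention in which the omnidirectional channel W is scaled by 1/sqrt(2) (other normalizations exist), and then forms a virtual cardioid microphone from those channels. The function names are hypothetical.

```python
import math


def _unit_vector(azimuth_deg, elevation_deg):
    """Cartesian unit vector for a direction given in degrees."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(az) * math.cos(el),
            math.sin(az) * math.cos(el),
            math.sin(el))


def encode_first_order(sample, azimuth_deg, elevation_deg=0.0):
    """Encode one sample of a plane-wave source into (W, X, Y, Z).

    Assumes the traditional B-format weighting, with W scaled by 1/sqrt(2).
    """
    ux, uy, uz = _unit_vector(azimuth_deg, elevation_deg)
    return (sample / math.sqrt(2.0), sample * ux, sample * uy, sample * uz)


def virtual_cardioid(bformat, azimuth_deg, elevation_deg=0.0):
    """Output of a virtual cardioid microphone aimed at the given direction."""
    w, x, y, z = bformat
    ux, uy, uz = _unit_vector(azimuth_deg, elevation_deg)
    return 0.5 * (math.sqrt(2.0) * w + x * ux + y * uy + z * uz)


if __name__ == "__main__":
    encoded = encode_first_order(1.0, azimuth_deg=45.0)
    # Aim the virtual microphone toward the source, then directly away from it.
    print(virtual_cardioid(encoded, 45.0), virtual_cardioid(encoded, 225.0))
```

With only four channels the directional resolution is coarse, which is why higher orders, and correspondingly more loudspeakers, are needed to widen the area over which the approximation holds.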

In between these two extremes are systems with five to ten
channels that are suitable for use in small to medium-size listening rooms. Such systems do not possess a sufficient number
of channels to physically reconstruct a sound field in a wide
listening area, nor are they capable of accurately reconstructing the ear signals for listeners in multiple locations. Therefore,
they must rely to a large degree on perceptual effects similar
to those used for binaural systems to generate the illusion of a
desired sound field within an area that is not overly confined.
As with recording and reproduction technologies, there are
many techniques for sound-field simulation. At one extreme,
there are physically motivated methods, which aim to calculate an approximate solution of the wave equation. For that
purpose, several numerical methods have been developed that
achieve a very high level of accuracy. However, they typically
have prohibitively high computational costs. Examples include
the finite-difference time-domain (FDTD) method, the finite element method
(FEM), and the boundary element method (BEM) [8]. While these
approaches lend themselves to parallelization, the associated
computational cost is still too high for real-time operation at
interactive rates and on low-cost devices.
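The flavor of such solvers can be conveyed by a one-dimensional finite-difference time-domain update. This is a minimal sketch under stated assumptions, not a room simulator: the grid size, the Courant number, the impulsive initial condition, and the zero-pressure ends are all illustrative choices, and the function name is hypothetical.

```python
def fdtd_1d(num_cells=101, num_steps=20, courant=1.0):
    """Leapfrog finite-difference update of the 1-D wave equation.

    courant is c * dt / dx and must lie in (0, 1] for the scheme to be stable.
    The field starts as a unit impulse in the middle cell; both ends are
    held at zero pressure. Returns the field after num_steps time steps.
    """
    if not 0.0 < courant <= 1.0:
        raise ValueError("Courant number must be in (0, 1]")
    lam2 = courant ** 2
    previous = [0.0] * num_cells
    current = [0.0] * num_cells
    current[num_cells // 2] = 1.0
    for _ in range(num_steps):
        nxt = [0.0] * num_cells  # end cells stay at zero
        for i in range(1, num_cells - 1):
            laplacian = current[i + 1] - 2.0 * current[i] + current[i - 1]
            nxt[i] = 2.0 * current[i] - previous[i] + lam2 * laplacian
        previous, current = current, nxt
    return current


if __name__ == "__main__":
    field = fdtd_1d()
    # Report the outermost cells reached by the disturbance.
    reached = [i for i, value in enumerate(field) if value != 0.0]
    print(min(reached), max(reached))
```

Every time step visits every grid cell, and in three dimensions the number of cells grows with the cube of the highest frequency to be resolved while the time step shrinks in proportion, which is the source of the cost noted above.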
Conversely, there are methods that try to render only some
higher-level perceptual effects. These methods, called artificial reverberators, require only a fraction of the computational
load associated with physically motivated room simulators and
typically aim to mimic only certain characteristics of the tail of
typical room impulse responses, such as modal density, echo
density, and timbral quality [9]. They do not explicitly model
a given physical space but, rather, are used to obtain a pleasing
reverberant effect and have been widely used for artistic purposes in music production.
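A classic building block of such reverberators is the feedback comb filter, several of which are commonly run in parallel. The sketch below is an illustrative assumption rather than any specific design from [9]: the delay lengths and feedback gain are hypothetical values, and practical designs add further stages to raise echo density and shape timbre.

```python
def feedback_comb(signal, delay, gain):
    """Feedback comb filter: y[n] = x[n] + gain * y[n - delay]."""
    if delay < 1 or abs(gain) >= 1.0:
        raise ValueError("need delay >= 1 and |gain| < 1 for a decaying response")
    output = [0.0] * len(signal)
    for n, x in enumerate(signal):
        feedback = output[n - delay] if n >= delay else 0.0
        output[n] = x + gain * feedback
    return output


def comb_bank(signal, delays=(23, 29, 37, 41), gain=0.7):
    """Average of parallel feedback combs with mutually prime delays.

    The default delays (in samples) and gain are illustrative only.
    """
    branches = [feedback_comb(signal, d, gain) for d in delays]
    return [sum(values) / len(branches) for values in zip(*branches)]


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 199
    response = comb_bank(impulse)
    # Count the nonzero echoes produced in the first 200 samples.
    print(sum(1 for value in response if value != 0.0))
```

Each comb costs one multiplication and one addition per sample regardless of the room being imitated, which is why this family of methods is so much cheaper than solving the wave equation.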
In between these two extremes are methods that aim to render a certain physical sound scene, but by modeling only its
most perceptually relevant aspects. Full-blown room auralization systems typically aim to render each and every reflection
and diffraction up to a given order for each source [10], [11].
More recent methods achieve remarkable computational savings by accurately rendering only first-order reflections, while
replacing higher-order reflections with their progressively
coarser approximations [12]. Further computational savings are
possible by eliminating sources whenever they are inaudible, a

IEEE Signal Processing Magazine, May 2017
