Signal Processing - May 2017 - 43

Perceptually motivated multichannel
recording and reproduction
There has been recent work on developing systematic frameworks for the design of multichannel stereo systems, most notably vector-base amplitude panning (VBAP), directional audio coding (DirAC), and perceptual sound-field reconstruction (PSR).

Vector-base amplitude panning
It was shown as early as 1973 that tangent panning provides a stereophonic image that is more robust to head rotations than sine
panning for the standard stereophonic loudspeaker setup [30].
Pulkki showed that tangent panning can be expressed using an
equivalent, vector-based formulation in the horizontal plane and
also proposed a three-dimensional (3-D) extension to two-channel intensity panning that allows rendering elevated virtual sources over flexible loudspeaker rigs [41]. This method is VBAP.
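As a concrete illustration (the function below and the ±30° base angle of the standard stereo setup are this sketch's assumptions, not code from [30] or [41]), tangent panning for a two-channel setup can be written as:

```python
import math

def tangent_pan(source_deg, base_deg=30.0):
    """Stereo gains (gL, gR) from the tangent law
    tan(theta_s) / tan(theta_0) = (gL - gR) / (gL + gR),
    power-normalized so that gL^2 + gR^2 = 1."""
    t = math.tan(math.radians(source_deg)) / math.tan(math.radians(base_deg))
    gL, gR = (1.0 + t) / 2.0, (1.0 - t) / 2.0   # any solution, up to scale
    n = math.hypot(gL, gR)
    return gL / n, gR / n

# A centered source receives equal gains; a source at the loudspeaker
# direction (30 degrees) is fed to that loudspeaker only.
gL, gR = tangent_pan(0.0)
```
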
Originally, VBAP was designed for a loudspeaker array with elements placed on the vertices of a geodesic dome, situated in the acoustic far field of the listener. Figure 3 shows
a section of such a sphere with three loudspeakers, with a
listener positioned at the center of the array. The directions
of the three loudspeakers are indicated as v1, v2, and v3, and the corresponding gains as g1, g2, and g3. A virtual source in a direction vs between the loudspeakers can be generated by selecting the gains that satisfy vs = Vg, where V is a matrix whose columns are the directions of the loudspeakers and g = [g1 g2 g3]^T. In addition, the calculated loudspeaker gains
are normalized to keep the total power constant.
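As a numerical sketch of this gain computation (the loudspeaker directions below are illustrative assumptions, not a setup from the article):

```python
import numpy as np

def unit_vector(az_deg, el_deg):
    """Unit direction vector for azimuth/elevation in degrees."""
    az, el = np.radians(az_deg), np.radians(el_deg)
    return np.array([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)])

def vbap_gains(v_s, V):
    """Solve v_s = V g for the triplet gains and normalize them so that
    g1^2 + g2^2 + g3^2 = 1 (constant total power)."""
    g = np.linalg.solve(V, v_s)   # V: 3x3, loudspeaker directions as columns
    return g / np.linalg.norm(g)

# Loudspeakers at azimuth +/-30 deg (ear level) and at 45 deg elevation:
V = np.column_stack([unit_vector(-30, 0), unit_vector(30, 0), unit_vector(0, 45)])
g = vbap_gains(unit_vector(0, 20), V)   # virtual source inside the triplet
# All three gains come out non-negative and the total power equals 1.
```
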
On the full geodesic sphere, active regions are selected based on the closest three grid points, and only those loudspeakers are used to render the source. This is in contrast
with physically based approaches such as Ambisonics, where
even for a single source from a single direction, all loudspeakers are potentially active. A major assumption behind VBAP in
three dimensions is that summing localization would occur not
only with two sources but also with three. This assumption was tested subjectively for different setups and virtual source directions and was shown to yield good localization accuracy for elevated virtual sources [42], [43].
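The selection of the active triplet can be sketched by testing candidate triplets for all-non-negative gains; the four-loudspeaker grid and its triplets below are hypothetical (a real system would use the triangulation of the geodesic grid):

```python
import numpy as np

def unit_vector(az_deg, el_deg):
    """Unit direction vector for azimuth/elevation in degrees."""
    az, el = np.radians(az_deg), np.radians(el_deg)
    return np.array([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)])

def select_triplet(v_s, directions, triplets):
    """Return the first candidate triplet (index triple into the columns of
    `directions`) whose VBAP gains for v_s are all non-negative, i.e., whose
    spherical triangle contains v_s, together with the normalized gains."""
    for idx in triplets:
        V = directions[:, idx]              # 3x3 matrix for this triplet
        g = np.linalg.solve(V, v_s)
        if np.all(g >= -1e-9):              # tolerate tiny numerical negatives
            return idx, g / np.linalg.norm(g)
    raise ValueError("direction not covered by any candidate triplet")

# A toy grid of four loudspeakers and its candidate triplets (illustrative):
dirs = np.column_stack([unit_vector(-30, 0), unit_vector(30, 0),
                        unit_vector(0, 45), unit_vector(180, 45)])
triplets = [(0, 1, 2), (1, 3, 2), (0, 2, 3)]
idx, g = select_triplet(unit_vector(0, 20), dirs, triplets)  # frontal triplet
```
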
An issue resulting from the utilization of intensity panning in VBAP is the nonuniformity of the spatial spread of the
panned source. More specifically, sources panned closer to the
actual loudspeakers in the reproduction rig have a smaller spatial spread, while virtual sources panned to directions between
loudspeakers have a larger spatial spread. The main cause of this
issue is the use of a single loudspeaker when the virtual source
direction coincides with the direction of that loudspeaker.
This issue was addressed by panning the virtual source to
multiple directions by using three loudspeakers (instead of two)
for all source directions in the horizontal plane, or four loudspeakers (instead of three) in the 3-D case. This approach was
called multiple-direction amplitude panning (MDAP) [44]. In
a study comparing VBAP with MDAP, it was shown that both
provide good subjective localization accuracy, with MDAP
being more accurate than VBAP [45]. In another, more recent evaluation, carried out within the context of the MPEG-H standard, VBAP resulted in very good subjective localization accuracy, including not only the source azimuth but also its distance [46]. In yet another study, VBAP was shown to provide good localization performance also for sources in the median plane [47]. Note that VBAP is a technology for sound-field synthesis; in the context of sound-field recording and reproduction, it is used at the reproduction end of schemes such as DirAC.

FIGURE 3. An arrangement of three loudspeakers and a phantom image panned using VBAP. The vectors used in the formulation of VBAP are also shown.
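The spreading idea behind MDAP can be sketched in a two-channel horizontal case as follows; the spread angle, the choice of three spread directions, and the clamping to the loudspeaker base are all assumptions of this sketch, not the MDAP specification in [44]:

```python
import math

def mdap_gains(source_deg, spread_deg=10.0, base_deg=30.0):
    """Pan the source to several directions around the target (the target
    plus/minus a spread angle), sum the per-direction tangent-law gains,
    and re-normalize the total power. The extra panning directions keep
    more than the minimal number of loudspeakers active for every angle."""
    def tan_pan(deg):
        t = math.tan(math.radians(deg)) / math.tan(math.radians(base_deg))
        return (1.0 + t) / 2.0, (1.0 - t) / 2.0
    # Clamp spread directions to the loudspeaker base (an assumption here).
    dirs = [max(-base_deg, min(base_deg, source_deg + d))
            for d in (-spread_deg, 0.0, spread_deg)]
    gL = sum(tan_pan(d)[0] for d in dirs)
    gR = sum(tan_pan(d)[1] for d in dirs)
    n = math.hypot(gL, gR)
    return gL / n, gR / n

# Unlike plain panning, a source panned exactly to a loudspeaker direction
# still excites the other channel slightly, giving a more uniform spread.
```
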

Spatial encoding methods
A class of multichannel audio methods involves dividing
recorded signals into time or time-frequency bins and estimating certain spatial attributes within each bin. One of these methods is the spatial impulse response rendering (SIRR) method
[48], [49]. At the recording stage, SIRR records the impulse
response of a room using a B-format microphone, i.e., a microphone that provides the omnidirectional sound pressure component as well as the three axial pressure-gradient components of
the sound field [28]. The impulse response is first transformed
into a time-frequency representation and is then processed to
obtain estimates of the acoustic intensity vectors at each time-
frequency bin. It is assumed that each time-frequency bin corresponds to a single plane wave and thus that the direction of
the acoustic intensity vector also represents the direction of that
plane wave. A diffuseness estimate is obtained for each time-
frequency bin using the ratio of the real part of the acoustic
intensity to the total energy. These parameters, along with the
sound pressure component obtained from the B-format recording, form the basis of the reproduction stage.
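The analysis step for a single time-frequency bin can be sketched as follows, with all acoustic constants set to 1 and under the convention that [X, Y, Z] is proportional to the source direction times the pressure; sign and scaling conventions vary between B-format definitions, so both are assumptions of this sketch:

```python
import numpy as np

def sirr_analysis(W, X, Y, Z):
    """Direction and diffuseness estimates from the complex B-format STFT
    values of one time-frequency bin. The intensity-like vector
    Re{conj(W) [X, Y, Z]} points toward the source under the convention
    above; diffuseness is one minus the ratio of its magnitude to the
    total energy density."""
    I = np.real(np.conj(W) * np.array([X, Y, Z]))
    E = 0.5 * (abs(W)**2 + abs(X)**2 + abs(Y)**2 + abs(Z)**2)
    norm_I = np.linalg.norm(I)
    azimuth = np.degrees(np.arctan2(I[1], I[0]))
    elevation = np.degrees(np.arcsin(I[2] / max(norm_I, 1e-12)))
    diffuseness = 1.0 - norm_I / max(E, 1e-12)
    return azimuth, elevation, diffuseness

# A single plane wave from straight ahead: fully directional, so the
# diffuseness estimate is zero.
az, el, psi = sirr_analysis(1.0 + 0j, 1.0 + 0j, 0j, 0j)
```
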
At the reproduction stage, direct and diffuse parts of the
signal are treated differently. For the direct part, the azimuth and elevation estimates in each time-frequency bin are used to pan portions of the B-format omnidirectional component accordingly, using VBAP. The diffuse part is reproduced by
generating multiple decorrelated copies of the recorded sound
played back from all the loudspeakers. The resulting channel impulse responses are then convolved with the desired
anechoic sound sample. A similar method, called the spatial
decomposition method (SDM), was recently proposed in [50].
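The direct/diffuse split at reproduction can be sketched per bin as follows; the square-root energy split and the random-phase decorrelation are common choices used here as assumptions, not the specific processing of [48], [49]:

```python
import numpy as np

def sirr_reproduce_bin(W, psi, pan_gains, rng):
    """Split one omni-component bin W into a direct part, weighted by the
    VBAP-style panning gains, and a diffuse part sent to all loudspeakers
    through random-phase decorrelators. Total energy is preserved when the
    panning gains are power-normalized."""
    n = len(pan_gains)
    direct = np.sqrt(1.0 - psi) * W * np.asarray(pan_gains)
    phases = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, n))
    diffuse = np.sqrt(psi / n) * W * phases   # equal diffuse power per channel
    return direct + diffuse

# With zero diffuseness the bin is panned exactly by the gains:
out = sirr_reproduce_bin(1.0 + 0j, 0.0, [0.7, 0.7, 0.1],
                         np.random.default_rng(0))
```
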



