Signal Processing - November 2017 - 134

relevant, perhaps higher-level processes, still limit their prediction efficacy [3]. Traces of such human response properties exist
and are embedded in human subject data. This suggests that they
might be uncovered by a deep network trained on enough data.

Conventional learning-based picture-quality predictors
The most successful reference picture-quality predictors, including
those deployed by the television industry such as the Emmy-winning structured similarity (SSIM) model [15] and the visual information fidelity (VIF) index [18] (a core element of the
VMAF processing system that quality-controls all Netflix content
encodes), are not learned models but instead compute similarity
or error measures modulated by perceptual criteria. Performance is high because a reference error, whether implicit
or explicit, is available to be analyzed using perceptual models.
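As an illustration of a similarity measure that is computed rather than learned, the following is a minimal sketch of the SSIM index evaluated over a single window. It assumes the commonly used stabilizing constants K1 = 0.01 and K2 = 0.03 and uses unweighted population statistics; the function name and the flat pixel-list input are choices made here for illustration. Practical implementations evaluate the index in local, typically Gaussian-weighted, sliding windows and average the resulting map.

```python
from statistics import fmean


def single_window_ssim(ref, dist, dynamic_range=255.0):
    """SSIM index over one window, given two equal-length lists of pixels."""
    if len(ref) != len(dist) or not ref:
        raise ValueError("windows must be non-empty and the same size")

    # Stabilizing constants (assumed values: K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2

    n = len(ref)
    mu_x, mu_y = fmean(ref), fmean(dist)
    var_x = sum((x - mu_x) ** 2 for x in ref) / n
    var_y = sum((y - mu_y) ** 2 for y in dist) / n
    cov_xy = sum((x - mu_x) * (y - mu_y) for x, y in zip(ref, dist)) / n

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical windows score 1, and the score drops as the distorted window departs from the reference in mean, variance, or structure. No parameter is fitted to subjective data, which is the sense in which such models are "not learned."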
No-reference models operate without the benefit of an implied
error signal, so their design has relied heavily on machine learning. Broadly, these models deploy perceptually relevant, low-level feature extraction mechanisms based on simple, yet highly
regular, parametric models of good-quality pictures. These natural scene statistics (NSS) models are predictably altered by the
presence of distortions [18]. Simply stated, high-quality images
subjected to bandpass filtering, followed by local energy normalization, become substantially decorrelated and Gaussianized,
while distorted images tend not to obey this model (although this
is not always the case on authentically distorted pictures, as demonstrated in [3]). Picture-quality prediction models of this type
have been developed in the wavelet [18], discrete cosine transform, sparse [8] and spatial domains [9], and have been applied
to video signals using natural bandpass space-time video statistics models [19], [20]. The FRIQUEE model [21] achieved state-of-the-art performance on the LIVE Challenge database [3] by
regressing on a "bag" of NSS features drawn from diverse color
spaces and perceptually motivated transform domains.
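The bandpass-then-normalize operation underlying NSS features can be sketched as follows. This is a simplified illustration, not the feature extractor of any cited model: local mean subtraction stands in for the bandpass filter, a square box window replaces the Gaussian weighting window often used in practice, and the function name, `radius`, and the stabilizing constant `c` are assumptions made here.

```python
import math


def local_normalize(image, radius=1, c=1.0):
    """Subtract the local mean and divide by the local standard deviation.

    `image` is a list of equal-length rows of grayscale values. The window
    has side 2 * radius + 1 and is clipped at the image borders.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            window = [
                image[y][x]
                for y in range(max(0, i - radius), min(height, i + radius + 1))
                for x in range(max(0, j - radius), min(width, j + radius + 1))
            ]
            mu = sum(window) / len(window)
            sigma = math.sqrt(sum((v - mu) ** 2 for v in window) / len(window))
            # The constant c keeps the division stable in flat regions.
            out[i][j] = (image[i][j] - mu) / (sigma + c)
    return out
```

NSS-based predictors then fit a parametric distribution to the normalized coefficients and use the fitted parameters as quality-aware features. Plain lists are used here only to keep the sketch dependency-free; an array library would be the usual choice.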

There have also been recent attempts to apply other, earlier
types of deep-learning models to the no-reference picture-quality prediction problem. For example, Hou et al. trained a deep
belief network on wavelet domain NSS features to classify distorted images into five discrete score categories [17], and Li et
al. regressed shearlet NSS features onto subjective scores using
a stacked autoencoder [22]. These models generally used handcrafted feature inputs, were not trained via end-to-end optimization, and achieved less impressive gains in performance.

CNN-based picture-quality prediction
CNN-based no-reference picture-quality models
As mentioned previously, several CNN-based picture-quality
prediction models have attempted to use patch-based labeling to
increase the set of informative (ground-truth) training samples.
Generally, two types of training approaches have been used:
patchwise and imagewise, as depicted in Figure 3. In the former,
each image patch is independently regressed onto its target. In
the latter, the patch features or predicted scores are aggregated or
pooled, then regressed onto a single ground-truth subjective score.
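The difference between the two training approaches lies in where the regression error is measured. The sketch below makes this concrete under stated assumptions: `predict` is a hypothetical stand-in for the shared CNN, squared error is used as the loss, and average pooling of patch scores is one possible aggregation among several.

```python
def patchwise_loss(predict, patches, patch_targets):
    """Mean squared error with one regression target per patch."""
    errors = [(predict(p) - t) ** 2 for p, t in zip(patches, patch_targets)]
    return sum(errors) / len(errors)


def imagewise_loss(predict, patches, image_score):
    """Squared error after pooling patch predictions into one image score."""
    pooled = sum(predict(p) for p in patches) / len(patches)
    return (pooled - image_score) ** 2


def mean_intensity(patch):
    """Toy predictor (assumption): the mean value of a flattened patch."""
    return sum(patch) / len(patch)


patches = [[0.2, 0.4], [0.6, 0.8]]
image_score = 0.5

# Patchwise training with the global score reused as every patch's target.
loss_patchwise = patchwise_loss(mean_intensity, patches, [image_score] * 2)
# Imagewise training compares only the pooled prediction with the target.
loss_imagewise = imagewise_loss(mean_intensity, patches, image_score)
```

In this toy case the patch predictions are about 0.3 and 0.7, so the patchwise loss is about 0.04 while the imagewise loss is essentially zero: imagewise training tolerates patches whose local predictions differ from the global score as long as the pooled result matches it.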
The first application of a spatial CNN model to the picture-quality prediction problem was reported in [23], wherein a
high-dimensional input image was directly fed into a shallow
CNN model without first extracting handcrafted features. To obtain
more data, each input image was subdivided into small patches as a method of data augmentation, each being assigned the
same subjective-quality score during training. Following prior
successful NSS-based models [9], [18], this method applied a
process of local divisive normalization to each input image
and used both maximum (max) and minimum (min) pooling
to reduce the feature maps. Patchwise training was used, and,
during application, the predicted patch scores were averaged
to obtain a single picture-quality score.
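The data handling described above can be sketched as follows. This is an illustrative outline rather than the implementation of [23]: non-overlapping patches, the patch size, global pooling over each feature map, and all function names are assumptions made here, and `predict_patch` is a hypothetical stand-in for the trained CNN.

```python
def split_into_patches(image, size):
    """Cut a 2-D list into non-overlapping size x size patches (edges dropped)."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            patches.append([row[left:left + size] for row in image[top:top + size]])
    return patches


def make_training_pairs(image, score, size):
    """Every patch inherits the subjective score of its parent image."""
    return [(patch, score) for patch in split_into_patches(image, size)]


def min_max_pool(feature_map):
    """Reduce one feature map to its (max, min) pair."""
    values = [v for row in feature_map for v in row]
    return max(values), min(values)


def predict_image_score(predict_patch, image, size):
    """Average the per-patch predictions into one picture-quality score."""
    patches = split_into_patches(image, size)
    return sum(predict_patch(p) for p in patches) / len(patches)
```

A 4 x 6 image split with `size=2` yields six training pairs that all carry the same label, which is how patch subdivision multiplies the number of training samples without collecting additional subjective scores.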

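[Figure 3 diagram: a distorted image is divided into image patches, and each patch is fed to a shared deep model (CNN). Patchwise training: each patch output is regressed onto a local training target, either the global subjective score or a proxy local score. Imagewise training: the patch outputs pass through aggregation/pooling and are regressed onto a single global training target.]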

FIGURE 3. Patchwise and imagewise strategies used to train patch-based picture-quality prediction models. First, an input image is partitioned into
patches; then, each is fed into the same CNN model. In patchwise training, a proxy local score or global subjective score is used as a training target for
each input patch. In imagewise training, extracted features or scores are aggregated, then regressed onto a single, global subjective score.
