Signal Processing - November 2017 - 133

preprocessed pixel values) can be fed into the CNN, and, over
many iterations or epochs of training on a large data set, useful
image representations are learned automatically. In the early layers of a deep CNN, low-level encoding or sparsifying features are
learned, possibly followed by intermediate descriptors of feature
correlations [7]. In the deeper layers, the learned features contain
more abstract information that can capture relationships between
image distortions and human perceptions of them. In a CNN, differentiable feature aggregation or pooling stages are interspersed
with feature extraction and regression stages, enabling effective
end-to-end optimization. However, despite significant successes
on a wide array of other image analysis problems, the application
of deep learning networks to the picture-quality prediction problem has been hindered by a significant obstacle: the
lack of an adequate amount of perceptual training data, including
accurate local ground-truth scores.
The performance of deep-learning models generally depends
heavily on the size of the available training data set(s). Currently
available legacy, public-domain, subjective picture-quality databases such as LIVE IQA [12] and TID2013 [13] are far too small
to effectively train deep learning models. For example, the LIVE
IQA and TID2013 databases each contain fewer than 30 unique
image contents and no more than 24 different types of distortions
per image, all of which are synthetic. [That is, they are applied to pristine
images by a database designer. Algorithm-generated distortions
such as Gaussian blur (GB), noise, mean shifts, and so on, contained in these databases are poor models of picture impairments
that actually arise in consumer digital photographs. Even JPEG/JPEG2000-coded
images are created using much more liberal
amounts and spreads of compression (to create perceptual separations) than those produced by real image capture devices.] Even the
recent LIVE "In the Wild" Challenge Database (hereafter, LIVE
Challenge) [3], the largest available resource in most dimensions
(with nearly 1,200 unique pictures, each afflicted by a unique,
unknown combination of highly diverse authentic distortions and
judged in more than 350,000 opinion scores from over 8,100 unique human subjects) is of insufficient size, although it provides an excellent challenge for any no-reference model. By comparison, image recognition data sets such
as ImageNet [14] contain tens of millions of labeled images. Creating larger subjective quality data sets is a formidable problem. Controlled laboratory studies like [12] and [13] are out of the question,
and even the crowdsourced study in [3] exhausted the pool of high-quality human subjects available on Amazon Mechanical Turk.
Obtaining adequate quantities of reliable human subjective
labels remains a very difficult problem. Unlike the binary (yes/no)
confirmations of automatically generated labels that are delivered
by online human subjects, as used in the construction of object
recognition data sets like ImageNet [2], each of which might be
generated in a second or less, collecting human quality judgments
is a complex, time-consuming psychometric task that is as much
about assessing each subject's response as it is about labeling the quality of the images. The human subjects determine
an internal judgment of the overall quality of each image after
holistically scrutinizing it, then record each of their judgments on
a continuous, sliding subjective-quality scale, while consciously
discounting factors such as image content or photographic aesthetics. This highly engaging task requires dozens or even hundreds of human quality raters to spend 5-10 s on each image.
Each subject's overall session is time-limited, to avoid reductions
in attention and performance arising from vision fatigue.
A common strategy for attacking this paucity of labeled images
is data augmentation, which seeks to multiply the
effective volume of image data via rotations, cropping, reflections, and so on. Unfortunately, with the likely exception of
horizontal reflections, which we use later, applying these kinds
of transformations to an image will, in general, significantly
change its perceived quality. While generating a large amount
of picture content is simple, ensuring adequate distortion diversity and realism is much harder.
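As a minimal sketch of the one augmentation singled out above as likely to preserve quality, the following NumPy snippet mirrors each image left to right and reuses its subjective score. The function name, the array layout, and the toy scores are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def augment_with_horizontal_flip(images, scores):
    """Double a labeled set by mirroring each image left to right.

    images: array of shape (N, H, W) or (N, H, W, C)
    scores: array of shape (N,), one subjective quality score per image
    Assumption: a mirrored image keeps the score of its source image.
    """
    images = np.asarray(images)
    scores = np.asarray(scores)
    flipped = images[:, :, ::-1]  # reverse the width axis only
    return (np.concatenate([images, flipped], axis=0),
            np.concatenate([scores, scores], axis=0))

# Toy data: two 2x3 grayscale "images" with made-up quality scores.
toy_images = np.arange(12, dtype=np.float32).reshape(2, 2, 3)
toy_scores = np.array([3.5, 7.0])
aug_images, aug_scores = augment_with_horizontal_flip(toy_images, toy_scores)
```

Rotations or crops are deliberately left out of this sketch, since the text argues that they generally change perceived quality and so cannot safely inherit the original score.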
In another common strategy, the images used for training are
divided into many small patches. However, this approach produces another problem: distinct local ground-truth subjective
labels are not available for each of the patches. In every experimental scenario to date, human subjects supply a single scalar
subjective score on each global image. Since images, distortions
of images, and human perceptions of both are all highly nonstationary, the scores that subjects would apply to a local image
patch will generally differ greatly from those applied to the
entire image. Obtaining human judgments of local image patch
quality is not practical, as it would greatly increase the overhead
of acquiring human scores.
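The labeling problem described above can be made concrete with a short NumPy sketch. It cuts one image into non-overlapping patches and, lacking local scores, assigns every patch the single global score as a proxy label. The function name, the patch size, the stride, and the toy score are hypothetical choices, not values taken from the article.

```python
import numpy as np

def extract_patches(image, global_score, patch_size=32, stride=32):
    """Cut an image into square patches that all inherit the global score."""
    height, width = image.shape[:2]
    patches, labels = [], []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patches.append(image[top:top + patch_size, left:left + patch_size])
            labels.append(global_score)  # proxy label, not a true local score
    return np.stack(patches), np.array(labels)

# Toy example: a random 64x96 grayscale image with a made-up global score.
rng = np.random.default_rng(0)
toy_image = rng.random((64, 96))
patches, labels = extract_patches(toy_image, global_score=4.2)
```

Every patch here carries the same label regardless of its local content or distortion, which is exactly the mismatch the text warns about when quality is nonstationary across the image.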
One way to try to overcome the lack of an adequate training data set is to utilize unsupervised learning, e.g., by training a
restricted Boltzmann machine or an autoencoder [4] with convolutional layers. With an unsupervised model, it is possible to train
deep NN models on very large data sets having no ground-truth
labels. However, picture-quality prediction is a subtle problem
that involves modeling detailed interactions between distortion
and content. Conversely, unsupervised models that are designed
to work well on tasks such as image recognition may succeed in
part by learning to promote gross shape-related features, while
suppressing small variations. For example, a denoising autoencoder can be trained to reconstruct an original image from a noisy
one by enforcing robustness against small corruptions of the input
data or adding a regularization term to the objective function. By
contrast, the representations learned by a picture-quality predictor must be particularly sensitive to local and global degrees of
distortion as well as perceived interactions between content and
distortion. Successful, generalizable, deep unsupervised picture-quality prediction models have not yet been reported.
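To illustrate the denoising-autoencoder idea mentioned above, here is a toy sketch in NumPy: a single hidden layer is trained to map noise-corrupted inputs back to their clean versions, which pushes the learned representation to ignore small corruptions. The data, layer sizes, noise level, and learning rate are all arbitrary assumptions, and the fully connected model stands in for the convolutional architectures discussed in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for clean image patches: 256 vectors of dimension 16
# that lie in a 4-D subspace (all sizes here are arbitrary assumptions).
basis = rng.normal(size=(4, 16))
clean = rng.normal(size=(256, 4)) @ basis / 2.0

n_hidden, learning_rate, noise_std = 8, 0.02, 0.3
W_enc = rng.normal(scale=0.1, size=(16, n_hidden))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_hidden, 16))  # decoder weights

losses = []
for step in range(500):
    # Corrupt the input with fresh Gaussian noise on every step.
    noisy = clean + rng.normal(scale=noise_std, size=clean.shape)
    hidden = np.tanh(noisy @ W_enc)
    output = hidden @ W_dec
    # The target is the clean signal, not the corrupted input.
    error = output - clean
    losses.append(float(np.mean(error ** 2)))
    # Backpropagate the mean squared error through both layers.
    grad_output = 2.0 * error / error.size
    grad_W_dec = hidden.T @ grad_output
    grad_hidden = (grad_output @ W_dec.T) * (1.0 - hidden ** 2)
    grad_W_enc = noisy.T @ grad_hidden
    W_dec -= learning_rate * grad_W_dec
    W_enc -= learning_rate * grad_W_enc

print(f"reconstruction MSE: first step {losses[0]:.3f}, last step {losses[-1]:.3f}")
```

Because the objective rewards mapping many slightly different noisy inputs to the same clean output, the hidden code becomes insensitive to small perturbations. That is useful for recognition-style tasks but runs against the sensitivity to distortion that a picture-quality predictor needs.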
The need for large-scale subjective picture-quality data is
underlined by the fact that the perception of picture distortions
engages multiple complex processes along the visual pathway,
including bandpass, multiscale, and directional decompositions
[6]; local nonlinearities; and normalization mechanisms. For
example, contrast masking [15], whereby the spatially localized
energy of image content can reduce or eliminate the visibility
of distortions, is well explained by a local cortical divisive normalization model [16]. Successful reference and no-reference
picture-quality models [9], [10], [15], [17] approximate these perceptual mechanisms by various models. However, errors in these
approximations, along with a lack of information describing other

IEEE SIGNAL PROCESSING MAGAZINE
