Signal Processing - November 2017 - 135

Li et al. utilized a deep CNN model that was pretrained on
the ImageNet data set [24]. A network-in-network (NiN) structure was used to enhance the abstraction ability of the model.
The final layer of the pretrained model was replaced by regression layers, which mapped the learned features onto subjective
scores. As in [23], image patches were regressed onto identical
subjective-quality scores during training.
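The patch-labeling scheme described above can be sketched in a few lines. This is only an illustration, not the implementation used in [23] or [24]: the function name, the patch size of 32, and the nonoverlapping stride are all assumptions made for the example.

```python
import numpy as np

def extract_labeled_patches(image, global_score, patch_size=32, stride=32):
    """Cut an H x W x C image into patches; every patch inherits the image-level score.

    patch_size and stride are illustrative values, not taken from the cited papers.
    """
    height, width = image.shape[:2]
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    patches = np.stack(patches)
    # The same global subjective score is copied to every local patch.
    labels = np.full(len(patches), global_score, dtype=np.float32)
    return patches, labels

# Synthetic stand-in for a distorted image and its subjective score.
rng = np.random.default_rng(0)
image = rng.random((64, 96, 3)).astype(np.float32)
patches, labels = extract_labeled_patches(image, global_score=3.7)
print(patches.shape, labels.shape)
```

Every patch receives the same target regardless of its content, which is exactly the property questioned in the discussion that follows.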
The labeling of local patches with global subjective-quality
scores during training may be problematic. While the reported
prediction accuracy of this model was competitive with that of
handcrafted feature-based quality prediction models, it is not
reasonable to expect local image quality to closely agree with
global subjective scores, even when synthetic distortions are
applied homogeneously. Picture quality is inevitably space-varying because of the high degree of nonstationarity of picture contents and the complex perceptual interactions that occur between
content and distortions (such as masking). A variety of training
strategies have been studied as solutions to this problem.
Bosse et al. deployed a deeper, 12-layer CNN model fed
only by raw RGB image patches to learn a no-reference picture-quality model [25]. They proposed two training strategies:
patchwise training (similar to [23]) and weighted average patch
aggregation, whereby the relative importance of each patch was
weighted by training on a subnetwork. The overall loss function
was optimized in an end-to-end manner. The authors reported
state-of-the-art prediction accuracies on the major synthetic-distortion picture-quality databases.
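Weighted average patch aggregation can be illustrated as follows. The text states only that patch weights are produced by a subnetwork; the normalized weighted mean, the clamping of weights to positive values, and the names used here are assumptions for the sake of a runnable sketch.

```python
import numpy as np

def weighted_patch_score(patch_scores, patch_weights, eps=1e-6):
    """Pool per-patch quality estimates into one image score.

    patch_scores and patch_weights stand in for the two network outputs;
    eps keeps the denominator nonzero (an assumption, not from [25]).
    """
    scores = np.asarray(patch_scores, dtype=np.float64)
    # Clamp raw weights at zero and add eps so the normalization is well defined.
    weights = np.maximum(np.asarray(patch_weights, dtype=np.float64), 0.0) + eps
    return float(np.sum(weights * scores) / np.sum(weights))

# Hypothetical outputs for three patches of one image.
score = weighted_patch_score([2.0, 4.0, 3.0], [1.0, 3.0, 0.0])
print(score)
```

With equal weights this reduces to plain averaging of patch scores, so the learned weights only matter where patches differ in perceptual importance.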
To overcome overfitting problems that can arise from a lack of
adequate local ground-truth scores, several authors have suggested training deep CNN models in two separate stages: a pretraining stage, using a large number of algorithm-generated proxy
ground-truth quality scores, followed by a stage of regression
onto a smaller set of subjective scores. For example, [26] describes
a two-stage CNN-based no-reference-quality prediction model,
whereby local quality scores generated by a full-reference algorithm are used as proxy patch labels in the first stage of training. In the second stage, the feature vectors obtained from image
patches are aggregated using statistical moments, then regressed
onto subjective scores. In this instance, the first stage is patchwise training, while the second stage is imagewise training. Since
the local proxy scores reflect the nonstationary characteristics of
perceived quality, they are reasonable local regression targets,
and training of the CNN model is enabled by the abundant training samples. Following the second stage of training on human
ground-truth, their model attains highly competitive prediction
accuracy on the legacy data sets.
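The second-stage aggregation step can be sketched as below. The text says only that patch feature vectors are aggregated "using statistical moments"; the choice of mean and standard deviation, and the function name, are assumptions made for illustration.

```python
import numpy as np

def aggregate_by_moments(patch_features):
    """Turn an (n_patches, n_features) array into one image-level vector.

    Uses the per-feature mean and (population) standard deviation;
    which moments [26] actually uses is not stated in this text.
    """
    feats = np.asarray(patch_features, dtype=np.float64)
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])

# Two hypothetical patches, each described by two learned features.
patch_features = np.array([[1.0, 2.0], [3.0, 6.0]])
image_vector = aggregate_by_moments(patch_features)
print(image_vector)  # [2. 4. 1. 2.]
```

The resulting fixed-length vector is what would be regressed onto the subjective score of the whole image, independent of how many patches the image yields.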
The same authors later developed a two-stage training
scheme for no-reference picture-quality prediction called the
deep image quality assessor (DIQA) [27]. The training process of that model was separated into an objective training
stage followed by a subjective training stage. Rather than using
a sophisticated picture-quality predictor to produce proxy
scores, they computed the peak signal-to-noise ratio (PSNR). Using
only convolutional layers, feature maps were obtained, which
were then regressed onto objective error maps. The second
stage aggregated the feature maps by weighted averaging, then
regressed these global features onto ground-truth subjective
scores. The weighting maps were also learned during training.
The reported prediction accuracy of these models is competitive with state-of-the-art models on the legacy databases.
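For readers unfamiliar with the objective quantities mentioned above, the following sketch computes a pixelwise error map and the PSNR derived from it. The PSNR formula is standard; using the squared difference as the error map, and the function names, are assumptions rather than details taken from [27].

```python
import numpy as np

def squared_error_map(reference, distorted):
    """Pixelwise squared difference (assumed form of an objective error map)."""
    ref = np.asarray(reference, dtype=np.float64)
    dist = np.asarray(distorted, dtype=np.float64)
    return (ref - dist) ** 2

def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in decibels for a given peak pixel value."""
    mse = squared_error_map(reference, distorted).mean()
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))

# Synthetic example: a constant offset of 5 gray levels gives an MSE of 25.
reference = np.zeros((4, 4))
distorted = np.full((4, 4), 5.0)
print(psnr(reference, distorted))
```

The error map keeps the spatial layout of the distortion, which is what makes it usable as a dense regression target, whereas PSNR collapses it to a single number.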

CNN-based full-reference picture-quality models
While CNNs were first used to model no-reference picture
quality, more recently, they have been applied to the full-reference
prediction problem as well.
Liang et al. [28] proposed a dual-path CNN-based full-reference-quality prediction model. They generalized the problem by
seeking to predict quality using a nonaligned image of a similar
scene as a reference. Locally normalized distorted and reference
image patches are fed into a dual-path CNN model, each using
the same parameter values. Then the concatenated learned feature vectors are regressed onto the subjective scores of source distorted images. They report state-of-the-art prediction accuracies
in both aligned and nonaligned full-reference scenarios.
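The idea of two paths sharing the same parameter values can be illustrated with a toy example. A single linear layer with a ReLU stands in for each CNN path here; that simplification, the layer sizes, and all names are assumptions and do not describe the architecture in [28].

```python
import numpy as np

def shared_path(patch, weights, bias):
    """One path of the model: a linear map plus ReLU, standing in for a CNN."""
    return np.maximum(weights @ patch.ravel() + bias, 0.0)

def dual_path_features(distorted_patch, reference_patch, weights, bias):
    """Run both patches through the SAME parameters and concatenate the results."""
    f_dist = shared_path(distorted_patch, weights, bias)
    f_ref = shared_path(reference_patch, weights, bias)
    return np.concatenate([f_dist, f_ref])

# Hypothetical 8 x 8 patches and a 16-unit shared layer.
rng = np.random.default_rng(0)
weights = rng.normal(size=(16, 64))
bias = np.zeros(16)
reference_patch = rng.random((8, 8))
distorted_patch = reference_patch + 0.1 * rng.normal(size=(8, 8))
features = dual_path_features(distorted_patch, reference_patch, weights, bias)
print(features.shape)
```

Because the parameters are shared, identical inputs always produce identical halves of the concatenated vector, so any difference between the halves reflects a difference between the two patches.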
Gao et al. deployed a deep CNN model pretrained on ImageNet. They used it to conduct full-reference picture-quality
prediction [29] by feeding pairs of reference and distorted pictures into the CNN, where each output layer is used as a feature
map. Local similarities between the feature maps obtained
from the reference and distorted images are then computed
and pooled to arrive at global picture-quality scores. The CNN
model was not fine-tuned on any picture-quality database.
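Comparing feature maps and pooling the result might look like the sketch below. The text does not specify the similarity measure used in [29]; the SSIM-style ratio, the stabilizing constant `c`, and mean pooling over positions and layers are assumptions chosen for illustration.

```python
import numpy as np

def layer_similarity(ref_map, dist_map, c=1e-6):
    """Elementwise similarity between two feature maps of the same shape.

    Equals 1 where the maps agree and drops toward 0 as they diverge
    (for nonnegative features, e.g. after a ReLU).
    """
    ref = np.asarray(ref_map, dtype=np.float64)
    dist = np.asarray(dist_map, dtype=np.float64)
    return (2.0 * ref * dist + c) / (ref ** 2 + dist ** 2 + c)

def pooled_quality(ref_layers, dist_layers):
    """Average the local similarities within each layer, then across layers."""
    per_layer = [layer_similarity(r, d).mean() for r, d in zip(ref_layers, dist_layers)]
    return float(np.mean(per_layer))

# Synthetic nonnegative feature maps standing in for two network layers.
rng = np.random.default_rng(0)
ref_layers = [rng.random((4, 8, 8)), rng.random((8, 4, 4))]
dist_layers = [np.clip(m + 0.3 * rng.random(m.shape), 0.0, None) for m in ref_layers]
print(pooled_quality(ref_layers, ref_layers), pooled_quality(ref_layers, dist_layers))
```

No parameter in this computation is learned from quality scores, which mirrors the point that the network was not fine-tuned on any picture-quality database.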
The deep CNN-based full-reference-quality prediction
model in [30], called DeepQA, was trained to learn a visual
sensitivity weight at each coordinate using measured local spatial characteristics of the distorted image. DeepQA accepts the
distorted image and an objective error map (e.g., mean squared
error) as inputs. The learned weight map is then used as a multiplier on the objective error map. The authors reported consistent state-of-the-art prediction accuracies as compared to other
full-reference-quality models, on the synthetic-distortion legacy
picture-quality databases.
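Using a weight map as a multiplier on an error map can be sketched as follows. In DeepQA the weights are predicted by the network; here they are supplied by hand, and pooling the weighted map by its mean is an assumption rather than a detail given in this text.

```python
import numpy as np

def weighted_error_score(error_map, sensitivity_map):
    """Scale each local error by its visual sensitivity weight, then pool.

    sensitivity_map is a hand-made stand-in for the learned weight map.
    """
    errors = np.asarray(error_map, dtype=np.float64)
    weights = np.asarray(sensitivity_map, dtype=np.float64)
    perceptual_error = weights * errors  # elementwise multiplier
    return float(perceptual_error.mean())

# Hypothetical 2 x 2 error map and sensitivity weights.
error_map = np.array([[1.0, 4.0], [0.0, 2.0]])
sensitivity_map = np.array([[0.5, 0.25], [1.0, 0.0]])
print(weighted_error_score(error_map, sensitivity_map))  # 0.375
```

A weight of zero removes an error from the pooled value entirely, which is how such a map can express masking: a large numerical error in a region where it is not visible contributes nothing.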

Summary of CNN-based picture-quality models
Table 1 compares the implementations of reported CNN-based
no-reference [23]-[27] and full-reference [28]-[30] picture-quality models. For full-reference models, the strategies used to compare distorted and reference features are summarized in the last
column. In [28] and [30], this merely amounts to supplying both
to the network. Generally, the reviewed models were designed
to overcome the lack of training data, which is the most important issue that needs to be resolved to employ deep CNN models
successfully. Most of the models used some type of patch-based
training to increase the training data volume. Several of the models used proxy ground-truth scores generated by objective-quality
prediction models to augment the subjective scores or, alternatively, to pretrain the network on a large amount of easily generated
proxy data before fine-tuning on subjective scores. Since we have
found no serious attempts to use unsupervised deep models, we
make no comparisons of this type, although the success of the
very simple model [31] suggests this is an interesting research
direction. Finding ways to embody models of perception into
