Authors:
(1) Joshua P. Ebenezer, Student Member, IEEE, Laboratory for Image and Video Engineering, The University of Texas at Austin, Austin, TX, 78712, USA, contributed equally to this work (e-mail: [email protected]);
(2) Zaixi Shang, Student Member, IEEE, Laboratory for Image and Video Engineering, The University of Texas at Austin, Austin, TX, 78712, USA, contributed equally to this work;
(3) Yixu Chen, Amazon Prime Video;
(4) Yongjun Wu, Amazon Prime Video;
(5) Hai Wei, Amazon Prime Video;
(6) Sriram Sethuraman, Amazon Prime Video;
(7) Alan C. Bovik, Fellow, IEEE, Laboratory for Image and Video Engineering, The University of Texas at Austin, Austin, TX, 78712, USA.
A. Internal Correlation
The internal correlations for the three groups were calculated as follows. Let $u_{ijt}$ denote the score given by subject $i$ for video $j$ on TV $t$. The Z score is computed as

$$z_{ijt} = \frac{u_{ijt} - \mu_{it}}{\sigma_{it}}, \tag{1}$$

where $\mu_{it}$ and $\sigma_{it}$ are the mean and standard deviation of the scores given by subject $i$ on TV $t$.
The subjects were randomly divided into two equal groups, and the average Z score was computed for each video across all subjects in each group. The correlation between the scores provided by the two groups over all the videos was computed over 100 trials with different random groupings. The median inter-subject correlation was found to be 0.95 for viewers of TV1, 0.94 for TV2, and 0.93 for TV3. These data indicate a high degree of internal consistency and reliability.
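The normalization and split-half procedure above can be sketched in Python. This is a minimal illustration, not the authors' code: the array layout (subjects × videos) and the function names are assumptions.

```python
import numpy as np

def z_scores(u):
    """Per-subject Z-scores. u: (num_subjects, num_videos) raw scores on one TV."""
    mu = u.mean(axis=1, keepdims=True)
    sigma = u.std(axis=1, ddof=1, keepdims=True)
    return (u - mu) / sigma

def split_half_correlation(u, num_trials=100, seed=0):
    """Median Pearson correlation between the mean Z-scores of two random
    equal-sized subject groups, over repeated random groupings."""
    rng = np.random.default_rng(seed)
    z = z_scores(u)
    n = u.shape[0]
    corrs = []
    for _ in range(num_trials):
        perm = rng.permutation(n)
        g1, g2 = perm[: n // 2], perm[n // 2:]
        m1, m2 = z[g1].mean(axis=0), z[g2].mean(axis=0)
        corrs.append(np.corrcoef(m1, m2)[0, 1])
    return float(np.median(corrs))
```

A high value of this statistic (as the 0.93–0.95 reported above) indicates that two independent panels of viewers would rank the videos nearly identically.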
B. Calculation of MOS
The Mean Opinion Scores (MOS) were obtained using the Maximum Likelihood Estimation method proposed in ITU-T Rec. P.910 [30]. Each opinion score is modelled as a random variable

$$u_{ijn} = \Psi_{jn} + \Delta_i + \nu_i X, \tag{2}$$
where $\Psi_{jn}$ is the true quality of video $j$ viewed on TV$n$, $\Delta_i$ is the bias of subject $i$, $\nu_i$ represents the inconsistency of subject $i$, and $X \sim N(0, 1)$ are i.i.d. Gaussian random variables. Given the scores $u_{ijn}$, the true score of each video on each television is estimated by treating $\Psi_{jn}$, $\Delta_i$, and $\nu_i$ as free parameters that are solved for so that the model in (2) best fits the observed scores. Specifically, $\Psi_{jn}$ is estimated by maximizing the log-likelihood of the observations using a Newton-Raphson solver.
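The maximum-likelihood recovery of Ψ, Δ, and ν can be sketched as follows. This is an illustrative formulation only: it uses a generic quasi-Newton optimizer in place of the Newton-Raphson solver referenced above, and all function and variable names are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def fit_mos_mle(u):
    """u: (num_subjects, num_videos) raw scores on one TV.
    Returns (psi, delta, nu): true quality, subject bias, subject inconsistency."""
    S, V = u.shape

    def unpack(theta):
        psi = theta[:V]
        delta = theta[V:V + S]
        nu = np.exp(theta[V + S:])      # optimize log(nu) to keep nu positive
        return psi, delta, nu

    def nll(theta):
        psi, delta, nu = unpack(theta)
        resid = u - psi[None, :] - delta[:, None]
        # Negative log-likelihood of the model u_ij = psi_j + delta_i + nu_i * X
        return np.sum(np.log(nu)[:, None] + resid**2 / (2 * nu[:, None]**2))

    theta0 = np.concatenate([u.mean(axis=0), np.zeros(S), np.zeros(S)])
    bounds = [(None, None)] * (V + S) + [(-5.0, 5.0)] * S
    res = minimize(nll, theta0, method="L-BFGS-B", bounds=bounds)
    psi, delta, nu = unpack(res.x)
    # Identifiability constraint: subject biases sum to zero (absorbed into psi)
    psi = psi + delta.mean()
    delta = delta - delta.mean()
    return psi, delta, nu
```

The zero-mean constraint on the biases is needed because a constant can be traded freely between Ψ and Δ without changing the likelihood.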
The Differential Mean Opinion Scores (DMOS) were calculated between the distorted and reference videos by taking differences of MOS as follows:

$$\mathrm{DMOS}_{jn} = \mathrm{MOS}_{j_0 n} - \mathrm{MOS}_{jn}, \tag{3}$$
where j0 is the index of the reference video corresponding to video j viewed on television TVn. The MOS and DMOS are thus computed for each video and separately for each television.
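The DMOS computation reduces to a per-video difference against the corresponding reference. A small sketch follows; the sign convention used here (larger DMOS indicating greater impairment) is an assumption, and the names are illustrative.

```python
import numpy as np

def dmos(mos, ref_index):
    """Compute DMOS for one television.
    mos: per-video MOS values; ref_index[j]: index j0 of video j's pristine reference.
    Assumed convention: DMOS_j = MOS_{j0} - MOS_j, so a reference video gets DMOS = 0
    and heavier distortions yield larger DMOS."""
    mos = np.asarray(mos, dtype=float)
    return np.array([mos[ref_index[j]] - mos[j] for j in range(len(mos))])
```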
C. Analysis of Scores
The MOS of 20 videos of 4 contents (1 content from each source group and 5 videos per content) are plotted against bitrate in Figs. 2, 3, 4, and 5. The Forge video is from group 1, EPL is from group 2, ColorDJ is from group 3, and Porsche is from group 4. Screenshots from the SDR versions of the source contents are shown in the first column. The points on each line correspond to 540p, 720p, 1080p, 1440p, 2160p, and 2160p in increasing order, following Table I.
The Forge video shows a sword being forged in a smithy. The light from the hot iron contrasts strongly with the darkness and shadows around it and in the background. The HDR version of the Forge video was rated higher than the SDR version on TV1, since details are clearer in the HDR version. However, on TV2 and TV3, which have a reduced ability to display contrast, the heavier compression of the HDR video offset the relative quality gain from the increased contrast, and the SDR version was therefore rated better. The HDR version also appears darker and dimmer than the SDR version, both because of the way HDR is displayed differently from SDR (as discussed in the Introduction) and because of the way the video was graded; this lower average brightness may further reduce the perceived quality of the HDR version. Together, the lower average brightness of the HDR version, the reduced contrast of TV2 and TV3, and the high spatial complexity of the scene may explain why the SDR version was still rated higher even at higher bitrates. Similar observations can be made about the EPL video from group 2, where the HDR version was rated better on TV1, while the SDR version was rated better at lower bitrates on TV2 and TV3. At higher bitrates the HDR version was rated higher, as compressive artifacts had less visual impact despite the high temporal complexity of the soccer scene.
The SDR version of the ColorDJ video suffers from over-saturation and over-exposure due to the wide range of colors and brightnesses present. The HDR version, on the other hand, does not exhibit over-saturation or over-exposure, because of its greater bit-depth and wider color gamut. This may explain why the HDR version was generally rated better on all three televisions.
The Porsche video features bright red paint on the car, which is accurately represented in HDR but looks over-saturated in SDR. The video is also not spatially or temporally complex, which may explain why the HDR version was rated higher: the superior ability of HDR to represent contrasts and bright colors outweighs the effects of compression. However, there is a sharp drop in the MOS vs. bitrate curve at the 720p HDR version of the video, encoded at 1515 kbps. The encoder chose to encode the 1080p version at only 1300 kbps, yet the 1080p version is still rated higher than the 720p version. This is likely because rescaling artifacts are more prominent than compressive artifacts on this content, due to its low spatial complexity.
The average difference between the MOS of the HDR and SDR versions of all the videos is plotted against the maxrate of each television in Fig. 6. As may be seen, at lower bitrates SDR was rated higher than HDR, while at higher bitrates the difference became positive. The drops in the curve for 1080p content at a maxrate of 3000 kbps and for 2160p content at a maxrate of 6000 kbps suggest that, in an optimal bitrate ladder, these resolutions should be encoded at higher bitrates. It is important to note that these compression levels were included in the study to span a wide range of quality, not to represent an optimal bitrate ladder. On TV1, HDR was rated better than SDR even at low bitrates (3000 kbps), and the difference in quality increased to 5 MOS units at the highest bitrate. On TV2 and TV3, SDR quality was better than HDR quality at low bitrates, but the difference decreased and became slightly positive at higher bitrates. This again indicates how strongly the capabilities of the display devices influence the quality of HDR relative to that of SDR content.
D. Combining Databases
The scores given to the 27 anchor videos from the LIVE HDR, LIVE AQ HDR, and APV HDR Sports VQA datasets were used to map scores from those databases to the current database. A logistic function was fitted to map the scores assigned to the anchor videos in those databases to the scores of the same videos in the LIVE HDRvsSDR database for each television. The logistic function is

$$f(x) = b + \frac{a - b}{1 + \exp\left(-(x - c)/s\right)}, \tag{4}$$
where x are the scores of the videos in prior databases, f(x) is the mapping to scores in the LIVE HDRvsSDR database for a particular television, and a, b, c and s were separately solved for on each database using the anchor videos for that database. Since each television and each (prior) database was used to generate a different fitting function, a total of nine functions were derived from the data. The functions that map scores from the LIVE HDR database and the LIVE AQ HDR database to the scores obtained from the three televisions used in the LIVE HDRvsSDR database are plotted in Figs. 12 and 13 in supplementary material, respectively. The mappings and results for the APV HDR Sports database cannot be shown, for proprietary reasons.
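Fitting such a mapping can be sketched with a standard nonlinear least-squares routine. The four-parameter logistic form below is one common parameterization consistent with parameters a, b, c, and s; it and the function names are assumptions, not the authors' exact implementation.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, a, b, c, s):
    """Four-parameter logistic mapping from a prior database's scale to this one."""
    x = np.asarray(x, dtype=float)
    return b + (a - b) / (1 + np.exp(-(x - c) / s))

def fit_mapping(anchor_prior, anchor_current):
    """Solve for (a, b, c, s) so that logistic(anchor_prior) ~= anchor_current,
    using the anchor videos shared between the two databases."""
    anchor_prior = np.asarray(anchor_prior, dtype=float)
    anchor_current = np.asarray(anchor_current, dtype=float)
    p0 = [anchor_current.max(), anchor_current.min(),
          anchor_prior.mean(), anchor_prior.std() + 1e-6]
    popt, _ = curve_fit(logistic, anchor_prior, anchor_current,
                        p0=p0, maxfev=20000)
    return popt
```

One such function would be fitted per (prior database, television) pair, yielding the nine mappings described above.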
Deriving these functions from the scores of the anchor videos enables the merging of the three LIVE databases into a single, large-scale database of 1066 videos that were collected in a controlled laboratory environment.
This paper is available on arxiv under CC 4.0 license.