
HDR or SDR? A Study of Scaled and Compressed Videos: Details of Subjective Study


Too Long; Didn't Read

While conventional expectations are that HDR quality is better than SDR quality, this paper finds that viewers' preference depends heavily on the display device.

Authors:

(1) Joshua P. Ebenezer, Student Member, IEEE, Laboratory for Image and Video Engineering, The University of Texas at Austin, Austin, TX, 78712, USA, contributed equally to this work (e-mail: [email protected]);

(2) Zaixi Shang, Student Member, IEEE, Laboratory for Image and Video Engineering, The University of Texas at Austin, Austin, TX, 78712, USA, contributed equally to this work;

(3) Yixu Chen, Amazon Prime Video;

(4) Yongjun Wu, Amazon Prime Video;

(5) Hai Wei, Amazon Prime Video;

(6) Sriram Sethuraman, Amazon Prime Video;

(7) Alan C. Bovik, Fellow, IEEE, Laboratory for Image and Video Engineering, The University of Texas at Austin, Austin, TX, 78712, USA.

III. DETAILS OF SUBJECTIVE STUDY

The study was conducted on 356 videos shown to 67 subjects. The videos were generated from a set of 25 unique, pristine HDR contents that were also converted to SDR, with both versions subjected to combinations of downscaling and HEVC compression. Three different television technologies were used to conduct the study.


A. Source Sequences

The 25 source sequences can be divided into 4 groups: Video on Demand (VoD), Live Sports, HDR Demo Videos, and SJTU videos. Three of the source sequences were “anchor” sequences taken from prior HDR VQA databases (LIVE HDR, LIVE AQ HDR, and APV HDR Sports) so that their data could be calibrated against and combined with the present database. All of the videos are represented in the BT.2020 [26] color gamut and were quantized using the SMPTE ST 2084 [27] Opto-Electronic Transfer Function (OETF), also known as the Perceptual Quantizer (PQ). All of the source sequences have durations between 7 and 10 seconds, with static metadata conforming to the HDR10 standard. The video categories are described in detail as follows:


1) VoD: These 7 videos were professionally captured and graded for VoD streaming services. The HDR and SDR versions of these videos were prepared by Amazon Studios. One of the contents used in this category is an “anchor” video from the LIVE AQ HDR database. The SDR versions were created and manually graded with creative intent by professional graders.


2) Live Sports: The 4 Live Sports videos were captured professionally by broadcasters at stadiums hosting live soccer and tennis matches. One content from this category is an “anchor” video from the APV HDR Sports database. Due to the low-latency requirements of live broadcasts, these videos were graded using preset Lookup Tables (LUTs) for both the HDR and SDR formats. The LUT used for the HDR grading is a proprietary Amazon LUT, while the LUT used for the HDR-to-SDR conversion is the open-source NBC LUT. All of the videos were originally in YUV422 10-bit format and were converted to the limited-range YUV420 10-bit format.


3) HDR Demo Videos: These are a set of 8 open-source videos collected from 4kmedia.org. The videos were created and graded by television manufacturers to showcase the capabilities of HDR over SDR, hence these contents have a high degree of contrast and colorfulness in order to be eye-catching. The source videos are in the limited-range YUV420 10-bit format. They were converted to their SDR versions using the NBC LUT.


Fig. 1: Content Descriptors, where 1=VoD videos, 2=Live Sports videos, 3=HDR demo videos, and 4=SJTU videos. See text for details.


4) SJTU videos: Six source contents belong to this category. These contents were taken from the open-source SJTU HDR Video Sequence Dataset. They were recorded using a Sony F65 camera and graded using the S-Gamut LUT. The videos are in the limited-range YUV420 10-bit format and were also converted to their SDR versions using the NBC LUT (an illustrative conversion sketch is given below). All of the source contents in this category are also present in the LIVE HDR database, although the distorted versions differ.
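To make the LUT-based HDR-to-SDR conversion step concrete, here is a minimal sketch that applies a 3D LUT with ffmpeg's lut3d filter. The LUT file name (nbc_hdr_to_sdr.cube), the output color tags, and the codec settings are illustrative assumptions; this is not the authors' actual conversion pipeline or the real NBC LUT file.

```python
# Minimal sketch (not the authors' exact pipeline): applying a 3D LUT to
# convert an HDR (PQ/BT.2020) source to SDR with ffmpeg's lut3d filter.
# The LUT file name below is a placeholder for the actual NBC LUT.
import subprocess

def hdr_to_sdr_with_lut(src: str, lut_path: str, dst: str) -> None:
    """Apply a .cube 3D LUT to an HDR source and tag the output as SDR."""
    cmd = [
        "ffmpeg", "-y",
        "-i", src,
        # lut3d applies the supplied .cube file to every frame.
        "-vf", f"lut3d=file={lut_path}",
        # Tag the output as BT.709 SDR; exact tagging may vary by workflow.
        "-color_primaries", "bt709",
        "-color_trc", "bt709",
        "-colorspace", "bt709",
        "-c:v", "libx265",
        "-pix_fmt", "yuv420p",
        dst,
    ]
    subprocess.run(cmd, check=True)

# Example with hypothetical file names:
# hdr_to_sdr_with_lut("source_hdr.mp4", "nbc_hdr_to_sdr.cube", "source_sdr.mp4")
```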


B. Content Descriptors

The Spatial Information (SI), Temporal Information (TI), Colorfulness, and Average Luminance Level were computed for each video sequence and plotted by group in Fig. 1. The VoD contents can be characterized as having lower average brightness levels, lower SI, lower TI, and a lower colorfulness index than the other categories. The Live Sports content has high TI. The Demo Videos have a high level of contrast, colorfulness, and brightness, since they were designed to showcase HDR’s capabilities. The SJTU videos include many different scenes and hence have a high degree of variability. Their SI and TI are lower than those of Groups 2 and 3, although one video of fireworks was an outlier in this group for all four content descriptors.
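For readers who want to reproduce these descriptors, the following is a minimal sketch using the common definitions: SI and TI as in ITU-T P.910 (standard deviations of Sobel-filtered luma and of successive frame differences) and the Hasler–Süsstrunk colorfulness metric. The exact implementation used in the study (for example, how HDR luma was handled and how values were pooled) may differ.

```python
# Sketch of standard content descriptors; not the study's exact code.
import numpy as np
from scipy import ndimage

def spatial_information(frames):
    """SI: max over frames of the std. dev. of the Sobel-filtered luma."""
    si_per_frame = []
    for y in frames:  # y: 2-D luma array for one frame
        gx = ndimage.sobel(y.astype(np.float64), axis=1)
        gy = ndimage.sobel(y.astype(np.float64), axis=0)
        si_per_frame.append(np.hypot(gx, gy).std())
    return max(si_per_frame)

def temporal_information(frames):
    """TI: max over frames of the std. dev. of successive luma differences."""
    frames = [f.astype(np.float64) for f in frames]
    return max(np.std(b - a) for a, b in zip(frames[:-1], frames[1:]))

def colorfulness(rgb):
    """Hasler & Suesstrunk colorfulness for a single RGB frame."""
    r, g, b = [rgb[..., i].astype(np.float64) for i in range(3)]
    rg, yb = r - g, 0.5 * (r + g) - b
    return np.hypot(rg.std(), yb.std()) + 0.3 * np.hypot(rg.mean(), yb.mean())
```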


C. Processing of Source Sequences

The VoD contents include studio-graded HDR and SDR source versions, while the other content categories have pristine HDR versions that were converted to SDR using the NBC LUT. The HDR and SDR versions were encoded using the bitrate-capped Constant Rate Factor (CRF) method of the x265 encoder. The maximum bitrates (maxrate), CRFs, and buffer sizes for each resolution are listed in Table I.
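As an illustration of bitrate-capped CRF encoding, the sketch below drives libx265 through ffmpeg with a CRF target constrained by a VBV maxrate and buffer size. The function name and the numeric values in the example are placeholders; the actual per-resolution CRFs, maxrates, and buffer sizes are those listed in Table I.

```python
# Sketch of bitrate-capped CRF encoding with libx265 via ffmpeg.
# The CRF, maxrate, and bufsize values below are illustrative only.
import subprocess

def encode_capped_crf(src, dst, width, height, crf, maxrate_kbps, bufsize_kbps):
    """Downscale and encode with CRF, capped by a VBV maxrate/bufsize."""
    x265_params = f"crf={crf}:vbv-maxrate={maxrate_kbps}:vbv-bufsize={bufsize_kbps}"
    cmd = [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale={width}:{height}",   # downscaling step
        "-c:v", "libx265",
        "-x265-params", x265_params,        # capped-CRF rate control
        "-pix_fmt", "yuv420p10le",
        dst,
    ]
    subprocess.run(cmd, check=True)

# Example with illustrative (not Table I) settings:
# encode_capped_crf("src_hdr.mp4", "out_1080p.mp4", 1920, 1080,
#                   crf=28, maxrate_kbps=3000, bufsize_kbps=6000)
```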


Nine HDR videos of the EPL6 content (from the Live Sports category) were taken from the APV HDR Sports database, nine HDR videos of the NightTraffic content (from the SJTU category) were taken from the LIVE HDR database, and nine HDR videos of the BTFB-01h05m40s content were taken from the LIVE AQ HDR database. Each group of nine videos included the pristine version as well as eight compressed versions. These 27 “anchor” videos were used to estimate a mapping between the scores from the prior HDR databases and the scores in the new HDR vs SDR database. Each of the anchor contents also had 6 compressed SDR versions at the maxrates and resolutions shown in Table I, as well as a pristine SDR version. Hence, each anchor content was associated with 16 video sequences (9 HDR versions and 7 SDR versions).


Each of the remaining 22 source contents was associated with a pristine HDR version, a pristine SDR version, six compressed HDR versions, and six compressed SDR versions, for a total of 308 videos. These videos, combined with those from the anchor contents, yield a total of 356 videos in the new database. There are thus 181 HDR videos and 175 SDR videos in the database.
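A quick arithmetic check of the database composition described above:

```python
# Check of the video counts stated in the text.
anchors, others = 3, 22
per_anchor_hdr, per_anchor_sdr = 9, 7   # 8 compressed + 1 pristine HDR; 6 + 1 SDR
per_other_hdr, per_other_sdr = 7, 7     # 6 compressed + 1 pristine, each format

hdr_total = anchors * per_anchor_hdr + others * per_other_hdr   # 27 + 154 = 181
sdr_total = anchors * per_anchor_sdr + others * per_other_sdr   # 21 + 154 = 175
assert hdr_total == 181 and sdr_total == 175
assert hdr_total + sdr_total == 356
```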


D. Display Devices

Three televisions were selected for the study: the 65” Samsung S95 (TV1), the 65” Samsung Q90T (TV2), and the 55” Amazon Fire TV (TV3). All of the televisions are capable of receiving and displaying HDR10 content. Peak luminance refers to the maximum brightness a television can reach when displaying HDR. Different televisions advertise different “peak luminance” capabilities, but they may only be able to produce the advertised value over a small portion of the screen for a short duration. This is done to prevent damage to the display and to reduce power consumption, and is referred to as Auto Brightness Limiting. We define peak luminance as the instantaneous brightness of a white rectangle displayed on an area covering 2% of the screen, as reported by [28].


The Samsung S95 has a quantum dot organic light-emitting diode (QD-OLED) display. Each pixel emits its own light, and hence the device can show very high contrast. It has 86.93% coverage of the BT.2020 color space and a peak luminance of 1028 cd/m². The Samsung Q90T is a Quantum Dot display with a Vertical Alignment (VA) LED backlight. Quantum Dots emit red and green colors with high accuracy, and can hence produce more vivid colors than standard LED TVs. The Samsung Q90T has 67.24% coverage of the BT.2020 color space and a peak luminance of 1170 cd/m². The Amazon Fire TV is an entry-level VA LED TV with 54.25% coverage of the BT.2020 color gamut and a peak luminance of 230 cd/m². The Samsung Q90T and the Amazon Fire TV have the full-array local dimming (FALD) feature, which dims the backlight in areas of the screen that are meant to be darker in order to increase contrast. However, due to the backlight, brightness can still “bleed” from brighter areas of the screen into darker areas, which can reduce contrast. The Samsung S95, on the other hand, is an OLED display, hence each pixel can be controlled individually for better contrast than FALD can allow. However, the Samsung Q90T can achieve higher peak brightness than the S95 because of the presence of the LED backlight as well as the Quantum Dots, which amplify light.


TABLE I: Maximum bitrates (maxrate), CRF values, and buffer sizes for each resolution.


A Windows PC with an NVIDIA 3090 GPU running the Windows 10 operating system was used to drive the televisions via an HDMI 2.1 cable. HDR was enabled on both the PC and the televisions. The screen resolution was set to 3840×2160 and the refresh rate to 30 Hz. The VLC media player was used for video playback.


E. Subjects

A total of 67 students at the University of Texas at Austin volunteered to participate in the human study. All of the subjects were between the ages of 20 and 28. Approximately two-thirds of the subjects identified as male, and the remaining third identified as female. A demographic survey revealed that 73% of the subjects identified as Asian, 20% as White, and 7% as Black.


Among these, 22 subjects were assigned to watch TV1, 21 were assigned to watch TV2, and 24 were assigned to watch TV3. None of the subjects were told about the other TVs or the nature of the study in order to eliminate biases. All the subjects passed the Ishihara test for color-blindness and the Snellen test for visual acuity when wearing their corrective lenses (if needed).


F. Subjective Testing Design

We employed a Single Stimulus method for the study, as described in ITU-R BT.500-13 [29]. Each video was shown once to each subject, and a quality score was collected from the subject immediately after the video was shown. The videos were displayed in random order, and the reference and distorted videos were not identified or given different treatment. Videos of the same content were not allowed to be adjacent in the viewing order, in order to reduce memory biases. Each quality score was collected on a hidden integer scale from 1 to 100 using a continuous slider with 5 verbal markers: “Bad,” “Poor,” “Fair,” “Good,” and “Excellent.” The slider was operated by a mouse. The study was divided into two sessions of approximately 40 minutes each, and the two sessions were separated by at least 24 hours to reduce viewer fatigue.
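As an illustration of the ordering constraint, the sketch below generates a random presentation order in which videos of the same source content never appear back to back. It is a simple rejection-sampling sketch with hypothetical names, not the authors' actual session script.

```python
# Sketch: randomized presentation order with no two adjacent videos
# sharing the same source content (not the authors' actual script).
import random

def shuffled_no_adjacent(videos, content_of, max_tries=10_000):
    """videos: list of video IDs; content_of: dict mapping video ID -> content ID."""
    for _ in range(max_tries):
        order = videos[:]
        random.shuffle(order)
        if all(content_of[a] != content_of[b] for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("No valid ordering found; retry or relax the constraint.")

# Example: 4 contents x 3 versions each (hypothetical IDs).
vids = [f"c{c}_v{v}" for c in range(4) for v in range(3)]
content = {vid: vid.split("_")[0] for vid in vids}
order = shuffled_no_adjacent(vids, content)
```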


Before each test session began, a training session was conducted in which the subject was familiarized with the setup using videos that are not part of the database. A set of six videos of the same content at varying levels of compression was shown to each subject, three in HDR and three in SDR, such that the quality range present in the database was fairly represented. Subjects were shown how to use the scoring mechanism. During the training session, instructions were given on how to rate the videos based on a subjective assessment of their quality, while avoiding judgments of the aesthetic content. No other instructions or details about the study were given, to avoid biasing the participants.


This paper is available on arXiv under a CC 4.0 license.