Conference Research Tests Adaptive Video and Quality Benchmarks

The Society for Imaging Science and Technology hosts the annual International Symposium on Electronic Imaging, held this year in San Francisco, California, from February 14 to 18. The Symposium has eight tracks across a range of disciplines, where researchers from industry and academia present papers and findings.

I attended primarily to learn the latest in two arenas: adaptive streaming and video quality benchmarks. In this article, I’ll present an overview of the sessions and papers I found most interesting and relevant. The first two relate to work on adaptive streaming performed by Google. The last two discuss how to measure the quality of adaptive streaming experiences.

A Subjective Study for the Design of Multi-Resolution ABR Video Streams With the VP9 Codec

One common problem facing encoding professionals is identifying when to switch between streams in an adaptive group. This paper, authored by Chao Chen, Sasi Inguva, and Anil Kokaram from YouTube/Google, presented a hybrid objective/subjective technique for identifying the appropriate data rate for switching stream resolutions. Though the experiment focused on the 4K/2K decision point using the VP9 codec, the technique can be used for any decision point and codec.

Adaptive streaming involves a group of encoding configurations at various resolution and data rate pairs. At each data rate in the ladder, the player has to choose the appropriate resolution. Intuitively, it’s the resolution that delivers the highest quality at that data rate, as shown in Figure 1, and Google broke no new ground in sharing this observation.


Figure 1. Theoretically, you want to switch resolutions to maximize quality throughout the encoding ladder.

As mentioned, Google’s focus was on the appropriate data rate to switch between 2K and 4K videos, and the short answer is that it’s between 4 Mbps and 5 Mbps when encoding with the VP9 codec. How Google got there is the interesting part.

Google selected 7966 4K videos uploaded to YouTube, created 2K versions, encoded both the 4K and 2K versions with VP9 at various data rates, and computed their Structural Similarity Index (SSIM) scores. Based upon these scores, the average switching rate between 2K and 4K was 4 Mbps. That is, below 4 Mbps, most 2K clips had a higher SSIM rating, while above 4 Mbps, the 4K videos had a higher SSIM rating.
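
To make the crossover idea concrete, here is a minimal sketch of how a switching rate could be estimated for one clip from two SSIM curves. The function name, the linear interpolation between tested rates, and the sample scores are all assumptions made for illustration; the paper's own procedure and data are not reproduced here.

```python
def switching_bitrate(rates_mbps, ssim_low_res, ssim_high_res):
    """Return the bitrate at which the higher resolution starts to win.

    Linearly interpolates between the two tested rates that bracket the
    crossover. Returns None if the curves never cross in the tested range.
    """
    # A positive difference means the higher resolution scores better.
    diffs = [hi - lo for lo, hi in zip(ssim_low_res, ssim_high_res)]
    for i in range(1, len(diffs)):
        prev, curr = diffs[i - 1], diffs[i]
        if prev < 0 <= curr:
            # Fraction of the way between the two rates where the gap is zero.
            frac = -prev / (curr - prev)
            return rates_mbps[i - 1] + frac * (rates_mbps[i] - rates_mbps[i - 1])
    return None


# Hypothetical SSIM scores, invented for illustration only.
rates = [2.0, 3.0, 5.0, 8.0]
ssim_2k = [0.930, 0.945, 0.955, 0.960]
ssim_4k = [0.910, 0.935, 0.965, 0.975]
print(switching_bitrate(rates, ssim_2k, ssim_4k))
```

Averaging this per-clip value over a large library is one way to arrive at a single average switching rate of the kind Google reported.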

To test this premise, Google ran subjective tests, but even Google doesn’t have the time, patience, or funds to test 7966 videos. In fact, the Google researchers wanted to subjectively test just 10. So the question was how to choose 10 videos that represent the entire universe of 4K clips that are and ever will be uploaded to YouTube. Not surprisingly, the answer is a bit wonky, though comprehensible if you simply trust Google's math.

The researchers reasoned that the encoding complexity of a video involves two main factors: the amount of motion in the clip and the amount of detail. To assess the amount of detail, the authors used I-frame size, since at a constant quantization parameter, more detail requires a larger file size to preserve. To measure the amount of motion in a clip, the researchers used average P-frame size divided by I-frame size, to decouple the effect that large I-frames can have on P-frames (this is the trust the math part).
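
The two measures are simple enough to sketch in a few lines. This version assumes you already have a list of frame types and sizes from a constant-QP encode, and it assumes that "I-frame size" means the average across the clip; the function name and the sample numbers are hypothetical.

```python
from statistics import mean


def complexity_features(frames):
    """Compute (detail, motion) for one clip.

    frames: list of (frame_type, size_bytes) tuples, frame_type "I" or "P".
    detail = average I-frame size
    motion = average P-frame size divided by average I-frame size
    """
    i_sizes = [size for ftype, size in frames if ftype == "I"]
    p_sizes = [size for ftype, size in frames if ftype == "P"]
    if not i_sizes or not p_sizes:
        raise ValueError("need at least one I-frame and one P-frame")
    detail = mean(i_sizes)
    motion = mean(p_sizes) / detail
    return detail, motion


# Hypothetical frame sizes in bytes, invented for illustration.
frames = [("I", 100000), ("P", 20000), ("P", 30000), ("I", 120000)]
detail, motion = complexity_features(frames)
```

Dividing by the I-frame size is what keeps a highly detailed but static clip from looking like a high-motion clip simply because all of its frames are large.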


Figure 2. Differentiating clips based upon the amount of motion and detail.

These metrics decided upon, the researchers encoded 3226 of the highest-quality clips in the 4K library to H.264 using FFmpeg with a constant quantization parameter of 28. Then, they measured I-frame size and the P-frame/I-frame size ratio and plotted the graph shown in Figure 2, which in essence creates a 9-slot taxonomy of 4K clips based upon the amount of motion and detail. In each region, they selected the 20 clips closest to the center, and from those selected the highest-quality clip.
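
The selection step within one region can be sketched as follows. How the researchers drew the region boundaries and measured distance is not described here, so the rectangular region, the per-axis normalization, and the function name are all assumptions.

```python
def nearest_to_center(clips, detail_range, motion_range, n=20):
    """Return the ids of the n clips closest to the center of one region.

    clips: list of (clip_id, detail, motion) tuples.
    detail_range, motion_range: (low, high) bounds of the region.
    """
    (d_lo, d_hi), (m_lo, m_hi) = detail_range, motion_range
    center_d, center_m = (d_lo + d_hi) / 2, (m_lo + m_hi) / 2
    inside = [c for c in clips if d_lo <= c[1] < d_hi and m_lo <= c[2] < m_hi]

    def squared_distance(clip):
        # Normalize by region size so detail (bytes) does not swamp
        # motion (a unitless ratio).
        dd = (clip[1] - center_d) / (d_hi - d_lo)
        dm = (clip[2] - center_m) / (m_hi - m_lo)
        return dd * dd + dm * dm

    return [c[0] for c in sorted(inside, key=squared_distance)[:n]]


# Hypothetical clips: (id, detail, motion).
clips = [("a", 50, 0.5), ("b", 10, 0.1), ("c", 55, 0.45), ("d", 200, 0.5)]
print(nearest_to_center(clips, (0, 100), (0, 1), n=2))
```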

From these clips, they produced 2K variants, and encoded these variants and the original 4K clips to 2 Mbps, 3 Mbps, 6 Mbps, and 11 Mbps. Then they scaled the 2K versions back to 4K for side-by-side subjective testing. The result was an average switching rate of about 5 Mbps. From this, the authors concluded: “In this sense, SSIM is probably a good quality index for the purpose of estimating the average resolution switching bitrate for large amount of videos. Although SSIM may overestimate or underestimate the quality for a particular video, its estimation error will be averaged out when estimating average quality for a large collections of videos.”

What’s significant about this study? Multiple items. First, it validates using SSIM as the basis for determining how to configure streams in adaptive groups. Second, the 4/5 Mbps switch point between 2K/4K video is interesting, though this will vary from codec to codec. Finally, if you find yourself having to select a limited set of clips that accurately reflect the characteristics of a larger group, the I and P-frame/I-frame technique described might just do the trick.

Optimizing Transcode Quality Targets Using a Neural Network With an Embedded Bitrate Model

One of the more significant recent events in the encoding world was Netflix’s per-title encode blog post, in which the authors discussed their schema for creating a custom encoding ladder for each video distributed by the service. Netflix’s approach involves multiple trial encodes, which works well when you distribute a large but limited set of content. The compressionists at YouTube have a completely different problem to manage: in essence, how to pull off per-title encoding when 300 hours of video are uploaded every minute of every day. This talk, and the above-titled paper, discussed their approach.

The paper, authored by Google’s Michele Covell, Martin Arjovsky, Yao-chung Lin, and Anil Kokaram, starts by describing the conditions that YouTube must work under. First, YouTube encodes files in parallel, splitting each source into chunks and then sending them off to different encoding instances. Since communications between these instances would complicate system design and operation, the solution couldn’t involve communications between these instances.

Second, any approach must be codec agnostic, because YouTube deploys multiple codecs. To make this work, the solution had to depend upon a single rate control parameter for each codec, though it can vary from codec to codec. For x264, which was the focus of the paper, YouTube used the Constant Rate Factor (CRF) value as the single rate control parameter.

CRF is a rate control technique that adjusts the quantization level to optimize quality over the duration of the file (or file segment). The problem with CRF is that it has no mechanism for hitting a target data rate; you set the CRF value, and x264 produces a file at whatever data rate is necessary to meet the selected quality level. YouTube’s files have to meet a target data rate, so the object of the exercise was how to choose the CRF level that would deliver the required data rate.
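
To get a feel for the inverse problem (target data rate in, CRF out), here is a rough sketch built on a rule of thumb often quoted for x264: bitrate roughly halves for every increase of 6 in CRF. That rule is only an approximation that varies by content, and it is emphatically not the model described in the YouTube paper; the function name and parameters are hypothetical.

```python
import math


def estimate_crf(ref_crf, ref_kbps, target_kbps, crf_step_per_halving=6.0):
    """Estimate the CRF needed to hit target_kbps.

    Assumes one known data point (ref_crf produced ref_kbps) and that
    bitrate halves for every crf_step_per_halving increase in CRF.
    """
    if ref_kbps <= 0 or target_kbps <= 0:
        raise ValueError("bitrates must be positive")
    crf = ref_crf + crf_step_per_halving * math.log2(ref_kbps / target_kbps)
    # x264 accepts CRF values from 0 to 51 for 8-bit video.
    return max(0.0, min(51.0, crf))


# A clip that came out at 8000 kbps with CRF 23 needs a higher CRF
# to land near 4000 kbps under this assumption.
print(estimate_crf(23, 8000, 4000))
```

The catch is the known data point: getting one requires an encode, which is exactly the cost YouTube was trying to avoid.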

One obvious solution would be to run a first encoding pass on all incoming files, and distribute this information to all encoding instances. However, implementing two-pass encoding would dramatically increase the encoding horsepower necessary to process the incoming load. For this reason, YouTube had to implement the solution, if at all possible, in a single pass.

As the paper describes, while YouTube can’t afford a first pass on all incoming files, it does gather some information from a high-bitrate mezzanine file produced from all incoming files. Essentially, because users upload files in a variety of formats, sizes, bit rates, and frame rates, this mezz file is necessary to normalize these files before encoding. When creating this mezz file, YouTube gleans many details about the file, though not up to the level of information gained from a true first-encoding pass.

Schooling the Neural Network

The issue was how to predict the right CRF value from this limited information, and for this, YouTube deployed a neural network. At a high level, a neural network is a mathematical model that learns to map inputs to outputs by training on examples. To train the network, YouTube performed over 137,000 encodes on 14,000 clips, and fed the data into the network (Figure 3). The researchers then encoded 1,000 test clips based upon input from the network and found that the system chose a CRF value that met the target data rate 65 percent of the time, within a tolerable bitrate error of under 20 percent. This would mean that 35 percent of the clips would have to be re-encoded to meet the target bitrate.
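
The accuracy figure is easy to reproduce for your own encodes. This sketch assumes "tolerable" means a relative bitrate error under 20 percent of the target, which is my reading of the result rather than a definition taken from the paper; the names and sample values are hypothetical.

```python
def within_tolerance(actual_kbps, target_kbps, tolerance=0.20):
    """True if the encode landed within the relative error tolerance."""
    return abs(actual_kbps - target_kbps) / target_kbps < tolerance


def hit_rate(results, tolerance=0.20):
    """Fraction of (actual_kbps, target_kbps) pairs within tolerance."""
    hits = sum(within_tolerance(a, t, tolerance) for a, t in results)
    return hits / len(results)


# Hypothetical results: two of these four encodes would need a re-encode.
results = [(1000, 1000), (1150, 1000), (1300, 1000), (790, 1000)]
print(hit_rate(results))
```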


Figure 3. Training the YouTube neural network.

The researchers next evaluated the learning benefit of incorporating the results of a fast, low-quality CRF encode into the system. Specifically, the system encoded a 240 pixel height video file at a CRF value of 40, and incorporated data from this encode into the neural network training. This boosted accuracy to 80 percent, which means that only 20 percent of the files needed re-encoding.
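
For reference, a probe encode along these lines could be produced with FFmpeg and libx264. The sketch below only builds the command line. The file names and the choice to drop audio are my assumptions, and YouTube's actual encoder settings are not described here.

```python
def probe_encode_command(source, output, height=240, crf=40):
    """Build an FFmpeg command line for a fast, low-resolution CRF encode."""
    return [
        "ffmpeg",
        "-i", source,
        # Scale to the probe height; -2 keeps the aspect ratio
        # while forcing an even width.
        "-vf", f"scale=-2:{height}",
        "-c:v", "libx264",
        "-crf", str(crf),
        "-an",  # drop audio; only the video statistics matter here
        output,
    ]


# Hypothetical file names.
cmd = probe_encode_command("mezzanine.mp4", "probe_240p.mp4")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run if you want to execute it.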

It’s tough to say how the average compressionist might use this research, though it does provide a fascinating look into the scale of YouTube’s operations, and an interesting example of what neural networks are and the type of work that they can perform. Whatever technique you use to optimize encodes, however, if you’re not thinking about per-title or per-category encoding optimization, you’re behind the curve.

Subjective Analysis and Objective Characterization of Adaptive Bitrate Videos

Assessing the quality of a single video file via subjective and objective testing is well-travelled ground. However, the Quality of Experience (QoE) of adaptive streaming is much more complicated, using multiple streams with different quality levels and different algorithms to determine when and how often to switch streams. This paper, authored by Jacob Søgaard (Technical University of Denmark), Samira Tavakoli (Universidad Politécnica de Madrid), Kjell Brunnström (Acreo Swedish ICT AB and Mid Sweden University), and Narciso García (Universidad Politécnica de Madrid), provides a great explanation of the types of testing performed to assess the QoE of adaptive streaming. Unfortunately, it shows that highly accessible and easy-to-apply objective tests are poor predictors of actual subjective ratings.

Rating the QoE of Adaptive Streaming

Near the start of their paper, the authors reference a highly useful paper entitled Quality of Experience and HTTP Adaptive Streaming: A Review of Subjective Studies, which I Googled and was able to download. I suggest that you do the same. As the title suggests, this paper reviewed previous studies and summarized their conclusions, which are relevant to all streaming producers.

One of the key issues tackled was the frequency of stream switching, which we in part control as producers via the number of streams created for each source. Create many variations close together, and you’ll have lots of stream switching; create a lower number over a wider range, and you’ll have fewer stream switches.

Regarding the QoE impact of stream switching, conclusions are mixed. Most studies found that frequent stream switching reduced QoE, which seemed to argue for fewer streams. Other studies concluded that viewers preferred multiple, gradual variations over a single abrupt variation, which seemed to argue for more streams.

The type of content produced should definitely play a role in the switching strategy. Stream switching was less noticeable in content with frequent camera angle or scene changes, like movies or sports videos, than with more steady content, like a single camera lecture or conference. This makes perfect sense, since your attention is drawn away from video quality while catching up with the new scenes unfolding in front of you. The takeaway: Movies, sports, and similar content can benefit from more streams in the adaptive group because the changing cameras and camera angles mask switching from the viewer. Conference or training videos, particularly those with a single camera angle, benefit from fewer streams, as most viewers prefer even a lower quality stream over frequent switching.

One key decision facing video producers is which stream to deliver first. Some prefer a low quality stream that gets the video playing fast; others want a high quality stream that may buffer for a bit but will deliver a good first impression. Studies referenced in this paper showed “that a low startup bitrate followed by slow increase ("ramp-up") of quality clearly degrades the QoE.”

With HTTP Live Streaming, the first item specified in the M3U8 playlist is the first stream seen by the viewer. These studies clearly show that, as Apple recommends in TN2224, producers should deploy different playlists for mobile, desktop, and perhaps even OTT with different initial streams. Specifically, TN2224 states, “You should create multiple playlists that have the same set of streams, but each with a different first entry that is appropriate for the target network. This ensures the user has a good experience when the stream is first played.”

While the TechNote’s recommendations of a 150 kbps stream for mobile and 440 kbps for Wi-Fi seem conservative, the point is clear: What the viewer sees first sets the tone. The best strategy is to set the first stream at the highest possible quality the viewer can sustain.
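
Producing the per-network playlists is mostly a matter of reordering. Here is a minimal sketch that writes an HLS master playlist with a chosen variant listed first; the stream URIs, resolutions, and bandwidth values are hypothetical placeholders, and a production playlist would normally carry more attributes (CODECS, for example).

```python
def master_playlist(variants, first_bandwidth):
    """Return master playlist text with one variant moved to the top.

    variants: list of dicts with "bandwidth" (bits per second),
    "resolution" (e.g. "640x360"), and "uri" keys.
    """
    first = [v for v in variants if v["bandwidth"] == first_bandwidth]
    if not first:
        raise ValueError("no variant with that bandwidth")
    rest = [v for v in variants if v["bandwidth"] != first_bandwidth]
    lines = ["#EXTM3U"]
    for v in first + rest:
        lines.append(
            f'#EXT-X-STREAM-INF:BANDWIDTH={v["bandwidth"]},'
            f'RESOLUTION={v["resolution"]}'
        )
        lines.append(v["uri"])
    return "\n".join(lines) + "\n"


# Hypothetical ladder; the same streams serve every playlist.
variants = [
    {"bandwidth": 150000, "resolution": "416x234", "uri": "low/index.m3u8"},
    {"bandwidth": 440000, "resolution": "640x360", "uri": "mid/index.m3u8"},
    {"bandwidth": 2000000, "resolution": "1280x720", "uri": "high/index.m3u8"},
]
cellular = master_playlist(variants, 150000)
wifi = master_playlist(variants, 440000)
```

Serving the right playlist then comes down to detecting the client or network type at request time, which is outside the scope of this sketch.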

Back on Point

Getting back to the current paper, the authors largely focused their attention on the best methodology to test adaptive streaming strategies. For their subjective tests, the authors used seven six-minute clips encoded to four quality levels and compressed into 2- and 10-second chunk sizes. These were assembled into test sequences of increasing and decreasing quality, with gradual and rapid quality changes. The entire clips at constant quality were also subjectively reviewed. In all, the authors created 132 different test sequences that were used for multiple tests.

To be clear, each test sequence was a “canned” adaptive streaming experience, complete with programmed stream switching. That way, the authors could systematically compare the perceived viewer quality of a stream with many or few stream changes, or with abrupt or gradual stream changes (Figure 4).
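
A canned sequence is really just a list of quality levels, one per chunk. The sketch below shows one way such sequences could be laid out for a six-minute clip. The even split of levels across the clip and the specific gradual and rapid patterns are my assumptions, not the authors' actual test design.

```python
def canned_sequence(duration_s, chunk_s, levels):
    """Spread the given quality levels evenly across the clip's chunks."""
    n_chunks = duration_s // chunk_s
    return [levels[i * len(levels) // n_chunks] for i in range(n_chunks)]


def count_switches(sequence):
    """Number of times the quality level changes between adjacent chunks."""
    return sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)


# Six-minute clip (360 seconds), quality levels numbered 1 (lowest) to 4.
gradual_up = canned_sequence(360, 2, [1, 2, 3, 4])   # 2-second chunks
rapid_up = canned_sequence(360, 10, [1, 4])          # 10-second chunks
print(len(gradual_up), count_switches(gradual_up))
print(len(rapid_up), count_switches(rapid_up))
```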


Figure 4. Focusing on the perceived quality of the HTTP Adaptive Stream, rather than the quality of each video segment.

The authors had used these particular test sequences previously, but in a different way. Specifically, in previous tests, they broke each six-minute test video into multiple shorter sequences containing individual switching events, with a quality assessment after each shorter sequence. These tests were administered with and without audio.

In the tests documented in this paper, the researchers tested subjective quality only after playing the entire six-minute test sequence. The specific question they wanted to answer was whether or not the results in the shorter tests accurately predicted the results from the longer test. This is important because shorter tests are much easier on the test subjects.

While both sets of shorter tests proved relatively accurate at predicting the results of the longer test, the shorter tests without audio were almost an exact match. The authors postulated that the lack of audio in the initial tests allowed subjects to focus on video quality, producing more accurate results.

The authors also tested whether or not objective, non-referential tests that focused on the conditions that might have produced the subjective scores (such as blockiness, blur, brightness, and noise) were an accurate predictor of subjective ratings. Briefly, non-referential tests examine only the encoded video itself, and don’t compare the encoded video to the source. In contrast, full-reference benchmarks like Peak Signal-to-Noise Ratio (PSNR) and the aforementioned SSIM metric compute their scores by comparing the encoded video back to the source, which is much more time-consuming and challenging, because it requires the source file to measure quality. If the researchers could prove that the non-referential benchmarks tested had a high correlation with the results of the subjective tests, it would have been a huge benefit for researchers, since non-referential tests are fast, easy, and inexpensive to apply.
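
To see why a full-reference metric needs the source, consider PSNR, which is computed directly from the pixel differences between source and encode. This minimal sketch works on flat lists of 8-bit pixel values for a single frame; real tools operate on decoded, frame-aligned video, and the sample values here are invented.

```python
import math


def psnr(reference, encoded, max_value=255):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(encoded) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error between source and encoded pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, encoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(max_value ** 2 / mse)


# Hypothetical pixel values.
source_pixels = [10, 20, 30, 40]
encoded_pixels = [12, 18, 30, 40]
print(psnr(source_pixels, encoded_pixels))
```

Without the reference list there is nothing to subtract from, which is the whole appeal, and the whole difficulty, of non-referential metrics.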

No joy here, however, as the researchers found that these tests were a poor predictor of overall subjective quality, though performance could be improved by clustering videos based on spatial and temporal characteristics. The bottom line is that producers can’t generally use these tests to predict the quality of an adaptive streaming experience.

Applicability of Existing Objective Metrics of Perceptual Quality for Adaptive Video Streaming

In this related paper and presentation, written by Jacob Søgaard (Technical University of Denmark), Lukáš Krasula (Czech Technical University in Prague and LUNAM University), Muhammad Shahid (Blekinge Institute of Technology), Dogancan Temel (Georgia Institute of Technology), Kjell Brunnström (Acreo Swedish ICT AB and Mid Sweden University), and Manzoor Razaak (Kingston University), the authors extended the limited set of non-referential objective metrics tested in the previous paper to a range of full reference and non-reference objective metrics such as PSNR, SSIM, and VQM. Again, the specific issue was whether or not these metrics could predict the subjective scores of the complete six-minute adaptive streaming experience, with stream switches and all, not the subjective video quality of a single compressed stream.


Figure 5. A generally low correlation between VQM-VFD scores and subjective rankings.

Again, the answer was no joy, though two tests, VQM-VFD (Figure 5) and PEVQ-S, showed some promise. As an example, while VQM-VFD is a sophisticated metric that deploys a neural network to interpret the inputs, the scatter graph shown in Figure 5 shows a generally low correlation between the Mean Opinion Scores (MOS) reported by the subjective tests and the VQM-VFD test results. Overall, the authors concluded, “[u]pon experimenting with existing objective methods for their applicability on such videos, we observed that these methods, which are known to perform well otherwise, severally fall short in accuracy in this case.” The bottom line is that just because a tool performs well for one video distribution system (e.g., IPTV over UDP), producers can’t assume that it will perform equally well for assessing the quality of an adaptive streaming experience.
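
If you want to run the same sanity check on your own metric of choice, correlation against subjective scores is straightforward to compute. This sketch uses the Pearson coefficient, which is my assumption; the paper may report other statistics, and the scores below are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Hypothetical MOS values and objective metric scores for five sequences.
mos = [4.1, 3.2, 2.5, 3.9, 1.8]
metric = [0.62, 0.70, 0.41, 0.55, 0.48]
print(pearson(mos, metric))
```

A value near 1 or -1 means the metric tracks viewer opinion closely; values near 0 mean it tells you little about what viewers will think.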
