The Netflix blog post entitled Per-Title Encode Optimization boldly declares that “to deliver the best quality video to our members, each title should receive a unique bitrate ladder, tailored to its specific complexity characteristics.” In a world where many companies simply deploy Apple’s recommendations from TN2224 without modification, it’s a breath of fresh air. The blog post goes on to detail how Netflix creates their per-title encoding ladders.
While the Netflix post provides some valuable universal truths, there are also some mysteries that make these truths challenging to apply. Also, since Netflix is a subscription service, there are some caveats that should be considered by companies who aren’t in similar businesses. After a quick overview, I’ll discuss these truths, mysteries, and caveats.
As you’ll read at the end of this post, on January 26th, 2016, at 2:00 PM EST, I’ll be hosting a webinar detailing lessons learned from the Netflix post, and describing a procedure companies can use to implement the content-aware encoding that Netflix so strongly advocates. If you want to skip the article and click to the webinar description, click here.
The encoding world has long been dominated by one-size-fits-all encoding “ladders,” or sets of resolution/bitrate pairs. In the blog post, Netflix shared that they had previously used the following combinations to produce “good quality encodes” for most content.
Table 1. Netflix’s traditional one-size-fits-all bitrate ladder.
Netflix then described the problem with this approach, which is that for some challenging videos, “the highest 5800 kbps stream would still exhibit blockiness.” At the other end of the spectrum, “for simple content like cartoons, 5800 kbps is far more than needed to produce excellent 1080p encodes. In addition, a customer whose network bandwidth is constrained to 1750 kbps might be able to watch the cartoon at HD resolution, instead of the SD resolution specified by the ladder above.” In short, each video has a unique complexity, and a single encoding ladder can’t optimize the efficiency or viewing experience for all viewers.
To represent this “very high diversity in signal characteristics” of the videos that Netflix encodes, the blog presented the following graph, which showed 100 files encoded using x264’s constant QP (quantization parameter) mode, which encodes each file to a consistent quality. At a high level, QP encoding seeks to deliver a certain quality level, and varies the data rate to achieve this. Netflix is measuring the quality using Peak Signal-to-Noise Ratio (PSNR), where higher scores indicate better quality.
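For readers who haven’t worked with the metric, PSNR is derived from the mean squared error between the source pixels and the encoded pixels. The sketch below is a minimal illustration for 8-bit samples; the function name and the sample pixel values are hypothetical and do not come from the Netflix post.

```python
import math

def psnr(source, encoded, max_value=255):
    """Return PSNR in dB between two equal-length sequences of pixel values."""
    if len(source) != len(encoded):
        raise ValueError("source and encoded must contain the same number of pixels")
    # Mean squared error between corresponding pixels.
    mse = sum((s - e) ** 2 for s, e in zip(source, encoded)) / len(source)
    if mse == 0:
        return math.inf  # identical signals have no measurable noise
    return 10 * math.log10(max_value ** 2 / mse)

# Hypothetical 8-bit luma samples; every encoded pixel is off by one level.
source = [52, 55, 61, 66, 70, 61, 64, 73]
encoded = [53, 54, 60, 67, 69, 62, 63, 74]
print(round(psnr(source, encoded), 2))
```

Because the score is logarithmic, small errors produce high values and each additional dB represents a proportional reduction in error, which is why the scores in the graph cluster in a fairly narrow numeric range.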
Figure 1. A representation of the bitrate/PSNR of 100 Netflix titles.
To create the graph, Netflix encoded all files at four different QP levels, as you can see by the four points on the bottom lavender-colored line. Looking at that plot, plus the aqua line immediately above it, you can see that even though the QP encoding delivered a very high data rate, the quality level, which was around 38 dB for both files, was comparatively low. This indicates that these files are very challenging to encode.
At the other end of the spectrum, the aqua line pointing nearly vertical at the top of the graph topped out at over 48 dB at 2 Mbps, despite using the same QP value as the two encodes at the bottom. That’s dramatically higher quality at less than 10% of the data rate, indicating that the top aqua line represents a very easy-to-encode file. As it relates to the compression ladder, these results prove that a one-size-fits-all solution either applies too high a data rate (to the file on top of the graph), or too low a data rate (to the files on the bottom).
OK, you get it; some files are hard to compress, some files are easy to compress, so you should encode them using different bitrate ladders. Before moving on to the next point, I wanted to tie PSNR scores to subjective ratings, which Netflix is obviously very qualified to do. Specifically, for that hard to compress file at the bottom of the graph, a PSNR level of 38 dB is “acceptable.” At other points in the discussion, Netflix says that scores under 35 dB will show encoding artifacts, while scores above 45 dB produce no perceptible quality improvements. While I don’t favor PSNR (as explained below), these are all useful data points for those who use the metric.
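The thresholds quoted above can be captured in a small helper. This is only a sketch of the rough bands mentioned in the Netflix post (under 35 dB and over 45 dB); the function name and the wording of the labels are my own.

```python
def interpret_psnr(psnr_db):
    """Map a PSNR score to the rough quality bands quoted from the Netflix post."""
    if psnr_db < 35:
        return "likely to show encoding artifacts"
    if psnr_db > 45:
        return "no further perceptible improvement"
    return "between the artifact and transparency thresholds"

# The two extremes discussed above: roughly 38 dB and 48 dB.
for title, score in [("hard-to-encode title", 38), ("easy-to-encode title", 48)]:
    print(f"{title}: {score} dB -> {interpret_psnr(score)}")
```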
Plotting the Convex Hull
After establishing that all files needed different encoding ladders, the blog post goes on to describe how Netflix produces the ladder. At a very high level, Netflix runs a number of test encodes at different resolutions and QP values to plot the PSNR quality at each data rate/resolution pair, and uses that to identify the optimum encoding ladder.
One observation made in the post is that while increasing the data rate at the same resolution consistently increases stream quality, these quality increases flatten out once the bitrate goes above a certain threshold. You can see this for the low, mid, and high resolution plots in Figure 2. If you plot a line that includes the peak quality/bitrate efficiency points from all resolutions, you get a “convex hull,” a term describing the shape that most efficiently bounds all data points.
Figure 2. Plotting the convex hull, where each resolution or resolutions delivers maximum quality.
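To make the convex hull idea concrete, here is a minimal sketch that takes test-encode results from several resolutions and keeps only the points on the upper hull of the bitrate/quality plot. It uses a generic monotone-chain hull, which is my stand-in rather than Netflix’s actual procedure, and all of the bitrate and PSNR numbers are invented for illustration.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def upper_hull(points):
    """Return the upper convex hull of (bitrate, psnr, label) points, left to right."""
    hull = []
    for p in sorted(points):
        # Drop the last hull point while it lies on or below the line to p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

# Hypothetical test encodes: (bitrate in kbps, PSNR in dB, resolution).
encodes = [
    (250, 30.0, "480p"), (500, 33.0, "480p"), (1000, 34.5, "480p"), (2000, 35.0, "480p"),
    (500, 31.0, "720p"), (1000, 35.5, "720p"), (2000, 37.5, "720p"), (4000, 38.0, "720p"),
    (1000, 33.0, "1080p"), (2000, 37.0, "1080p"), (4000, 40.0, "1080p"), (6000, 41.0, "1080p"),
]

for bitrate, quality, resolution in upper_hull(encodes):
    print(f"{bitrate:>5} kbps  {quality:.1f} dB  {resolution}")
```

With this made-up data the hull hands off from 480p to 720p to 1080p as the bitrate climbs, which is the behavior Figure 2 illustrates: each resolution is the most efficient choice only within its own bitrate range.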
Here my grasp of the math and technique described in the post becomes strained. It seems obvious that for each resolution, the data rate selected would be the point on the convex hull. And Netflix is clear that they produce with a finite set of resolutions. What’s unclear is whether each resolution gets a single encode, or whether Netflix encodes at multiple data rates at the same resolutions.
This statement causes my confusion: “The bitrate selection is also limited to a finite set, where the adjacent bitrates have an increment of roughly 5%.” Does this mean that there are multiple encodes at bitrates roughly 5% apart, or are these the bitrates for which Netflix tried to ascertain the highest quality resolution, in essence the test targets? Netflix could have resolved this issue by showing one complete bitrate ladder (as requested in a comment), but hasn’t done so.
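To get a feel for what a roughly 5% increment implies, the sketch below builds a geometric series of candidate bitrates. The 1750 and 5800 kbps endpoints are borrowed from figures quoted earlier purely for illustration; the function name and the exact 1.05 multiplier are my assumptions, not values published by Netflix.

```python
def candidate_bitrates(low_kbps, high_kbps, step=1.05):
    """List bitrates from low_kbps up to high_kbps, each about 5% above the last."""
    rates = []
    rate = float(low_kbps)
    while rate <= high_kbps:
        rates.append(round(rate))
        rate *= step  # move to the next candidate, 5% higher
    return rates

rates = candidate_bitrates(1750, 5800)
print(len(rates), "candidates:", rates)
```

Between those two endpoints alone the series contains 25 candidates, far more than the handful of rungs in a typical adaptive group, which is why it matters whether these are production encodes or simply test targets.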
Note that this is a critical issue. The procedure detailed in the blog post focuses solely on optimizing quality, not on whether the encoding ladder performs well in the context of an adaptive group. In this regard, Apple Tech Note TN 2224 advises producers to keep “adjacent bit rates at a factor of 1.5 to 2 apart.” A seminal Adobe white paper on the topic explains why. “Too many bit rates too close to one another could result in too many stream switches, even with smaller bandwidth fluctuations. Besides the slight overhead in switching, the viewer’s experience with too-frequent quality fluctuations may not be pleasant.” So one big question is how many adaptive variants are produced for each source file, and how that changes with different content.
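A quick way to test a ladder against the TN2224 spacing guidance quoted above is to compute the ratio between each pair of adjacent rungs. The sketch below does that; the function name and the sample ladder are hypothetical, and the 1.5 to 2 window is taken from the quote.

```python
def spacing_report(ladder_kbps, low=1.5, high=2.0):
    """Return (lower, upper, ratio, within_window) for each adjacent pair of rungs."""
    rates = sorted(ladder_kbps)
    report = []
    for lower, upper in zip(rates, rates[1:]):
        ratio = upper / lower
        report.append((lower, upper, round(ratio, 2), low <= ratio <= high))
    return report

# Hypothetical ladder with two rungs (1200 and 1300 kbps) placed too close together.
for lower, upper, ratio, ok in spacing_report([400, 700, 1200, 1300, 2400, 4500]):
    status = "ok" if ok else "outside 1.5x to 2x"
    print(f"{lower} -> {upper} kbps: {ratio}x ({status})")
```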
Another critical question is the encoding technique actually used for Netflix’s production encodes. Specifically, while Netflix clearly uses QP encoding as a tool to identify the optimal data rate target for each file, it’s unclear whether Netflix encodes the ultimate distribution files via QP, perhaps limiting the data rate via some buffer-related mechanism, or if they fall back to some traditional bitrate-based encoding.
If it’s QP-based encoding with a buffer limitation, it would be interesting to learn if the files adhere to the required 10% variability threshold dictated in Apple Technote TN 2224, or whether Netflix simply ignores this specification. I get the strong feeling that Apple’s dictate is honored more in the breach than in the observance, and I’d love to hear Netflix’s position on this issue.
Beyond these issues, the blog post has one key mystery: how does Netflix compute PSNR on files of multiple resolutions as compared to the same source? That is, all tools that I’ve used to compute PSNR require the resolution of the source file and the encoded file to be the same. You can compute PSNR on an encoded 1080p file only by comparing it to a 1080p original. You can’t compute PSNR on a 720p file by comparing it to the 1080p file.
The only tool I’m aware of that can perform cross-resolution and device-aware testing is the SSIMWave Video Quality-of-Experience Monitor (SQM), which can deliver an SSIMplus rating (see review here). Though SQM can deliver a PSNR score for source and test files of like resolution, it can’t for files of disparate resolution. Here’s an excerpt from the SQM manual on the topic.
Figure 3. The only cross-resolution test tool I know can’t produce cross-resolution PSNR values.
My go-to tool for PSNR, SSIM, and VQM testing is the Moscow University Video Quality Measurement Tool, which can’t perform cross-resolution testing of any kind (see review here). I asked my contact at the University of Moscow about cross resolution PSNR testing, and he replied that it was technically possible, but didn’t indicate whether they would implement this at any point in the future.
So how the heck did Netflix compute the PSNR on files with disparate resolutions? There is some clue in the left axis title of the charts with PSNR values (including Figure 1 above) that states “Scaled PSNR.” Apparently, Netflix is scaling the values in some way to account for the resolution differences in the encoded files. It would be good to know what Netflix is doing here to produce these results.
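One plausible reading of “Scaled PSNR,” and it is only an assumption on my part, is that each lower-resolution encode is scaled back up to the source resolution before the comparison is made. The sketch below illustrates that idea with a crude nearest-neighbor upscale on tiny grayscale frames; the function names, the scaling filter, and the pixel data are all hypothetical, and nothing here is confirmed by the Netflix post.

```python
import math

def upscale_nearest(frame, out_w, out_h):
    """Upscale a 2D list of pixels to out_w x out_h using nearest-neighbor sampling."""
    in_h, in_w = len(frame), len(frame[0])
    return [
        [frame[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]

def psnr_2d(source, test, max_value=255):
    """PSNR in dB between two frames that have the same dimensions."""
    errors = [
        (s - t) ** 2
        for src_row, test_row in zip(source, test)
        for s, t in zip(src_row, test_row)
    ]
    mse = sum(errors) / len(errors)
    return math.inf if mse == 0 else 10 * math.log10(max_value ** 2 / mse)

# Hypothetical 4x4 source frame and a 2x2 decoded frame from a lower-resolution encode.
source = [
    [10, 10, 20, 20],
    [10, 10, 20, 20],
    [30, 30, 40, 40],
    [30, 30, 40, 40],
]
decoded_low_res = [[12, 20], [30, 40]]

# Scale the decoded frame up to the source dimensions, then compare.
scaled = upscale_nearest(decoded_low_res, out_w=4, out_h=4)
print(round(psnr_2d(source, scaled), 2))
```

A real workflow would use a proper scaling filter on full video frames, and the choice of filter would itself affect the scores, which is one more reason it would help to know exactly what Netflix did.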
At this point, I should probably also express my bias against the PSNR test as the basis for making compression-related decisions. I explain part of this in my post, “Why I like VQM better than PSNR or SSIM,” and confess to a growing appreciation for the SSIMplus metric, which ties to anticipated viewer ratings, can perform cross-resolution testing, and is device specific. To be fair, Netflix acknowledges PSNR’s deficits by stating:
Although PSNR does not always reflect perceptual quality, it is a simple way to measure the fidelity to the source, gives good indication of quality at the high and low ends of the range (i.e. 45 dB is very good quality, 35 dB will show encoding artifacts), and is a good indication of quality trends within a single title.
I agree that PSNR is a good indication of quality trends in a file, but if it doesn’t “always reflect perceptual quality,” why not base this analysis on a different metric like VQM (scaled as Netflix is scaling PSNR), or SSIMplus? You can browse through some test results that compared PSNR, SSIM, VQM, and SSIMplus in a post titled, The SSIMplus Index for Video Quality-of-Experience Assessment. Note that it was produced by employees of SSIMWave, the developer of SSIMplus.
Interestingly, in the paper Optimal Set of Video Representations in Adaptive Streaming, a dense work mentioned in a comment on the Netflix blog, the researchers stated that, “we model the satisfaction function as an Video Quality Metric (VQM) score, which is a full-reference metric that has higher correlation with human perception than other MSE-based metrics.” MSE means mean squared error, and PSNR is such a metric.
In short, if SSIMplus is better, and VQM is better, why use PSNR, particularly if you’re drawing quality-related conclusions from the scores, not just quality trends?
The caveats are fairly obvious. First, Netflix is a subscription service, so all bandwidth related to file delivery is fully funded. When ascertaining the highest bitrate, quality is the critical factor for most files.
But what about non-subscription services? Of course, all videos are funded one way or another, whether by advertising or by dipping into the marketing or training budget. In most of these non-subscription cases, the maximum data rate is dictated by bandwidth cost, not quality. When cost determines the maximum bitrate, the analysis becomes which resolution delivers the best quality at that data rate, which is why effective cross-resolution testing is so essential. That is, you say 3 Mbps is the limit, and the analysis becomes whether 1080p, 720p, or 540p delivers the best quality.
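When cost sets the ceiling, the selection logic is simple once you have cross-resolution quality scores in hand. The sketch below picks the best-scoring resolution at or under a bitrate cap; the function name and the scores are hypothetical, and the quality values are assumed to come from a metric that can compare different resolutions against the same source.

```python
def best_resolution_at_cap(measurements, cap_kbps):
    """Return the (resolution, bitrate_kbps, quality) entry with the top score under the cap."""
    eligible = [m for m in measurements if m[1] <= cap_kbps]
    if not eligible:
        return None  # nothing fits under the cap
    return max(eligible, key=lambda m: m[2])

# Hypothetical test encodes: (resolution, bitrate in kbps, cross-resolution quality score).
measurements = [
    ("1080p", 4500, 39.5),  # over the cap, so never considered
    ("1080p", 3000, 36.2),
    ("720p", 3000, 38.1),
    ("540p", 3000, 37.0),
]

print(best_resolution_at_cap(measurements, cap_kbps=3000))
```

In this made-up case the 720p encode wins at 3 Mbps even though a 1080p encode is available at the same rate, which is exactly the kind of result you can’t find without cross-resolution testing.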
Also, as an OTT service, Netflix displays most videos at full screen. In contrast, most producers delivering shorter content produce for a smaller display window. This is why Apple’s TN2224 has two bitrates at the 640×360 window, while Netflix has none. The first rule of producing for adaptive streaming is to have at least one stream for each window size on your website, and many producers use two or more.
So while the Netflix blog post breaks new ground in justifying content-aware encoding, few producers should apply it whole cloth.
None of the issues or caveats raised above are meant in any way to say that the Netflix scheme is broken, or that the blog post is unclear. As I learned long ago, what’s patently obvious to PhD mathematicians is often beyond the grasp of mere mortals such as myself. And when the experientially-based observations of actual users like Netflix clash with white paper theories proposed by technologists like Adobe or even Apple, I trust the user.
But one of my tiny roles in the compression universe is to help others understand and deploy brilliant advances like those discussed in the Netflix blog post. In trying to do so, I encountered the above-described implementation issues. If I found myself on a barstool next to one of the Netflix authors, these are the questions that I would ask, the points I would request clarification on.
These caveats aside, the Netflix post draws a bold line in the sand: a single encoding ladder is insufficient for companies distributing disparate types of videos. As mentioned above, even if you’re distributing relatively homogeneous videos, if you’re using an encoding ladder not customized for your video type, it’s almost certainly suboptimal.
In short, TN2224 is dead (at least for broad-brush implementations); welcome to the new era of content-aware encoding.
Content Aware Encoding-The Webinar
OK, speaking of my tiny role in the compression universe, here’s the sales pitch. I’m all in on content-aware encoding, and truth be told, was before the Netflix post (as an editor and several encoding clients can attest). But the Netflix post crystallized my thoughts on the matter and added some valuable procedural workflows. On January 28, 2016, at 2:00 PM EST, I’ll present a webinar detailing a simple but effective technique for implementing content-aware encoding for your content.
In the webinar, you will learn:
– Lessons learned from the Netflix blog post, including some not shared above
– A simple procedure for identifying the optimal encoding ladder for each category of content (even if you only have one type)
– How to verify the quality of the various stream compositions with objective benchmarks like PSNR, SSIMplus and VQM, and tools like the SSIMWave Quality of Experience Monitor
– How encoding with HEVC changes the equation
During the webinar, you’ll see how I applied the procedure to produce content-aware encoding ladders for high motion video (Tears of Steel), simple animations (Big Buck Bunny), complex animations (Sintel), talking head video (yours truly), screencam videos and videos comprised of PowerPoint and talking heads.
You’ll walk away knowing how to test the encoding complexity of your own footage and how to create a content-aware encoding ladder. You’ll also receive encoding ladder templates you can immediately deploy if you have content in the above-described categories.
The webinar will cost $30.72, and you can click below to sign up for the event.