As we transition from H.264 to VP9, HEVC, AV1, and soon VVC (Versatile Video Coding), it’s important to understand the fundamentals of codec comparisons and how to evaluate their effectiveness and utility. In this expanded column I’ll cover both.
Evaluating the Evaluation
Let’s begin with how to evaluate the evaluation. I start by identifying the evaluator and its affiliations, giving more credibility to actual users of the technology, like Netflix or Facebook, than to vendors. Though both companies are members of the Alliance for Open Media, and so have some degree of bias, a staffer who publishes a paper detailing a certain quality level knows he’ll have to deliver that quality when it’s time to deploy.
At the other end of the credibility spectrum are reports prepared by non-practicing companies affiliated with one of the HEVC patent groups. They’re not actually using any video technology at scale, and they have a clear financial incentive to find their technology superior.
When reviewing reports from research and technology shops, like Moscow State University (MSU), I focus on who funded the report. MSU funds most of its own reports, so I give its work great credibility. If a report is funded by a third party, I look at the interests of that party.
Next, I identify which version of the codec is actually evaluated. Remember that there are multiple HEVC, H.264, VP9, and even AV1 codecs, each with different dynamics. HEVC proponents assert that the HEVC reference codec is the true gauge of encoding quality, though this codec isn’t commercially used. My preference is to compare commercially available codecs, particularly those used at scale, like x264 or x265, or AV1 as delivered in FFmpeg 4.x.
Then I consider the version of that codec, which is a concern for slow-moving academic papers that can take months to get from testing to publication. AV1, in particular, will change significantly over the next few months, so a review that’s 9 to 12 months old may differ dramatically from what’s currently available.
Then I look at how the encoding parameters for each codec are derived. I ask the codec vendors to supply encoding parameters, eliminating any bias or learning curve. MSU does the same. I tend to discount any study that doesn’t consult with the codec vendors.
I also consider how many clips are used and their composition. More clips are better, and they should be diverse in terms of motion, complexity, and real-world and animated content.
Finally, I consider the operational bias of the tester. For example, Facebook evaluated AV1 for VOD distribution to millions of viewers, which minimizes the impact of encoding time and cost. While useful for publishers with similar volume, this data isn’t meaningful for smaller producers and is completely irrelevant for live producers.
Once you’ve considered the pedigree and focus of the evaluator and study, it’s time to understand the components and results.
There are two ways to analyze encoded files—using actual viewers or using objective quality metrics like Peak Signal-to-Noise Ratio (PSNR), Structural SIMilarity Index (SSIM), Video Multimethod Assessment Fusion (VMAF), or SSIMPLUS from SSIMWAVE. Objective quality metrics exist to predict subjective scoring, but subjective comparisons are the gold standard. However, producing subjective evaluations is expensive and time-consuming, which is why objective metrics are so frequently used.
In my consulting work and writing, I prefer VMAF and SSIMPLUS over PSNR or SSIM, but that’s my idiosyncratic bias. If you’re familiar with objective metrics, you likely have your own bias. Otherwise, you should evaluate the metric based on who is using it. Obviously, Facebook wouldn’t quote PSNR/SSIM stats if it felt they were irrelevant, and PSNR hasn’t become obsolete in the 2.5 years since Netflix stopped using it to drive its impressive encoding engine.
When using objective metrics, the results are typically shown via a rate-distortion curve (see Figure 1). To produce this, you encode a file or files at multiple data rates, score the different videos, and plot the results. Figure 1 shows the average results for two 1080p files encoded at six data rates using the x265, VP9, x264, and AV1 codecs in FFmpeg 4.x.
Figure 1. A rate-distortion curve for AV1, x265, VP9, and x264
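The averaging step behind a curve like this can be sketched in a few lines of Python. This is a minimal sketch, not the workflow used to produce Figure 1: the function name, the clip names, and every score below are made-up placeholders, and it assumes each clip was encoded at the same set of data rates.

```python
from collections import defaultdict
from statistics import mean


def average_rd_curve(per_file_scores):
    """Average quality scores across clips at each data rate.

    per_file_scores maps a clip name to a list of (kbps, score) pairs,
    one pair per encode. Returns (kbps, mean_score) pairs sorted by kbps,
    which are the points you would plot for one codec.
    """
    by_rate = defaultdict(list)
    for points in per_file_scores.values():
        for kbps, score in points:
            by_rate[kbps].append(score)
    return [(kbps, mean(scores)) for kbps, scores in sorted(by_rate.items())]


# Hypothetical VMAF scores for one codec: two 1080p clips, six data rates each.
example_scores = {
    "clip_a": [(1000, 78.0), (1500, 85.0), (2000, 89.0),
               (3000, 93.0), (4000, 95.0), (5000, 96.0)],
    "clip_b": [(1000, 70.0), (1500, 79.0), (2000, 85.0),
               (3000, 91.0), (4000, 94.0), (5000, 96.0)],
}

curve = average_rd_curve(example_scores)
for kbps, score in curve:
    print(f"{kbps:>5} kbps  VMAF {score:.1f}")
```

Repeating this for each codec gives one line per codec on the chart, with data rate on the horizontal axis and the metric score on the vertical axis.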
When reviewing a rate-distortion curve, consider two things. First, are the data rates relevant to your codec usage? If a 1080p curve goes up to 20Mbps, it may be useful for live encoding, but not for VOD, where 1080p data rates for VP9, HEVC, and particularly AV1 should be 4Mbps or lower.
Second, find the quality bar for each particular metric. With VMAF, a score of 93 and higher predicts that the video is free from annoying artifacts. For PSNR, the magic number is 45 dB; with SSIM, it’s 0.95. Using this as a reference, you can gauge how much bandwidth the codec actually saves you at the quality level where you would typically seek to distribute your video.
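To make that comparison concrete, here is a minimal Python sketch that estimates the data rate at which each codec reaches a quality bar by interpolating linearly between measured points. The helper name and both curves are hypothetical, invented for illustration; they are not the data behind Figure 1.

```python
def bitrate_at_quality(curve, target):
    """Estimate the data rate (kbps) at which a curve reaches `target`.

    curve is a list of (kbps, score) pairs sorted by kbps, with scores
    rising as the data rate rises. Returns None if the curve never
    reaches the target.
    """
    if curve[0][1] >= target:
        # Already at or above the bar at the lowest tested rate;
        # the true crossing point may be lower than this.
        return float(curve[0][0])
    for (r0, q0), (r1, q1) in zip(curve, curve[1:]):
        if q0 < target <= q1:
            # Linear interpolation between the two surrounding encodes.
            return r0 + (target - q0) * (r1 - r0) / (q1 - q0)
    return None


# Hypothetical averaged VMAF curves for two codecs.
x264_curve = [(2000, 84.0), (3000, 89.0), (4000, 92.0), (5000, 94.0), (6000, 95.0)]
av1_curve = [(1000, 85.0), (1500, 90.0), (2000, 93.0), (3000, 95.0), (4000, 96.0)]

VMAF_BAR = 93  # the VMAF quality bar cited in the text

x264_rate = bitrate_at_quality(x264_curve, VMAF_BAR)
av1_rate = bitrate_at_quality(av1_curve, VMAF_BAR)
savings = 1 - av1_rate / x264_rate

print(f"x264 reaches VMAF {VMAF_BAR} at about {x264_rate:.0f} kbps")
print(f"AV1 reaches VMAF {VMAF_BAR} at about {av1_rate:.0f} kbps")
print(f"Savings at the quality bar: {savings:.0%}")
```

Linear interpolation is a simplification; with widely spaced data rates, an extra encode near the crossing point gives a more trustworthy number.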
Or you can use the BD-Rate result (Figure 2). BD-Rate, short for Bjøntegaard Delta rate, calculates the average data rate savings delivered by one codec over another at equivalent quality. It is computed from the same data shown in Figure 1 and predicts that, within the range of curves displayed in the figure, AV1 will deliver the same quality as x265 at roughly 82 percent of the data rate, the same quality as VP9 at about 69 percent of the data rate, and the same quality as x264 at roughly 50 percent of the data rate.
BD-Rate is the bottom line of codec comparison, and a great way to summarize the results.
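For readers who want to see the idea behind the number, here is a simplified Python sketch of a BD-Rate style calculation. It is an approximation under stated assumptions, not the standard implementation: the usual method fits a cubic polynomial to each curve in the log-rate domain, while this version swaps in piecewise-linear interpolation to stay dependency-free. The function names and sample points are hypothetical, and each curve is assumed to have scores that rise strictly with data rate.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation of y at x; xs must be strictly increasing."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            return ys[i] + (x - xs[i]) * (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    raise ValueError("x is outside the curve")


def _mean_log_rate(curve, q_lo, q_hi, steps=1000):
    """Average log(data rate) over the quality range, via the trapezoid rule."""
    qualities = [q for _, q in curve]
    log_rates = [math.log(r) for r, _ in curve]
    h = (q_hi - q_lo) / steps
    total = 0.0
    for i in range(steps + 1):
        q = min(q_lo + i * h, q_hi)  # guard against floating-point overshoot
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * _interp(q, qualities, log_rates)
    return total * h / (q_hi - q_lo)


def bd_rate(reference, test):
    """Approximate average data rate change (%) of `test` versus `reference`
    at equal quality. Negative values mean `test` needs fewer bits.

    Both curves are lists of (kbps, score) pairs sorted by kbps.
    """
    # Only compare over the quality range that both curves cover.
    q_lo = max(reference[0][1], test[0][1])
    q_hi = min(reference[-1][1], test[-1][1])
    if q_hi <= q_lo:
        raise ValueError("curves do not overlap in quality")
    diff = _mean_log_rate(test, q_lo, q_hi) - _mean_log_rate(reference, q_lo, q_hi)
    return (math.exp(diff) - 1) * 100


# Hypothetical curves: the test codec hits every score at half the data rate,
# so the result should be about -50 percent by construction.
reference_curve = [(1000, 80.0), (2000, 88.0), (4000, 93.0), (8000, 96.0)]
test_curve = [(rate / 2, score) for rate, score in reference_curve]

print(f"BD-Rate: {bd_rate(reference_curve, test_curve):.1f}%")
```

Read the sign carefully: a result of -18 percent means the tested codec needs 18 percent fewer bits, which is the same as saying it delivers equivalent quality at 82 percent of the data rate.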
As video quality measurement has become more important than ever, I hope this backgrounder offers you some guidance in what metrics to use, and when to use them.