I was reminded of this recently as I prepared for a talk on AV1 readiness at the upcoming United Cloud Tech Talk. Though quality is only a single factor, I wanted to nail the quality comparisons for the talk, but didn’t have time to produce all the iterations myself. It’s been a while since I benchmarked H.264, HEVC, AV1, and VVC, so I scouted around for more recent reviews.
Pulling together the disparate data points highlighted the challenge of coming up with a single metric to define AV1 quality, and the concept that became the title of this article.
FFmpeg Codecs
The first results came from an article titled Performance Comparison of VVC, AV1, HEVC, and AVC for High Resolutions. Here the authors compared FFmpeg codecs, including x264, x265, SVT-AV1, and VVenC, the Fraunhofer implementation of VVC, which the authors added to FFmpeg via a patch. These results are probably the most relevant to the majority of readers because most publishers use these FFmpeg codecs rather than licensable codecs that might deliver better quality. This is true even though x265, VVenC, and SVT-AV1 are middle-of-the-road performance-wise, as you'll see in a moment.
The authors did a nice job of testing a number of clips at various resolutions, which I compiled into the averaged PSNR results shown in Table 1, with x265 as the 100% baseline. By these numbers, using x264 instead of x265 costs you 70.7% more bandwidth, while AV1 and VVC save you about 25% and 53%, respectively.
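To make the arithmetic concrete, here's a minimal Python sketch of the computation behind those percentages: express each codec's bitrate at equal quality as a percentage of the x265 baseline. The bitrates below are hypothetical placeholders chosen to reproduce the figures cited above, not the study's actual data.

```python
# Express each codec's bitrate at equal quality as a percentage of the
# x265 baseline. The bitrates are hypothetical placeholders chosen to
# reproduce the percentages cited above, not the study's actual data.
bitrates_kbps = {
    "x264": 5120,      # needs more bits than x265 for the same PSNR
    "x265": 3000,      # baseline = 100%
    "SVT-AV1": 2250,
    "VVenC": 1410,
}

baseline = bitrates_kbps["x265"]
for codec, kbps in bitrates_kbps.items():
    pct = 100 * kbps / baseline
    print(f"{codec:8s} {pct:6.1f}% of x265 ({pct - 100:+.1f}% bandwidth)")
```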
Digging a bit deeper into the command strings, the researchers used the default preset for all encodes, which is defensible, if perhaps not relevant for producers, who tend to use higher-quality presets. In general, I've found that if your videos will be viewed more than 400–500 times, it pays to use a higher-quality preset: you can reduce the bitrate while maintaining the same quality, offsetting the extra encoding cost with reduced bandwidth expenses. This is why I tend to compare codec implementations using higher-quality presets, though again, so long as you're consistent, the results are defensible.
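Here's a back-of-the-envelope Python sketch of that break-even point. Every figure is a hypothetical assumption, not measured data; the point is the structure of the trade-off, not the specific numbers.

```python
# Break-even view count for a slower, higher-quality preset.
# Every figure below is a hypothetical assumption for illustration.
extra_encode_cost = 0.20   # extra $ to encode the title with the slower preset
gb_per_view = 1.00         # GB delivered per view with the faster preset
bitrate_savings = 0.10     # slower preset cuts bitrate ~10% at the same quality
cdn_cost_per_gb = 0.005    # $ per GB delivered

savings_per_view = gb_per_view * bitrate_savings * cdn_cost_per_gb
print(f"Break-even at ~{extra_encode_cost / savings_per_view:.0f} views")  # ~400
```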
Less defensible is the decision not to tune for metrics during encoding, which the authors didn't discuss. Tuning is one of the most vexing decisions involved in codec-implementation comparisons. Briefly, when you tune for metrics, you disable configuration options that enhance the visual appearance of the video at the expense of metric scores.
For example, Adaptive Quantization (AQ) is designed to make the video more pleasing to the human eye but introduces more "differences" between the source and the encoded file, which metrics typically count as errors. So, even though AQ might make an x264-encoded file look better to human eyes, it might reduce the VMAF score by 2-3 points. When you tune for PSNR or SSIM with x264, you disable the psychovisual adjustments that reduce the metric score.
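With x264 in FFmpeg, this is a one-flag switch. Here's a minimal Python sketch (file names and rate settings are placeholders) that produces one encode for viewing and one tuned for PSNR measurement; as I understand it, -tune psnr turns off psy-rd and adaptive quantization so the encode scores its best on PSNR.

```python
import subprocess

# Encode the same source twice: once for viewing, once tuned for PSNR.
# "-tune psnr" disables x264's psychovisual options (psy-rd, adaptive
# quantization) that raise perceived quality but depress PSNR/SSIM.
SRC = "source.mp4"  # placeholder file name

for tune, out in [(None, "viewing.mp4"), ("psnr", "metrics.mp4")]:
    cmd = ["ffmpeg", "-y", "-i", SRC, "-c:v", "libx264",
           "-preset", "slow", "-crf", "23"]
    if tune:
        cmd += ["-tune", tune]
    subprocess.run(cmd + [out], check=True)
```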
To identify the proper tuning strategy for a multi-codec implementation comparison, you have to know which adjustments each codec enables by default. If they are enabled, you select a tuning mechanism or disable them manually; if they aren't enabled by default, you don't tune. As you can imagine, this adds a ton of complexity to an already complex analysis. I've gone both ways on this one: tuning for metrics in this December 2020 article and not tuning for metrics a year later, which the Fraunhofer team understandably took issue with.
Having wavered in the past, I don’t have a great recommendation here. I will note that in the past, I asked Netflix about this, and they responded, “Since VMAF partially captures the benefit of perceptual optimization, and if at the end of the day you will be encoding with these settings on, we still recommend turning them on.” Note that this recommendation is counter to what most codec developers advise.
Hopefully, at some point in the future, VMAF, or another AI-based metric, will become so aligned with human scores that tuning won't be necessary. But we're not there yet. Still, since the authors chose not to tune in their analysis, you'd expect different results from studies that do tune, like the Moscow State University analysis we address next.
The Moscow State University Analysis
Not only did MSU tune, but it also didn't use the default presets in most cases. So, as you would expect, this produced significantly different results for the FFmpeg codecs. However, the biggest differences relate to the fact that MSU tested multiple codecs that aren't included with FFmpeg and must be licensed. Given the improved efficiency, licensing makes perfect sense for publishers in the top 1% of the market pyramid.
I'll briefly discuss two MSU reports. In the first, the MSU Video Codecs Comparison 2023-2024: Part I, FullHD, Moscow State analyzed x264, x265, SVT-AV1, VVenC, and multiple other codecs. One chart of the many available with the report shows VMAF scores for the slow (1 fps) encodes most relevant to VOD producers.
Table 2 shows some of these numbers plugged into the previous table, again with x265 as the 100% baseline, though the most efficient HEVC codec, from Tencent, drops the bitrate required for HEVC by a further 57%. You see that the penalty for using x264 is much higher than in the previous analysis: 246% compared to 170.7%.
Though SVT-AV1 slots in pretty similarly, the most efficient AV1 codec increases the savings delivered by AV1 by an additional 36%. MSU found VVenC considerably less efficient than the first study did, but the Tencent version of VVC again delivered substantial additional savings, cutting the bitrate required for VVC by another 16%.
The MSU numbers make two clear points. First, they confirm the title of this article: there are no codec comparisons, only codec-implementation comparisons. Second, if you're spending heavily on bandwidth for HEVC and AV1 video, you should check out the MSU report and perhaps get in touch with Tencent ([email protected]).
The final numbers come from the MSU Video Codecs Comparison 2023-2024: Part III, FPGA. These are the technologies you would consider for live transcoding and some high-volume VOD transcoding of UGC and similar non-premium content. In fairness to MSU, I'll forgo pasting in a table and just input some numbers into Table 3, representing VMAF results from 60 fps testing of the most prominent brands that are generally available for purchase worldwide.
Again, MSU benchmarked against the x265 codec, but here the savings from HEVC and AV1 are nowhere near those of their software counterparts, and VVC isn't represented at all (yet). The penalty for continuing to use the H.264 codec is nowhere near as significant as for software transcoding, reflecting that H.264 silicon has had many more years to mature performance-wise.
Note that for each report, MSU tests at multiple transcoding speeds and uses many variants of SSIM, PSNR, and VMAF, and sometimes subjective comparisons, to gauge the results. These introduce more variables to consider when choosing the ideal hardware or software tool for your encoding job.
Getting back to where I started, there is no one-size-fits-all comparison that identifies the savings that any new codec can deliver. Not only is there the hardware/software differential, but also the variance in use cases, codecs, and quality metrics. Not to mention that the bandwidth savings that codecs deliver vary with whether you're distributing a single file or an encoding ladder, and with the distribution pattern of the encoding ladder itself. If you're delivering the top rung to 99% of your viewers, you'll harvest most of the savings; if you're distributing primarily from the middle rungs of the ladder, your QoE may increase, but the bandwidth savings will be much less. The sketch below illustrates the point.
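Here's a hedged Python sketch of that ladder effect. The ladder, per-rung savings, and viewing shares are all made-up assumptions, and for simplicity it ignores the fact that rungs differ in absolute bitrate; it only shows how the same codec can yield very different net savings under different viewing distributions.

```python
# Net bandwidth savings for the same hypothetical codec upgrade under two
# viewing distributions. All numbers are made-up assumptions; for
# simplicity, rungs are not weighted by their absolute bitrates.
rung_savings = {"1080p": 0.40, "720p": 0.20, "480p": 0.10}

distributions = {
    "top-heavy":    {"1080p": 0.99, "720p": 0.007, "480p": 0.003},
    "middle-heavy": {"1080p": 0.20, "720p": 0.60,  "480p": 0.20},
}

for name, shares in distributions.items():
    net = sum(share * rung_savings[rung] for rung, share in shares.items())
    print(f"{name:12s} net bandwidth savings: {net:.1%}")
```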
For the purposes of this article, the key point is that unless a study's testing schema and codec selection align perfectly with yours, the results aren't likely to accurately predict the actual savings any new codec will deliver. It's highly unlikely that any single codec comparison will be relevant to your scenario, though the various schemas and metrics that MSU deploys make its tests extremely useful. In most cases, you're just going to have to bring the codecs in-house and run your own tests.
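If you do, FFmpeg's libvmaf filter (assuming your build includes it) is the usual starting point for scoring your own encodes. A minimal Python sketch with placeholder file names:

```python
import subprocess

# Score an encode against its source with FFmpeg's libvmaf filter.
# Requires an FFmpeg build compiled with libvmaf; file names are
# placeholders. The first input is the distorted file, the second the
# reference; the VMAF score appears in FFmpeg's log output.
cmd = [
    "ffmpeg", "-i", "encode.mp4", "-i", "source.mp4",
    "-lavfi", "libvmaf", "-f", "null", "-",
]
subprocess.run(cmd, check=True)
```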