Tuning for Metrics: What About VMAF and VP9?

If you’re comparing codecs with video quality metrics, you should consider tuning for the metric you plan to use. However, x264 and x265 don’t have a VMAF tuning option. In my tests, tuning for PSNR appears to be the best option and one you should strongly consider. VP9 adds a complication: tuning for PSNR doesn’t appear to work.

I was working on some VP9-related additions to my Streaming 101 course and decided to run some quality comparisons between VP9, x264, and x265. Since I planned to use VMAF as the quality metric, the issue of “tuning” came to mind.

By way of background, when you encode video with different codecs and plan to assess quality via video quality metrics, you should tune for metrics, typically via a switch like -tune psnr in FFmpeg. According to the documentation provided by various codec developers, tuning doesn’t activate encoding techniques that improve the score; it disables techniques that can decrease the score, a subtle but critical difference.

As I explained here:

“Older metrics like Mean Square Error (MSE) and Peak Signal to Noise Ratio (PSNR) work by analyzing the differences between pixels in the source and the compressed file. The more differences, the lower the score.
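To make the "more differences, lower score" point concrete, here is a minimal Python sketch of MSE and PSNR for 8-bit samples. The pixel values are made up for illustration, and real tools compute these per frame and per plane, so treat this as a sketch of the arithmetic rather than how any particular tool implements it.

```python
import math

def mse(reference, distorted):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(reference) != len(distorted):
        raise ValueError("frames must have the same number of pixels")
    return sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)

def psnr(reference, distorted, max_value=255):
    """PSNR in dB for samples with the given peak value; infinite if identical."""
    error = mse(reference, distorted)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / error)

# Hypothetical 8-bit luma samples, invented for this example.
source = [100, 120, 140, 160]
mild = [101, 119, 141, 159]    # every pixel off by 1
strong = [104, 116, 144, 156]  # every pixel off by 4

print(round(psnr(source, mild), 2))    # 48.13
print(round(psnr(source, strong), 2))  # 36.09
```

The clip with larger pixel differences scores about 12 dB lower, regardless of whether a viewer would notice or prefer the change.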

Codec vendors don’t care about scores; they care about subjective quality. Accordingly, there are multiple classes of techniques that improve visual quality but can degrade metric scores because they increase the “differences” between the source and compressed file. One example is adaptive quantization (AQ). In his book, Decode to Encode, author Avinash Ramachandran explains, “As we know, our eyes are more sensitive to flatter areas in the scene and are less sensitive to areas with fine details and higher textures. AQ algorithms leverage this to increase the quant offset in higher textured areas and decrease it in flatter areas. Thus, more bits are given to areas where the eyes are sensitive to visual quality impacts.” These types of allocations improve perceived visual quality but also increase differences and lower metric scores.

When you use -tune psnr in FFmpeg, you set adaptive quantization to mode 0, which disables AQ, and also disable psychovisual optimization. Again, these optimizations improve perceived quality, but since they increase the differences between the source and the compressed video, they lower the PSNR score.
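As a rough illustration for x264, the Python sketch below builds two FFmpeg argument lists: one using -tune psnr, and one that spells out the same two changes through -x264-params. The file names and the 4M bitrate are placeholders, and the explicit aq-mode=0:psy=0 string is my reading of what the x264 PSNR tune does, so verify it against your own build before relying on it.

```python
def x264_args(source, output, bitrate="4M", tune=None, x264_params=None):
    """Build (but do not run) an FFmpeg argument list for a libx264 encode."""
    args = ["ffmpeg", "-y", "-i", source, "-c:v", "libx264", "-b:v", bitrate]
    if tune:
        args += ["-tune", tune]
    if x264_params:
        args += ["-x264-params", x264_params]
    return args + [output]

# Placeholder file names; substitute your own source and outputs.
tuned = x264_args("source.mp4", "x264_psnr.mp4", tune="psnr")

# Assumed manual equivalent: AQ off and psychovisual optimizations off.
explicit = x264_args("source.mp4", "x264_manual.mp4",
                     x264_params="aq-mode=0:psy=0")

print(" ".join(tuned))
print(" ".join(explicit))
```

Pass either list to subprocess.run to perform the encode. Comparing the two output files is a quick way to check whether the manual settings match the tune on your build.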

All this is straightforward when using a codec with working tuning mechanisms for the metric you intend to use. At this point, none of the codecs I was testing can tune for VMAF. At one level, this makes sense, because VMAF is supposed to assess video quality more similarly to subjective viewers than PSNR or SSIM.

Still, I decided to run some tests to identify the optimal strategy. With all three codecs, I encoded a baseline file at 4 Mbps without tuning and then two other files using identical parameters except tuned for PSNR and SSIM. I posted the scores to a table and marked the highest score as green and the lowest as orange.
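The test matrix can be sketched in Python as nine FFmpeg argument lists: three encoders, each untuned, tuned for PSNR, and tuned for SSIM. This is an assumption-laden outline, not my exact command strings: the source file name, the output naming, and the single-pass -b:v 4M setting are placeholders, and the presets and other parameters you would normally add are omitted.

```python
from itertools import product

# FFmpeg encoder name and a suitable container extension for each codec.
ENCODERS = {
    "x264": ("libx264", "mp4"),
    "x265": ("libx265", "mp4"),
    "vp9": ("libvpx-vp9", "webm"),
}
TUNINGS = [None, "psnr", "ssim"]  # None is the untuned baseline

def build_matrix(source, bitrate="4M"):
    """Return {label: ffmpeg argument list} for every codec/tuning pair."""
    jobs = {}
    for (name, (encoder, ext)), tune in product(ENCODERS.items(), TUNINGS):
        label = f"{name}_{tune or 'untuned'}"
        args = ["ffmpeg", "-y", "-i", source, "-c:v", encoder, "-b:v", bitrate]
        if tune:
            # The only difference from the baseline is the tune switch.
            args += ["-tune", tune]
        jobs[label] = args + [f"{label}.{ext}"]
    return jobs

for label, args in build_matrix("source.mp4").items():
    print(label, "->", " ".join(args))
```

Keeping everything but the -tune switch identical is the point of the design: any score difference between the files for one codec can then be attributed to tuning.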

The first table shows x264. Starting on the right, the file tuned for PSNR earned the highest PSNR score, and the file tuned for SSIM earned the highest SSIM score. As mentioned, x264 doesn’t have a tuning mechanism for VMAF, but tuning for PSNR produced the highest VMAF score, by as much as about 2 VMAF points. In both cases, tuning for SSIM improved the VMAF score over the baseline file.

The next table shows x265 and a similar pattern, though the tuned vs. non-tuned delta in the VMAF score was much less than with x264.

The final table shows VP9. Here we see that tuning for PSNR produced the same scores as the baseline file, which wasn’t really a surprise given that the two files had the exact same size. So, tuning for PSNR doesn’t appear to work. While tuning for SSIM did produce the best SSIM score, VMAF ratings for the SSIM-tuned clip were significantly lower than for the untuned file.

In addition, the VMAF delta between the SSIM-tuned and untuned clips was much higher than with x265, though not as large as with x264.

I checked to see if I could manually disable adaptive quantization for VP9, but the FFmpeg help files showed AQ disabled by default.
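If you want to rule AQ out explicitly rather than rely on the default, a sketch like the following adds -aq-mode 0 to a libvpx-vp9 command. The file names and bitrate are placeholders, and I'm assuming your FFmpeg build exposes -aq-mode for this encoder; running ffmpeg -h encoder=libvpx-vp9 will confirm the option and its default.

```python
def vp9_args(source, output, bitrate="4M", aq_mode=None):
    """Build (but do not run) an FFmpeg argument list for a libvpx-vp9 encode."""
    args = ["ffmpeg", "-y", "-i", source, "-c:v", "libvpx-vp9", "-b:v", bitrate]
    if aq_mode is not None:
        # 0 is assumed to mean "no adaptive quantization" for this encoder.
        args += ["-aq-mode", str(aq_mode)]
    return args + [output]

# Placeholder file names for a default encode and an explicit AQ-off encode.
default_cmd = vp9_args("source.mp4", "vp9_default.webm")
aq_off_cmd = vp9_args("source.mp4", "vp9_aq0.webm", aq_mode=0)

print(" ".join(default_cmd))
print(" ".join(aq_off_cmd))
```

If AQ really is off by default, the two encodes should produce matching files and scores.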

Between VP9 and x265, the difference between tuning and not tuning was very minor, about 0.12 VMAF points on average between the two clips. Between x265 and x264, however, tuning reduced x265’s advantage from an average of 4.39 VMAF points to an average of 2.72, which will have a significant impact on the RD curves and BD-Rate computations. So, when measuring with VMAF, it appears appropriate to tune for PSNR with x264 and x265, and to tune for PSNR with VP9, though this will deliver the exact same file as not tuning (hey, at least you did the right thing in the command string).
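To put the x265 vs. x264 change in perspective, this short Python calculation uses only the two averages quoted above; the function name is just for illustration.

```python
def advantage_shift(untuned, tuned):
    """Return the absolute and relative drop in a codec's VMAF advantage."""
    reduction = untuned - tuned
    return reduction, reduction / untuned

# Average x265 advantage over x264, in VMAF points, from the text above.
reduction, relative = advantage_shift(untuned=4.39, tuned=2.72)

print(f"{reduction:.2f} VMAF points ({relative:.0%})")  # 1.67 VMAF points (38%)
```

In other words, tuning both codecs for PSNR removes more than a third of x265’s apparent VMAF lead in these tests.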

When testing VP9 with VMAF or PSNR, you should definitely not tune for SSIM, as that will meaningfully reduce the scores for both metrics.

I’ll post the overall results of my study in a couple of weeks.

Anyone with other data or thoughts is welcome to leave a comment, or contact me at janozer@gmail.com.

These findings will ultimately make their way into Streaming 101: Technical Onboarding for Streaming Media Professionals and Computing and Using Video Quality Metrics.

About Jan Ozer

I help companies train new technical hires in streaming media-related positions; I also help companies optimize their codec selections and encoding stacks, and evaluate new encoders and codecs.
