Just a quick note to let you know about some recent findings relating to the Netflix VMAF metric.
By way of background, VMAF launched in June 2016 as the metric that powered Netflix’s per-title encoding engine, replacing PSNR. The fact that Netflix created and used VMAF gave the metric tremendous credibility, and because Netflix open-sourced VMAF, it was quickly added to many tools, including the Moscow State University Video Quality Measurement Tool, the Hybrik Media Analyzer, and FFmpeg.
Beyond this, VMAF scores correspond with predicted subjective evaluations, so a score of 80 – 100 predicts an excellent rating. Unlike PSNR, where scores typically range from 30 – 45 dB, and particularly SSIM, where scores range from 0 – 1, VMAF scores range from 0 – 100, providing useful differentiation when working with multiple files of different resolutions in an encoding ladder.
Other highly useful features add valuable context to VMAF scores. For example, a difference of 6 VMAF points is expected to equal a Just Noticeable Difference, which is “the minimum level of stimulation that a person can detect 50 percent of the time.” An independent study found that a VMAF score of 93 accurately predicts that the content is “either indistinguishable from original or with noticeable but not annoying distortion.” Unlike PSNR, SSIM, and other older metrics, VMAF is available in default, phone, and 4K models, so you can gauge quality on a range of devices.
For these reasons, I use VMAF frequently in my writings and consulting practice. As you would expect, I cover VMAF extensively in my online course on computing and using video quality metrics, including how to compute VMAF with FFmpeg and the tools mentioned above, plus a tool open-sourced by Netflix.
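If you want to try computing VMAF with FFmpeg yourself, the basic shape of the command looks something like this. This is a sketch, not a command from the course: the file names are placeholders, and it assumes a recent FFmpeg build compiled with libvmaf support.

```shell
# Compute VMAF with FFmpeg's libvmaf filter (requires a build with --enable-libvmaf).
# By convention, the distorted (encoded) file is the first input and the
# reference (source) is the second. File names here are placeholders.
ffmpeg -i distorted.mp4 -i reference.mp4 \
  -lavfi "[0:v][1:v]libvmaf=log_fmt=json:log_path=vmaf.json" \
  -f null -
```

The pooled VMAF score is printed to the console and written, per frame, to the JSON log. Note that libvmaf option names have changed across FFmpeg versions, so check the filter documentation for your build.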
What’s new with VMAF? Well, it turns out that it’s pretty easy to hack with simple contrast adjustments. The first clue came from a paper by several staff members of the Moscow State University graphics department, which produced the aforementioned Video Quality Measurement Tool and many codec comparisons.
In the white paper, which you can access here, the researchers tested how different values of unsharp mask and histogram equalization affected VMAF and SSIM scores. In conclusion, the authors state: “In this paper, we describe video color and contrast transformations which increase the VMAF score while keeping the SSIM score the same or better. The possibility to improve a full-reference metric score after adding any transformations to the distorted image means that the metric can be cheated in some cases.” As the title of the article suggests, they concluded that VMAF could be “hacked.”
I verified this in a review of the iSize BitSave Video Processing technology published in Streaming Media Magazine. Using a simple FFmpeg script that adjusted contrast and unsharp mask, but no other parameters, I boosted the VMAF score of two files significantly, as you can see in the table, though I wasn’t able to match MSU’s feat of keeping SSIM the same.
Table. Simple contrast adjustments boosted VMAF significantly, but reduced quality according to SSIM.
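For context, a contrast-and-sharpen pass of this kind looks something like the following. This is a sketch, not the exact script from the review; the filter values shown are illustrative assumptions.

```shell
# Boost contrast and apply an unsharp mask before encoding, touching nothing else.
# The eq and unsharp filter values here are illustrative, not the ones from the review.
ffmpeg -i source.mp4 \
  -vf "eq=contrast=1.1,unsharp=5:5:0.8" \
  -c:v libx264 -crf 23 -an adjusted.mp4
```

Scoring adjusted.mp4 against the original source with VMAF will typically reward the added contrast and sharpening, even when SSIM says quality dropped, which is the entire hack.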
What’s This Mean?
What’s this mean to you? If you’re using VMAF to evaluate codecs and encoding parameters in your own encoding facility, very little. If you’re evaluating third-party products, particularly “pre-processing” products like iSize’s BitSave that claim to significantly improve compression performance, you should view any claims of significant VMAF boosts with skepticism. I ultimately concluded that iSize wasn’t hacking, but any time you see significant changes in contrast in your before-and-after comparisons, hacking should come to mind.
You can see this if you download this PDF and scroll through the frames. There are three frames for each clip: the first is the clip I produced with FFmpeg and minor contrast adjustments, the second is the clip encoded without any adjustment (the baseline), and the third is the clip produced via iSize’s technology. It’s easiest to see if you load the PDF into Acrobat or any PDF viewer that displays a single page at a time and switch back and forth between the baseline and the other two frames. You’ll see several instances where the contrast is noticeably improved, which resulted in significantly higher VMAF scores. The video looks better, sure, but you could have (and should have) achieved the same effect by optimizing contrast before encoding.
If you’re testing a preprocessor, try this approach. Encode your video with very aggressive encoding parameters so that artifacts are present. Then test the preprocessor at the same parameters. If the artifacts are gone, the preprocessor has genuinely improved your compression performance. If the artifacts are still present but the contrast is improved, there may be some hacking going on.
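As a sketch, the aggressive baseline encode in that test might look like this; the bitrate and file names are placeholders, and you should pick a rate low enough that artifacts are clearly visible in your content.

```shell
# Encode at an aggressively low bitrate so compression artifacts appear.
# Run the identical command on the preprocessor's output, then compare
# the two encodes visually: vanished artifacts mean real gains; same
# artifacts with punchier contrast suggests metric hacking.
ffmpeg -i source.mp4 -c:v libx264 \
  -b:v 500k -maxrate 500k -bufsize 1000k \
  -an baseline_aggressive.mp4
```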
If you’re interested in learning more about computing and using VMAF and other video quality metrics, click the course image below, or click here.