Just a quick note to let you know about some recent findings relating to the Netflix VMAF metric.
By way of background, VMAF launched in June 2016 as the metric that powered Netflix’s per-title encoding engine, replacing PSNR. The fact that Netflix created and used VMAF gave the metric tremendous credibility, and because Netflix open-sourced VMAF, it was quickly added to many tools, including the Moscow State University Video Quality Measurement Tool, the Hybrik Media Analyzer, and FFmpeg.
Beyond this, VMAF ratings correspond to predicted subjective evaluations, so a score of 80 – 100 predicts an excellent rating. Unlike PSNR, where scores typically range from 30 – 45 dB, and particularly SSIM, where scores range from 0 – 1, VMAF scores range from 0 – 100, which provides useful differentiation when you’re working with multiple files of different resolutions in an encoding ladder.
Other highly useful features add valuable context to VMAF scores. For example, a difference of 6 VMAF points is expected to equal a Just Noticeable Difference, which is “the minimum level of stimulation that a person can detect 50 percent of the time.” An independent study found that a VMAF score of 93 accurately predicts that the content is “either indistinguishable from original or with noticeable but not annoying distortion.” Unlike PSNR, SSIM, and other older metrics, VMAF is available in default, phone, and 4K models, so you can gauge quality on a range of devices.
For these reasons, I use VMAF frequently in my writings and consulting practice. As you would expect, I cover VMAF extensively in my online course on computing and using video quality metrics, including how to compute VMAF with FFmpeg and the tools mentioned above, plus a tool open-sourced by Netflix.
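If you haven’t computed VMAF before, here’s a minimal sketch of what the FFmpeg version looks like; the file names are placeholders, and depending on your FFmpeg build you may need to specify the VMAF model path or reverse the input order:

ffmpeg -i distorted.mp4 -i reference.mp4 -lavfi "[0:v][1:v]libvmaf=log_fmt=json:log_path=vmaf.json" -f null -

In the builds I’ve used, the first input is the encoded (distorted) file, the second is the source, and the per-frame and pooled scores are written to vmaf.json.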
Hacking VMAF
What’s new with VMAF? Well, it turns out that it’s pretty easy to hack with simple contrast adjustments. The first clue came from a paper by several staff members of the Moscow State University graphics department, which produced the aforementioned Video Quality Measurement Tool and many codec comparisons.
In the white paper, which you can access here, the researchers tested how different values of unsharp mask and histogram equalization impacted VMAF and SSIM scores. In conclusion, the authors state: “In this paper, we describe video color and contrast transformations which increase the VMAF score while keeping the SSIM score the same or better. The possibility to improve a full-reference metric score after adding any transformations to the distorted image means that the metric can be cheated in some cases.” As the title of the article suggested, they concluded that VMAF could be “hacked.”
I verified this in a review of the iSize BitSave Video Processing technology published in Streaming Media Magazine. Using a simple FFmpeg script that adjusted contrast and unsharp mask, but no other parameters, I boosted the VMAF score of two files significantly, as you see in the Table, though I wasn’t able to match MSU’s feat of keeping SSIM the same.
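To give you a sense of the kind of filter chain involved, here’s a rough sketch; the contrast and unsharp values are illustrative only, not the settings from the review:

ffmpeg -i source.mp4 -vf "eq=contrast=1.1,unsharp=5:5:1.0" -c:v libx264 -crf 23 adjusted.mp4

Score both the adjusted and unadjusted encodes against source.mp4 and compare the VMAF and SSIM numbers.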
Table. Simple contrast adjustments boosted VMAF significantly, but reduced quality according to SSIM.
What’s This Mean?
What’s this mean to you? If you’re using VMAF to evaluate codecs and encoding parameters in your own encoding facility, very little. If you’re evaluating third-party products, particularly “pre-processing” products like iSize’s BitSave that claim to significantly improve compression performance, you should view any claims of significant VMAF boosts with skepticism. I ultimately concluded that iSize wasn’t hacking, but any time you see significant changes in contrast in your before-and-after comparisons, hacking should come to mind.
You can see this if you download this PDF and scroll through the frames. There are three frames for each clip: the first is the clip I produced with FFmpeg and minor contrast adjustments, then the clip encoded without any adjustment (the baseline), and then the clip produced via iSize technology. It’s easiest to see if you load the PDF into Acrobat or any PDF viewer that lets you view a single frame at a time and switch back and forth between the baseline and the other two frames. You’ll see several instances where the contrast is noticeably improved, which resulted in significantly higher VMAF scores. The video looks better, sure, but you could have/should have achieved the same impact by optimizing contrast before encoding.
If you’re testing a preprocessor, try this approach. Encode your video with very aggressive encoding parameters so that artifacts are present. Then test the pre-processor at the same parameters. If the artifacts are gone, the pre-processor has improved your compression performance. If the artifacts are still present but the contrast is improved, there may be some hacking going on.
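In command-line terms, the test might look something like this; the file names and the 500 kbps rate are placeholders, so pick parameters aggressive enough to produce visible artifacts in your own content:

# encode aggressively enough to force artifacts
ffmpeg -i source.mp4 -c:v libx264 -b:v 500k -an baseline.mp4
# score the result against the source with both metrics
ffmpeg -i baseline.mp4 -i source.mp4 -lavfi "[0:v][1:v]libvmaf" -f null -
ffmpeg -i baseline.mp4 -i source.mp4 -lavfi "[0:v][1:v]ssim" -f null -

Run the same encode through the pre-processor, score it the same way, and compare both the numbers and the visible artifacts. A big VMAF jump with the same artifacts but higher contrast is the pattern to watch for.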
If you’re interested in learning more about computing and using VMAF and other video quality metrics, click the course image below, or click here.
“The video looks better, sure, but you could have/should have achieved the same impact by optimizing contrast before encoding.”
Doesn’t this realization contradict the claim that VMAF can be hacked? VMAF measures perceptual quality, which cannot be assessed by SSIM, so it’s not necessary to observe the same trends between the two metrics. An experiment you could add to your article would be to conduct a crowd-sourced MOS survey (e.g., through Amazon Mechanical Turk) to illuminate whether VMAF increases in line with MOS for those videos. If VMAF aligns with MOS but SSIM doesn’t, it means it’s not hacking; or at least it means that human perception of visual quality is hackable, which is something video encoding should use.
Good point, and understood, and that’s why I gauged BitSave as a valid technology. However, as I showed with the table, there are times when increasing contrast darkens the video and makes it look noticeably worse, even though the VMAF score is improved.
And yes, subjective observations are the gold standard, which is why I say in my Streaming Media article, “After many hours of testing, I found that BitSave’s technology is valid and valuable, though the proof of the pudding will be how it performs in subjective testing with your test clips.” Subjective evaluations of the BitSave clips would have been great, but that was outside the time and expense budget for the review.
Hi Jan,
This comparison shows that SSIM mostly reacts to structural distortions (like missing important contours or textures), while VMAF also reacts to color (histogram equalization) and frequency content (unsharp mask). So, if the pre-processing does not change the structure of a picture, but only the color and some frequency components, then SSIM will not react sharply, but VMAF will. Enhancing contrast almost always improves subjective opinion, provided that no other artifacts are added by the encoder. In these experiments we see evidence that VMAF reacts better than SSIM. If a pre-processor or encoder wipes out some contours and/or textures, SSIM will certainly react right away.
I also see that in the codec community there is still a misunderstanding about the role and place of pre- and post-processing. I think their role is heavily underestimated. The guys from Moscow are heading in the right direction.
I also believe that visual quality is a multi-dimensional thing, and very soon we will see work dedicated to vector quality measures.
Vadim