Best Practices for Netflix’s VMAF Metric

By Jan Ozer, November 9, 2018

Zhi Li, senior software engineer at Netflix, recently co-wrote a paper entitled VMAF: The Journey Continues posted on the Netflix Technology Blog on Medium. After reviewing why VMAF was developed, how it’s been adopted in the industry, and some recent improvements, the paper discusses some best practices for using VMAF. In particular, this last section will be valuable to anyone who uses this quality metric.

As detailed in the blog post, VMAF combines human vision modeling with machine learning and was developed in collaboration with several university professors. In June 2016, Netflix open-sourced the technology and published it on GitHub.

Since 2016, VMAF has been integrated into multiple video quality measurement tools, including products from Moscow State University and Elecard, as well as FFmpeg, though it’s only been tested for Mac and Linux distributions, not Windows. Netflix uses VMAF to evaluate codecs, to power encoding decisions throughout their production pipeline, and for A/B experimentation.

Since launching VMAF, Netflix has optimized operating speed and introduced a frame-skipping feature that enables the metric to be computed on one of every N frames. As the article states, “this is the first time that VMAF can be computed in real time, even in 4K, albeit with a slight accuracy loss.” Netflix has also improved VMAF’s accuracy by improving the elementary metrics and machine learning model, and by broadening the training set.

One deficit in the first VMAF release was a one-size-fits-all approach that assumed that all viewers watched a 1080p display in a living room-like environment. Since then, Netflix released a phone model and more recently a 4K model. These models could enable organizations to make informed decisions to save bandwidth without significantly impacting perceived quality and QoE. For example, mobile viewers perceive less of a difference between 720p and 1080p videos, so in some cases, it might make sense to limit the adaptive group distributed to mobile phones to 720p.

The figure above is the Result Plot from the Moscow State University Video Quality Measurement Tool showing the default (green) and phone (red) VMAF score for the same 720p video. For 1080p viewing in a living room, the default score of 88.6 might be considered too low, suggesting that a 1080p file would improve QoE. However, the average score of 99.28 for the 720p file in the phone model suggests that a 1080p version of the file likely wouldn’t improve perceived quality on a mobile phone. In this instance, it might make sense to remove the 1080p file from the manifest file retrieved by mobile phones.

With this as background, let’s discuss the most significant best practices discussed in the Netflix article.

VMAF Best Practices

First, the article discusses how to interpret VMAF scores, or more specifically how to map them to predicted subjective ratings from real viewers. To explain, while we know that a score of 70 indicates higher quality than a score of 60 (higher is always better), how would an actual viewer rate that video with a score of 70? The article explains:

Viewers voted the video quality on the scale of “bad,” “poor,” “fair,” “good,” and “excellent,” and roughly speaking, “bad” is mapped to the VMAF scale 20 and “excellent” to 100. Thus, a VMAF score of 70 can be interpreted as a vote between “good” and “fair” by an average viewer under the 1080p and 3H condition. [Editor’s Note: 3H specifies a distance from the screen that is 3 times the height of the screen.]

In most cases, compressionists seeking to make an encoding decision care more about how a real viewer will rate the video than a single objective number. For this reason, the ability to tie VMAF scores to predicted subjective ratings adds a lot of utility.
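
As a rough illustration of that mapping, here is a minimal Python sketch. The article only gives the two endpoints (“bad” at about 20 and “excellent” at 100); the evenly spaced values for “poor,” “fair,” and “good” are my assumption, and the function name is hypothetical.

```python
# Endpoints follow the article: "bad" is roughly 20 and "excellent" is 100.
# The evenly spaced values in between are an assumption, not Netflix's numbers.
ANCHORS = [(20, "bad"), (40, "poor"), (60, "fair"), (80, "good"), (100, "excellent")]


def describe_vmaf(score: float) -> str:
    """Return the rating label(s) that bracket a VMAF score."""
    if score <= ANCHORS[0][0]:
        return ANCHORS[0][1]
    for (low, low_label), (high, high_label) in zip(ANCHORS, ANCHORS[1:]):
        if score == high:
            return high_label
        if low < score < high:
            return f"between {low_label} and {high_label}"
    return ANCHORS[-1][1]  # anything above 100 gets the top label


print(describe_vmaf(70))
```

Under these assumptions, a score of 70 falls between “fair” and “good,” which matches the interpretation in the quote above. Keep in mind that this applies to the default model under the 1080p and 3H viewing condition.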

Next, Netflix tackled the technical issue of how to measure videos encoded at lower than full resolution. As with most metrics, you can only calculate VMAF when the source and encoded videos share the same resolution. So, if you’re computing the score for a 480p video encoded from a 1080p source, should you scale the 480p video back to 1080p, or the 1080p source to 480p? The quick answer is the former; you should scale the encoded video to 1080p and measure it against the original source. Most practitioners use FFmpeg for this operation.

Of course, FFmpeg supports multiple upscaling techniques, including bilinear, bicubic, Lanczos, and others. Here, the article recommends using bicubic upsampling if you don’t know the algorithm used by the actual display device. I’ve always used and recommended Lanczos because according to a white paper here, that’s the technique used by NVIDIA GPUs, which own the lion’s share of the market. That said, the scoring difference between different upscaling techniques is minimal, and so long as you’re consistent, it probably doesn’t matter which technique you use.
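
To make the scale-then-measure workflow concrete, here is a minimal Python sketch that builds an FFmpeg command line. It assumes an FFmpeg build compiled with libvmaf; the file names, the log path, and the function name are placeholders I made up. The sketch only assembles the arguments and does not handle frame rate or timestamp mismatches between the two files.

```python
def build_vmaf_command(encoded, source, width=1920, height=1080,
                       scaler="bicubic", log_path="vmaf.json"):
    """Build an FFmpeg argument list that upscales `encoded` and scores it against `source`."""
    filtergraph = (
        # Scale the encoded (distorted) file up to the source resolution.
        f"[0:v]scale={width}:{height}:flags={scaler}[dist];"
        # libvmaf takes the distorted stream first and the reference second.
        f"[dist][1:v]libvmaf=log_path={log_path}:log_fmt=json"
    )
    return ["ffmpeg", "-i", encoded, "-i", source,
            "-lavfi", filtergraph, "-f", "null", "-"]


# Hypothetical file names; swap scaler="lanczos" to match a different display assumption.
print(" ".join(build_vmaf_command("encoded_480p.mp4", "source_1080p.mp4")))
```

You could pass the resulting list to subprocess.run to execute it. Whichever scaler you choose, use the same one for every file in a comparison.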

Metrics for A/B Experimentation

Like most quality metrics, VMAF produces a single score per frame, and most tools that I’ve seen average the scores for the individual frames into a single overall score. Though Netflix notes that there are other averaging techniques available, the article states that “simple arithmetic mean (AM) is the best way of averaging, in that it yields the highest correlation with subjective scores.” This is a nice verification of the technique used by the MSU tool and, I assume, most others.

That said, the article also recognizes that using a single score to represent the quality of a long file involves certain risks, among them missing quality drops that might degrade QoE but not significantly reduce the overall score. To counteract this, the article recommends assessing aggregate quality, start play quality, and variability, or “the average VMAF over the entire session, average VMAF over the first N seconds, and the number of times the VMAF value drops below a certain threshold relative to the previous values.”
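
The three measures in that quote can be sketched in a few lines of Python. This is one possible reading, not Netflix’s implementation: I’m assuming a non-empty list of per-frame scores, and I count a “drop” when a frame falls more than a fixed margin below the mean of the preceding frames. The parameter names and default values (start_seconds, drop, window) are hypothetical.

```python
from statistics import mean


def session_metrics(scores, fps, start_seconds=5.0, drop=10.0, window=30):
    """Summarize per-frame VMAF scores as aggregate, start-play, and variability."""
    # Aggregate quality: arithmetic mean over the whole session.
    aggregate = mean(scores)

    # Start play quality: mean over the first N seconds.
    start_frames = max(1, int(round(start_seconds * fps)))
    start_play = mean(scores[:start_frames])

    # Variability: count the times the score falls more than `drop` points
    # below the mean of the previous `window` frames. A run of consecutive
    # low frames counts as a single drop.
    drops = 0
    in_drop = False
    for i in range(1, len(scores)):
        baseline = mean(scores[max(0, i - window):i])
        below = scores[i] < baseline - drop
        if below and not in_drop:
            drops += 1
        in_drop = below

    return {"aggregate": aggregate, "start_play": start_play, "drops": drops}
```

For example, a 21-frame clip that sits at 95 except for one frame at 70 still averages close to 94, but the drops count flags the dip that the average hides.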

In my own practice, I find tracking quality drops absolutely essential. As an example, the Result Plot below shows the VMAF scores for a file encoded using 2-pass VBR (red) and 1-pass CBR (green). The overall scores aren’t too different, as the VBR file scored 96.24 and the CBR file 94.5, both well into the excellent range.

However, the CBR file has multiple regions that drop well below the 2-pass VBR file, suggesting a much lower quality of experience than the composite score suggests. Note that the Result Plot also makes it simple to display the frames (original, file 1, file 2) at any location on the graph, which simplifies subjectively verifying that the lower scores correlate with visibly lower quality.

Beyond the Result Plot, the MSU tool lets you save “bad frames,” or extract N lower quality frames with their score and frame number, another way to track quality variability over the duration of the file. You can save bad frames when operating the program from the command line, which is useful because the Result Plot is only available via the GUI.
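
If your tool only exports per-frame scores, you can approximate the same idea yourself. The Python sketch below is not how the MSU tool works internally; it simply pulls the N lowest-scoring frames from a list of per-frame scores. The function name is hypothetical, and frame numbers are zero-based.

```python
import heapq


def worst_frames(scores, n=5):
    """Return the n lowest-scoring frames as (frame_number, score), worst first."""
    return heapq.nsmallest(n, enumerate(scores), key=lambda pair: pair[1])


# Made-up per-frame scores for illustration.
print(worst_frames([90, 50, 80, 40, 95], n=2))
```

The frame numbers it returns tell you where to look in the video when you check whether the low scores correspond to visible artifacts.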

The bottom line is that a single VMAF score, or the single score from any quality metric for that matter, provides an incomplete picture of the QoE that file will deliver. You have to explore the lower quality regions of the file, including subjectively verifying that the visual quality of the frame is as poor as the score suggests. You also have to check that the quality deficit is visible during real-time playback; otherwise, it may be irrelevant. As an example, on the extreme right of the Result Plot above is a drop in quality for a single frame, which most viewers wouldn’t actually perceive.

The Journey Continues

In the final section of the article, Netflix lists areas where they hope to improve VMAF in the future. While some were straightforward (adding models that better measure temporal perceptual effects), others raised critical questions about how to use VMAF, which, fortunately, Netflix agreed to answer.

For example, the article states, “VMAF does not fully capture the benefits of perceptual optimization options found in many codecs, although it is moving in the right direction compared to PSNR.” This raises the question of whether these optimizations should be enabled or disabled for files measured with VMAF.

So, I asked Zhi:

The blog post states that “VMAF does not fully capture the benefits of perceptual optimization options found in many codecs.” Does this mean that you should turn them off (tune PSNR or SSIM) when encoding for analysis with VMAF, or does that matter?

Netflix responded, “Since VMAF partially captures the benefit of perceptual optimization, and if at the end of the day you will be encoding with these settings on, we still recommend turning them on.”

The article also states, “The VMAF model works the best with videos of a few seconds. It does not capture long-term effects such as recency and primacy, as well as rebuffering events.”

We asked Netflix to clarify, and they responded, “The main reason we made that statement is because VMAF has been trained using short clips of a few seconds in the subjective tests. We believe VMAF averaged over the entire video file still well captures the first-order effects of perceptual video quality. There are second-order effects, however, such as recency (i.e. subject tends to weigh more on video portions towards the end) and primacy (i.e. subject tends to weigh more on the beginning portion of a video), that are not captured by a short-term model like VMAF, but these are less crucial.”

Where to Go from Here

Overall, in both my writing and consulting practice, I’ve found VMAF a very useful metric for assessing video quality, particularly for measuring the quality of different resolution rungs on an encoding ladder, where I’ve found PSNR inadequate. It’s great that Netflix chose to open-source VMAF so the rest of us could use it, and has continued to invest in VMAF’s development.

Zhi Li spoke at Streaming Media West 2018 and I’ve embedded his talk below. Here’s a description.

VMAF (Video Multimethod Assessment Fusion) is a quality metric that combines human vision modeling with machine learning. It demonstrates a high correlation to human perception and gives a score that is consistent across content. VMAF was released on GitHub in 2016 and has had considerable updates since that time. This talk focuses on the latest VMAF improvements and enrichment, such as speed optimization, accurate models to predict mobile and 4K TV viewing conditions, and adding a confidence interval to quantify the level of confidence in the quality prediction. In addition, we discuss VMAF use cases and look at the VMAF road map for the near future.
