Netflix Introduces New Quality Metric

Netflix announced the open-source availability of the Video Multimethod Assessment Fusion (VMAF), the metric it now uses instead of PSNR to analyze the quality of transcodes in its vast catalog.

Yesterday, Netflix announced the open-source availability of its new video quality metric, the Video Multimethod Assessment Fusion (VMAF), in a long, explanatory blog post. Netflix already uses the metric in production, where it appears to have displaced older methods like Peak Signal-to-Noise Ratio (PSNR). Here’s what you need to know.

Why Did Netflix Create Its Own Metric?

Because it discovered that older metrics weren’t effective. The Netflix video technology group continuously works to choose the best codecs and encoding parameters, and to apply the best quality-control measures to its video production workflows. While human benchmarking is the gold standard, “manual visual inspection is operationally and economically infeasible for the throughput of our production.” And Netflix’s use of existing metrics was hampered by the fact that they do not “consistently reflect human perception.”

To demonstrate this, Netflix created 34 six-second test clips from its own catalog and from publicly available sources. The technology group encoded these at multiple resolutions and data rates to produce 300 test clips, which it ran through subjective testing. It then scored the same clips with four objective metrics: PSNR, the Structural Similarity Index (SSIM), Multiscale FastSSIM, and PSNR-HVS (PSNR Human Visual System). Netflix found that “these metrics fail to provide scores that consistently predict the DMOS ratings from observers.” DMOS is the Differential Mean Opinion Score, the human rating produced by subjective testing.

To develop VMAF, Netflix collaborated with Prof. C.-C. J. Kuo, Director of the Multimedia Communications Lab and Dean’s Professor in Electrical Engineering-Systems at the University of Southern California.

How Does the New Metric Work?

The new metric computes multiple elementary quality metrics, then fuses them into a single score with a machine-learning algorithm, specifically a Support Vector Machine (SVM) regressor. The three elementary metrics, which feed the regressor as a feature vector (see the sketch after the list), are:

  • Visual Information Fidelity (VIF)
  • Detail Loss Metric (DLM)
  • Motion
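
VIF and DLM are substantial algorithms in their own right, and the production implementations ship in the open-source package. As a rough illustration only, the Python sketch below treats the VIF and DLM scores as precomputed, substitutes a crude temporal-difference calculation for the motion feature (the real feature operates on filtered luma), and shows how the three elementary scores form the feature vector the SVM regressor consumes. All names and numbers here are hypothetical.

```python
import numpy as np

def motion_feature(frames: np.ndarray) -> float:
    """Crude stand-in for VMAF's motion feature: mean absolute luma
    difference between adjacent frames. `frames` is [num_frames, h, w]."""
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0))
    return float(diffs.mean())

def feature_vector(vif: float, dlm: float, frames: np.ndarray) -> np.ndarray:
    """Assemble one clip's elementary scores into the vector that the
    SVM regressor fuses into a single quality score."""
    return np.array([vif, dlm, motion_feature(frames)])

# Toy 4-frame, 8x8 "luma" clip just to exercise the functions.
rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(4, 8, 8))
print(feature_vector(0.93, 0.88, frames))
```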

This approach is significant in (at least) two ways. First, the elementary quality metrics Netflix chose for this first iteration are not baked into the framework, so developers who adopt this approach can choose and deploy their own elementary metrics within VMAF.

Second, the machine-learning algorithm lets developers create their own data sets and “teach” the algorithm how to evaluate their specific classes of content. So a developer of surveillance equipment would use a different dataset than the Cartoon Network, which would use a different dataset than Turner Classic Movies.
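
Continuing the sketch above, teaching the fusion stage with a custom dataset might look like the following, with scikit-learn’s SVR standing in for Netflix’s SVM regressor and every score invented for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR  # stands in for Netflix's SVM regressor

# Hypothetical training set: one row of elementary scores (VIF, DLM,
# motion) per clip, labeled with DMOS values from your own subjective tests.
X_train = np.array([
    [0.95, 0.92, 2.1],  # high-quality encode
    [0.80, 0.75, 6.4],  # mid-quality encode
    [0.55, 0.48, 6.3],  # low-quality encode
])
y_train = np.array([88.0, 61.0, 32.0])  # DMOS labels on a 0-100 scale

# Scaling features before an SVM is standard practice.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
model.fit(X_train, y_train)

# Fuse the elementary scores of a new encode into a quality estimate.
new_clip = np.array([[0.87, 0.81, 4.0]])
print(model.predict(new_clip))
```

Swap in clips and subjective scores from your own content class, whether cartoons, surveillance footage, or classic films, and the same machinery learns a model tuned to that material.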

How Effective is the New Metric?

To test the first iteration of the metric, Netflix divided the 300 test clips into two groups: one used to train the metric, the other used to test it. In the blog post, Netflix compared VMAF to PSNR-HVS, the best-performing traditional metric, with the results shown in Figure 1.

Figure 1. VMAF compared to PSNR-HVS. Note that the figure has been modified slightly for presentation purposes.

Each point on the scatter graph represents a separate test file, with the coordinates set by the DMOS score on the horizontal axis and the VMAF or PSNR-HVS score on the vertical axis. If the scores matched perfectly, they would plot precisely on the red centerline drawn in each graph.

As such, closeness to the red centerline in each graph indicates the accuracy with which the metric predicted the subjective score. As you can see, the VMAF results are much more tightly bunched along the centerline, indicating a superior ability to accurately predict subjective scores on the Netflix test files.
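
The blog post makes the comparison visually, but the tightness of such scatter plots is conventionally summarized with correlation and error statistics between predicted and subjective scores. A minimal sketch using SciPy, with made-up score arrays:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Made-up per-clip scores standing in for a metric's predictions
# and the corresponding subjective DMOS values.
dmos = np.array([22.0, 41.0, 55.0, 63.0, 78.0, 91.0])
predicted = np.array([25.0, 38.0, 57.0, 60.0, 80.0, 89.0])

pcc, _ = pearsonr(predicted, dmos)     # linear correlation
srocc, _ = spearmanr(predicted, dmos)  # rank-order correlation
rmse = float(np.sqrt(np.mean((predicted - dmos) ** 2)))

print(f"PCC={pcc:.3f}  SROCC={srocc:.3f}  RMSE={rmse:.2f}")
```

The closer PCC and SROCC are to 1.0, and the lower the RMSE, the more tightly the points hug the centerline.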

What’s Left to Do?

Netflix identified three open questions.

  1. Incorporating different viewing platforms. For example, a 1080p file might look just as good as a 4K file when viewed on a 1080p display, though the 4K file should look much better on a 4K display. VMAF doesn’t currently account for this.
  2. Temporal pooling. A video with 90 high-quality minutes and 5 poor-quality minutes would achieve a very good average score, though a human viewer would likely remember and weigh the poor-quality sections more heavily. As Netflix says, “a pooling algorithm that gives more weight to lower scores may be more accurate towards human perception.” (A brief pooling sketch follows this list.)
  3. Consistency among source materials. VMAF is highly dependent upon the quality of the source file. “Because of this, it can be inaccurate to compare (or summarize) VMAF scores across different titles,” according to the blog post. “For example, when a video stream generated from an SD source achieves a VMAF score of 99 (out of 100), it by no means has the same perceptual quality as a video encoded from an HD source with the same score of 99.”
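
To make the pooling question concrete, here is a sketch (not Netflix’s method, which the post leaves open) contrasting a plain average with two simple schemes that weight low-scoring stretches more heavily; the scores are hypothetical per-minute values:

```python
import numpy as np

def mean_pool(scores: np.ndarray) -> float:
    """Plain average: 90 good minutes can mask 5 bad ones."""
    return float(scores.mean())

def harmonic_pool(scores: np.ndarray) -> float:
    """Harmonic mean: disproportionately dragged down by low scores."""
    return float(len(scores) / np.sum(1.0 / np.maximum(scores, 1e-6)))

def worst_fraction_pool(scores: np.ndarray, frac: float = 0.05) -> float:
    """Average of only the worst `frac` of per-segment scores."""
    k = max(1, int(len(scores) * frac))
    return float(np.sort(scores)[:k].mean())

# 90 high-quality minutes plus 5 poor ones.
scores = np.concatenate([np.full(90, 95.0), np.full(5, 30.0)])
print(mean_pool(scores))            # ~91.6 -- the bad stretch barely registers
print(harmonic_pool(scores))        # ~85.3 -- low scores pull it down harder
print(worst_fraction_pool(scores))  # 30.0 -- judged by the worst moments
```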

Interestingly, the SSIMplus Index, a proprietary quality metric that Netflix did not evaluate in the blog post, already purports to address items 1 and 3, and to deliver a higher correlation with human DMOS scores than PSNR. SSIMplus was developed by Dr. Zhou Wang, one of the creators of the original SSIM algorithm, and is the primary algorithm used in the SSIMWave Quality of Experience Monitor, which Streaming Media reviewed in 2015.

Where Can I Get the Open Source Package?

You can download it from GitHub at https://github.com/Netflix/vmaf.

Why is VMAF Important?

As recently as last December, Netflix relied upon PSNR to drive its per-title optimization process, though the company stated then that VMAF was coming. Now VMAF is here, and it’s freely available for implementation by other video producers and tool developers. While PSNR isn’t going away anytime soon, expect a number of third-party analyses of VMAF in the next 12 to 18 months, including (hopefully) comparisons with SSIMplus.

In the meantime, it will be interesting to see if VMAF is incorporated into third-party QC and video quality analysis tools. If it is, and if VMAF is independently verified, the Netflix blog post could very well signal the start of the serious decline of the use of PSNR and similar metrics in video and codec benchmarking.

About Jan Ozer

I help companies train new technical hires in streaming media-related positions; I also help companies optimize their codec selections and encoding stacks and evaluate new encoders and codecs. I am a contributing editor to Streaming Media Magazine, writing about codecs and encoding tools. I have written multiple authoritative books on video encoding, including Video Encoding by the Numbers: Eliminate the Guesswork from your Streaming Video (https://amzn.to/3kV6R1j) and Learn to Produce Video with FFmpeg: In Thirty Minutes or Less (https://amzn.to/3ZJih7e). I have multiple courses relating to streaming media production, all available at https://bit.ly/slc_courses. I currently work at NETINT (www.netint.com) as a Senior Director in Marketing.
