The sheer number of video quality measurement tools makes it difficult to choose the right metric. Here’s a quick overview of some of the options and what they offer.
All compressionists should use video-quality metrics to make compression-related decisions regarding data rate, key frame interval, and other settings. Unfortunately, the sheer number of options and the noise surrounding their comparative efficacy make it tough to choose the right metric.
The gold standard of all quality metrics is the controlled subjective experiment, but these experiments are cumbersome and expensive to run. The technical measure of any video-quality metric is how well it predicts these subjective scores.
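In practice, that fit is usually summarized with a correlation coefficient between the metric's output and the mean opinion scores (MOS) collected from viewers. Here's a minimal sketch of that calculation in Python using SciPy; the score arrays are invented placeholders, not results from any real test.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical mean opinion scores (MOS) from a viewer panel, one per clip
mos = [4.6, 3.9, 3.1, 2.4, 1.8]
# Scores the metric under evaluation produced for the same five clips
metric_scores = [92.0, 85.5, 71.3, 60.2, 44.8]

# Pearson measures linear agreement; Spearman measures rank agreement.
# The closer either value is to 1.0, the better the metric tracks viewers.
print("Pearson: ", pearsonr(mos, metric_scores)[0])
print("Spearman:", spearmanr(mos, metric_scores)[0])
```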
The simplest quality metrics are error-based measures such as mean squared error (MSE) and peak signal-to-noise ratio (PSNR), which compare the pixels in the compressed stream against those in the original and measure how much they differ. The problem with PSNR is that it doesn't attempt to prioritize the differences that the human eye would actually perceive or consider important.
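To see how mechanical this is, here's a minimal PSNR calculation in Python with NumPy; the frames are random stand-ins for decoded video frames, and the helper name is mine, not part of any product.

```python
import numpy as np

def psnr(reference: np.ndarray, compressed: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB for two same-sized 8-bit frames (illustrative helper)."""
    # Mean squared error: the average squared per-pixel difference
    mse = np.mean((reference.astype(np.float64) - compressed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    # PSNR = 10 * log10(MAX^2 / MSE)
    return 10.0 * np.log10((max_val ** 2) / mse)

# Random stand-ins for a decoded source frame and its compressed counterpart
ref = np.random.randint(0, 256, (720, 1280), dtype=np.uint8)
enc = np.clip(ref.astype(np.int16) + np.random.randint(-5, 6, ref.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ref, enc):.2f} dB")
```

Every pixel difference counts the same in this math, which is exactly why PSNR can disagree with what viewers actually notice.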
Does this mean you should never use PSNR? Well, as recently as December 2015, Netflix used PSNR as the basis for its per-title optimization decisions, though it has since moved to video multimethod assessment fusion (VMAF), which I discuss later. A metric that was acceptable in December 2015 can hardly be useless 10 months later. Still, purely as a visual quality metric, PSNR sits at the bottom of the pack.
The next class of video-quality metric attempts to model the human visual system to more accurately predict how the human eye might rate a video. There are a number of relatively standard algorithms such as structural similarity (SSIM), multiscale structural similarity (MS-SSIM), video quality metric (VQM), and motion-based video integrity evaluation (MOVIE). Beyond these, there are metrics that are unique to a specific tool, such as the Sarnoff JND option available with Video Clarity's ClearView tools; the difference mean opinion score (DMOS) available with Tektronix's Picture Quality Analyzer (PQA); and the SSIMplus metric, which is available in the SSIMWave Video Quality-of-Experience Monitor (SQM).
All of these human visual system-based tools claim to predict actual subjective ratings with much greater accuracy than PSNR, and often with greater accuracy than each other. One key differentiator in this class is the availability of display-specific measures, since video that looks great on an iPhone could look grainy and blocky on a 4K TV set. This feature is available in SSIMWave's SQM and Tektronix's PQA.
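Of the standard algorithms in this class, SSIM is the easiest to try on your own. The sketch below scores a single grayscale frame pair with scikit-image; the frames are random stand-ins, and in real use you'd feed it decoded frames from the source and the compressed stream.

```python
import numpy as np
from skimage.metrics import structural_similarity

# Random stand-ins for a decoded source frame and its compressed counterpart
ref = np.random.randint(0, 256, (720, 1280), dtype=np.uint8)
enc = np.clip(ref.astype(np.int16) + np.random.randint(-8, 9, ref.shape), 0, 255).astype(np.uint8)

# SSIM scores local structure (luminance, contrast, correlation) rather than
# raw pixel error, which is why it tracks perception better than PSNR.
score = structural_similarity(ref, enc, data_range=255)
print(f"SSIM: {score:.4f}")
```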
All of these metrics are based on relatively static mathematics, which complicates customization for different video types and minimizes the opportunity to get "smarter" over time. Netflix's recently launched open source metric, VMAF, incorporates machine learning to avoid these limitations. Specifically, you train the system with datasets that pair videos with the subjective scores they received in human testing, so the metric learns to match human ratings more closely. This helps the system improve over time, and it can also customize results for different content types, such as cartoons or sports. However, you can't currently customize the score by display type, though this is coming. It may be open source, but VMAF is not yet available in a retail tool. As metrics get more advanced, they either get harder to access or more expensive (sometimes both).
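If you want to experiment with VMAF today, the most practical route is an FFmpeg build compiled with libvmaf. The wrapper below is a rough sketch; the file names are placeholders, and the filter's options vary a bit across FFmpeg versions, so check your build's documentation.

```python
import subprocess

def run_vmaf(distorted: str, reference: str, log_path: str = "vmaf.json") -> None:
    """Score a compressed file against its source with FFmpeg's libvmaf filter.

    Assumes an FFmpeg build configured with --enable-libvmaf; option names
    can differ between versions, so treat this as a starting point.
    """
    cmd = [
        "ffmpeg",
        "-i", distorted,     # first input: the compressed/distorted file
        "-i", reference,     # second input: the original source
        "-lavfi", f"libvmaf=log_fmt=json:log_path={log_path}",
        "-f", "null", "-",   # discard decoded output; we only want the log
    ]
    subprocess.run(cmd, check=True)

# Placeholder file names for illustration
run_vmaf("encoded.mp4", "source.mp4")
```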
Which class of metric is right for you? The best you can afford. Just keep in mind that a well-designed tool to apply the metric is equally important. At some point, you'll want to visually review the quality differences the metric reports, and the tool should make this simple. You'll also want to automate operation over multiple files. Finally, the most sophisticated tools, such as Tektronix's PQA software, let you focus your testing on specific issues, such as artifact or blockiness detection, using both full-reference models (which compare the compressed file with the original) and no-reference models (which analyze the compressed file only, and are faster). In short, it's not all about the algorithm; it's also about the tool that delivers it.
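To make the no-reference idea concrete, here's a toy blockiness estimate that works from the compressed frame alone by comparing pixel differences across 8x8 block boundaries with those inside blocks. It's a crude heuristic for illustration only, not the algorithm any of these products actually use.

```python
import numpy as np

def blockiness(frame: np.ndarray, block: int = 8) -> float:
    """Crude no-reference blockiness score for one grayscale frame.

    Compares the average horizontal pixel difference across 8x8 block
    boundaries with the average difference everywhere else; values well
    above 1.0 suggest visible block edges. A toy heuristic only.
    """
    f = frame.astype(np.float64)
    diffs = np.abs(np.diff(f, axis=1))           # horizontal neighbor differences
    cols = np.arange(diffs.shape[1])
    at_boundary = (cols % block) == (block - 1)  # columns straddling a block edge
    boundary_mean = diffs[:, at_boundary].mean()
    interior_mean = diffs[:, ~at_boundary].mean()
    return boundary_mean / (interior_mean + 1e-12)

frame = np.random.randint(0, 256, (720, 1280), dtype=np.uint8)  # stand-in frame
print(f"Blockiness ratio: {blockiness(frame):.3f}")
```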