One visual quality metric that’s been getting a bit more love lately is the Structural Similarity Index (SSIM). For example, when Facebook launched its first VR metric, SSIM360, it based the metric on SSIM. I’ve generally avoided SSIM because the scoring range (0 – 1) is too narrow for my liking, and I wasn’t aware of any way to map SSIM scores to subjective evaluations.
Well, a colleague recently pointed me to an article entitled SSIM-based Video Admission Control and Resource Allocation Algorithms, by researchers from the Department of Information Engineering at the University of Padova, Italy. The article contains the table copied below, which maps SSIM scores to mean opinion scores (MOS), the subjective ratings gathered from human viewers. These scores were established in yet another article, available here. As you can see, scores above 0.99 should look perfect, while scores in the 0.95 – 0.99 range indicate impairments that are “perceptible but not annoying.” I have a project now that involves SSIM scoring, and these data points are definitely useful; I hope you find them useful, too.
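If you wanted to turn those breakpoints into code, a rough mapping might look like the sketch below. Only the two thresholds quoted above (0.99 and 0.95) come from the Padova paper; the function name and the catch-all label for lower scores are my own.

```python
def ssim_to_rating(ssim: float) -> str:
    """Map an SSIM score to a rough subjective-quality label.

    The 0.99 and 0.95 breakpoints are the ones cited in the text;
    everything below 0.95 is lumped into one bucket here, since the
    full table draws finer distinctions than this sketch does.
    """
    if not 0.0 <= ssim <= 1.0:
        raise ValueError("SSIM scores fall in the range 0-1")
    if ssim > 0.99:
        return "imperceptible"  # should look perfect
    if ssim >= 0.95:
        return "perceptible but not annoying"
    return "noticeably impaired"  # below the ranges quoted above

print(ssim_to_rating(0.995))  # imperceptible
print(ssim_to_rating(0.97))   # perceptible but not annoying
```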
VMAF and The Magic Number 93
I still prefer Netflix’s VMAF metric, particularly for assessing the quality of files in an encoding ladder. That’s because VMAF scores range from 0 to 100, providing a more meaningful spread, and because VMAF is designed to rate files at resolutions ranging from 240p to 1080p and has been used to rate videos as large as 4K.
In his article entitled VMAF Reproducibility: Validating a Perceptual Practical Video Quality Metric, RealNetworks CTO Reza Rassool concluded, “if a video service operator were to encode video to achieve a VMAF score of about 93 then they would be confident of optimally serving the vast majority of their audience with content that is either indistinguishable from original or with noticeable but not annoying distortion.” So a VMAF score of 93 is roughly equivalent to 0.95 for SSIM. Another useful data point is that a differential of six VMAF points is a Just Noticeable Difference (JND), which obviously adds context when comparing scores.
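Here’s a quick sketch of how those two numbers, the 93 target and the six-point JND, might be applied when comparing rungs of an encoding ladder. The helper names are illustrative, not part of any VMAF tooling.

```python
VMAF_TARGET = 93.0  # Rassool's threshold for "optimally serving" viewers
JND = 6.0           # differential treated as a Just Noticeable Difference

def meets_target(vmaf: float) -> bool:
    """True if the rendition should look transparent, or nearly so."""
    return vmaf >= VMAF_TARGET

def noticeably_different(vmaf_a: float, vmaf_b: float) -> bool:
    """True if viewers would likely perceive a quality difference."""
    return abs(vmaf_a - vmaf_b) >= JND

# Example: two adjacent rungs of a ladder scoring 94.2 and 90.1
print(meets_target(94.2))                # True
print(noticeably_different(94.2, 90.1))  # False: within one JND
```

The second check is handy when pruning a ladder: if two rungs score within one JND of each other, viewers probably can’t tell them apart, so the more expensive rung may not be earning its bitrate.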
Actual human subjective ratings will always be the gold standard, though totally impractical for most day-to-day use cases where objective metrics shine. Whether you’re using SSIM or VMAF, when you need to predict subjective quality based upon objective scoring, it’s nice to have authoritative backing.