This post describes how to formulate the optimal encoding ladder with VMAF. This analysis is excerpted from a lesson in the online course Streaming Media 101: Technical Onboarding for Streaming Media Professionals.
I received the following question from a reader; it’s got such general applicability that I thought I would share my response.
Question: We’re currently reviewing our ABR ladder and we’re using (among other tools) the VMAF-score to decide which bitrate to use to get the qualities we need… I’m interested in how a ladder can be measured so we can have the proper VMAF score.
Answer: This is a three-step process; setting the top rung, identifying the data rate of subsequent rungs, and choosing the optimal resolution for each rung.
As an overview, most of what I’m describing is derived from the Netflix brute force per-title method of identifying the optimal data rates and the resolution of rungs in the encoding ladder from this seminal blog post. I say brute force because it involves many dozens of encodes of the same file at different resolutions and data rates to find the convex hull, which is “where the encoding point achieves Pareto efficiency.” In non-math speak, this means the resolution that achieves the highest VMAF score for each rung in the ladder.
Netflix performs this analysis for each video they distribute, which makes sense when your files are viewed tens of millions of times (actually, Netflix has moved on to Dynamic Optimization, which uses a similar analysis for each shot). Most other producers should do this with 5-10 files per genre (sports, animation, talk shows, movies) to create an average ladder for that content.
Contents
Step 1: Set the Top Rung of Your Encoding Ladder
You start with the brute force encoding, rendering your source file multiple times at multiple resolutions; you can see the options I’ve selected in Figure 1 for the relatively low-motion Netflix test clip, Meridian. My practice is to encode at 200 kbps intervals above 1 Mbps and 100 k intervals below at the resolutions shown. With a bit of practice you can narrow down the encodes that you perform; obviously, it doesn’t make sense to encode at 3.2 Mbps for 180p.
Table 1 shows the VMAF scores at each resolution and data rate. The yellow rows are the rungs of the ladder selected as described below. The Max column identifies the highest score for each data rate. The green box is the cell identified via conditional formatting as equalling the Max score which is the highest VMAF score at that resolution. More on this below.
The top rung of your ladder should be the lowest data rate that achieves a VMAF score of between 93-95. The 93 score comes from this white paper which found that with 4K clips, “if a video service operator were to encode video to achieve a VMAF score of about 93 then they would be confident of optimally serving the vast majority of their audience with content that is either indistinguishable from original or with noticeable but not annoying distortion.”
The 95 score comes from this more recent paper, which states that 95 VMAF is the lowest score “at which a video signal is on average subjectively indistinguishable from the original video signal.” This may be a bar too high for some services, but I decided to go with 95. Note that had I used 93 as the target, I would have saved about 1400 kbps in bandwidth for the top-quality clip.
Many companies set a bandwidth limit for their clips that they don’t exceed even if necessary to achieve a VMAF score of 93; say 6 Mbps. For extremely hard-to-encode clips, it may not be possible to achieve 95 or even 93 without exceeding that limit. In this case, use 6 Mbps as the starting point for this analysis.
Choose the Data Rate of Lower Rungs
How do you choose the data rate of subsequent rungs? The best guidance I’ve seen is from Apple’s TN2224 which was deprecated by the HLS Authoring Specification (which doesn’t discuss the spacing issue) and later taken down. Regarding the spacing of the adjacent data rates, TN2224 stated.
“Adjacent bit rates should be a factor of 1.5 to 2 apart. You should keep adjacent bit rates at a factor of 1.5 to 2 apart. If you put them too close together, you will increase your number of variants unnecessarily. For example, it won’t help much if you’ve got both a 150 kb/s and a 180 kb/s stream. On the other hand, don’t make them too far apart either. If they are too far apart, the client may find itself in a situation where it could actually have gotten a better stream but there isn’t one available. An example of this is if the client has to go from 500 kb/s to 3000 kb/s. This is a factor of six differences.”
To achieve this, I multiply each rung by .6 which gives you a spacing of 1.66x between rungs. Going lower (say to 1.5x) increases the rungs in your encoding ladder and your encoding and storage cost. Going higher (say to 2.0) does the reverse. For this analysis, I kept adding rungs until I had one rung at or under 300 kbps.
Choose the Optimal Resolution For Each Rung
Whatever multiple you choose, this simple math identifies the data rate for the ladder rungs, which again, I’ve identified in yellow. The score in green in each row identifies the highest VMAF score for all resolutions encoded at the data rate which is the resolution you would include in your encoding ladder. Figure 2 shows the encoding ladder for the Netflix Meridian clip derived from this analysis.
This analysis assumes full-screen playback for all clips which is obviously fair to assume for Netflix. If your website has player window sizes like 720p or 360p that are commonly used I would make sure to have at least one rung at that resolution and might bunch several at that resolution.
Why Per-Title is Key
Let’s look at another example, the Harmonic Football test clip. Here’s the data.
Here’s the encoding ladder. As you can see, the ladder rungs (and VMAF scores) are wildly different since the file is a much more difficult to compress sports video. This is why per-title encoding is so essential – encoding the Meridian clip using this ladder would waste over 2 Mbps for the top rung.
So, that’s it. I hope you find the foregoing useful. Here’s the course this lesson is from.