Reducing the reference frame setting from 16 to 1 when using x264’s veryslow preset decreased encoding time by 46% on ten 2-minute test files while dropping the average VMAF score from 96.06 to 95.97, a decrease of 0.09 points, or less than 0.1%. If you used the Medium preset, at the default setting of 3 reference frames, changing to ref=1 would drop encoding time by about 16% with a similarly irrelevant quality drop.
Reference frames are used by the x264 codec to find redundant blocks from nearby frames. These redundancies are the fuel for inter-frame compression, which is much more powerful than intra-frame compression. This is why talking head videos encode much more efficiently than soccer matches; more redundancies equal greater efficiency.
Searching more reference frames obviously extends encoding time. Given the efficiencies delivered by these redundancies, you would assume that a higher number of reference frames also improves compression efficiency. That turns out not to be true in any meaningful way; the quality delta is insignificant.
Here’s the basic 2-pass encoding string that I used for these trials, changing preset, data rate, and ref settings for each encode.
ffmpeg -y -i Sintel.mp4 -c:v libx264 -force_key_frames "expr:gte(t,n_forced*2)" -b:v 5000k -maxrate 10000k -bufsize 10000k -an -refs 1 -f mp4 -pass 1 NUL && ^
ffmpeg -y -i Sintel.mp4 -c:v libx264 -force_key_frames "expr:gte(t,n_forced*2)" -b:v 5000k -maxrate 10000k -bufsize 10000k -an -refs 1 -pass 2 Sintel_Ref_1.mp4
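If you script trials like these, the two-pass string above can be assembled per ref setting. What follows is only a minimal Python sketch: the helper name build_two_pass and its parameters are hypothetical, the explicit -preset flag is an assumption (the string above omits it, and medium is the libx264 default), and doubling maxrate and bufsize simply mirrors the 200% constrained VBR values in the command.

```python
import os
import shlex


def build_two_pass(src, out, refs, bitrate_kbps, preset="medium"):
    """Return (pass1, pass2) FFmpeg argument lists for one trial.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    common = [
        "ffmpeg", "-y", "-i", src,
        "-c:v", "libx264",
        "-preset", preset,  # assumption: not present in the original string
        # Force a keyframe every 2 seconds, as in the command above.
        "-force_key_frames", "expr:gte(t,n_forced*2)",
        "-b:v", f"{bitrate_kbps}k",
        # 200% constrained VBR: maxrate and bufsize at twice the target.
        "-maxrate", f"{2 * bitrate_kbps}k",
        "-bufsize", f"{2 * bitrate_kbps}k",
        "-an",
        "-refs", str(refs),
    ]
    # Pass 1 only gathers statistics, so its output goes to the null
    # device (NUL on Windows, /dev/null elsewhere).
    pass1 = common + ["-f", "mp4", "-pass", "1", os.devnull]
    pass2 = common + ["-pass", "2", out]
    return pass1, pass2


# Print the commands for the ref=1 trial without running them.
for cmd in build_two_pass("Sintel.mp4", "Sintel_Ref_1.mp4", refs=1, bitrate_kbps=5000):
    print(shlex.join(cmd))
```

Each list can be handed to subprocess.run(cmd, check=True) to run the encode. Passing a list rather than a single string sidesteps shell quoting of the parentheses in the keyframe expression.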
Table 1 shows the comparative quality of files encoded with FFmpeg using the Medium x264 preset and the reference frame settings shown in each column. As the command string shows, I encoded all files using 2-pass 200% constrained VBR and customized the data rate for each file to achieve between 94 and 96 VMAF points, about the top-rung quality target for most OTT shops. During this first round, data rates ranged from 8 Mbps for the Soccer and Football clips, through 6 Mbps for Freedom and Haunted, down to 1 Mbps for the tutorial clip.
Note that the Medium preset uses a reference frame setting of 3, so you would have to manually boost the settings yourself to achieve the results shown. I wanted to start with the default Medium preset since presumably, that’s what most producers use. Below I show the results using the Veryslow preset, which does use a reference frame setting of 16, and the results are very similar.
Back to Table 1. As with all tables, red is bad and green is good, so a reference frame setting of 1 delivered the overall worst quality. However, if you look at the Max Delta column, which shows the maximum difference between the highest and lowest scores, you see an average of 0.13%, which is insignificant. The maximum difference for any single file was under 0.25%.
Table 1. FFmpeg output quality using the Medium preset and reference settings shown.
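To make the Max Delta idea concrete, here is a small Python sketch. It assumes the column is the spread between the best and worst score, expressed as a percentage of the best score; that definition is an assumption, and the function name max_delta is hypothetical. The only inputs used are the two average Veryslow scores quoted in the opening paragraph, since the per-file table values are not reproduced in the text.

```python
def max_delta(scores):
    """Spread between the best and worst score.

    Returns (points, percent), where percent is relative to the best
    score. This is an assumed definition of the "Max Delta" column.
    """
    best, worst = max(scores), min(scores)
    points = best - worst
    return points, 100.0 * points / best


# Average Veryslow VMAF scores from the opening paragraph (ref=16, ref=1).
points, percent = max_delta([96.06, 95.97])
print(f"{points:.2f} VMAF points, {percent:.3f}%")  # 0.09 VMAF points, 0.094%
```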
Table 2 contains the encoding times, which show a 48.75% maximum delta. So, it’s unlikely that anyone out there boosted the reference frame setting to 16 when encoding with the Medium preset, but if you had, you could more than double your encoding performance by reducing the reference frame setting to 1.
Table 2. FFmpeg encoding time using the Medium preset and reference settings shown.
Chart 1 tracks the quality (in blue) and encoding time (in red) as a percentage of the maximum. You see that ref=1 delivers 99.87% of the quality of ref=16 in 46.1% of the encoding time. If you drop from the default ref=3 to ref=1, you save about 15.5% of the encoding time, which isn’t staggering but isn’t chopped liver, either (if you’re under 40 and not from Brooklyn or the Bronx, please see https://knowyourphrase.com/what-am-i-chopped-liver).
Chart 1. Charting encoding time vs. quality.
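Time percentages like these translate into throughput gains with simple arithmetic. The sketch below converts the two Chart 1 figures (46.1% of the ref=16 encoding time, and about 15.5% saved going from ref=3 to ref=1) into speedup factors. The function names are hypothetical, and the conversion assumes that throughput is simply the inverse of encoding time.

```python
def speedup_from_time_fraction(fraction_of_original_time):
    """Speedup when the new encode takes this fraction of the old time."""
    return 1.0 / fraction_of_original_time


def speedup_from_time_saved(percent_saved):
    """Speedup when the new encode saves this percentage of the old time."""
    return 1.0 / (1.0 - percent_saved / 100.0)


# ref=16 -> ref=1 with the Medium preset: 46.1% of the original time.
print(round(speedup_from_time_fraction(0.461), 2))  # 2.17

# ref=3 -> ref=1 with the Medium preset: about 15.5% of the time saved.
print(round(speedup_from_time_saved(15.5), 2))  # 1.18
```

In other words, taking 46.1% of the time is what makes the "more than double" claim work, while the 15.5% saving from the default is roughly an 18% throughput gain.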
I started with the Medium preset since that’s the default preset that I presume most producers use. The Veryslow preset is more relevant because many producers do use it and the default reference setting is ref=16. Here are the summary numbers for ref=1 and ref=16. Note that I adjusted the data rate of several test files downward during these trials to get closer to the 94 VMAF target. For example, the football and soccer clips are both at 6 Mbps for these tests. So don’t compare these VMAF scores to the Medium preset results shown above.
Table 3. FFmpeg output quality using the Veryslow preset and reference settings shown.
As you can see, the time difference is substantial and the quality difference insignificant.
Figure 1 shows the Result Plot from the Moscow State University Video Quality Measurement Tool using the Haunted test clip, a DSLR-based scary movie shot as an advertisement for a local haunted house. The plot shows the VMAF for two files, ref=1 in green and ref=16 in red. As you can see, they track very closely throughout the file, indicating a lack of transient issues in either clip.
Figure 1. Results plot of the Haunted clip encoded using the veryslow preset. Ref=16 is red, ref=1 is green.
Skeptics out there may be grumbling that I must have messed something up. Perhaps, but I checked my encodes two ways. MediaInfo shows ref=1 on the left and ref=16 on the right. Note that I set the keyframe interval in seconds, using FFmpeg’s -force_key_frames option, rather than in frames. That is why MediaInfo incorrectly shows keyint=250, the x264 default, in both files, though the actual keyframe interval was 60 frames in this 30 fps file.
Figure 2. MediaInfo shows files with ref=1 and ref=16.
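The 60-frame interval follows from the -force_key_frames expression in the encoding string: gte(t,n_forced*2) is true whenever the frame time t, in seconds, reaches twice the number of keyframes forced so far. Here is a minimal Python sketch that simulates that rule. It assumes a constant frame rate and timestamps starting at zero, and the function name forced_keyframes is hypothetical.

```python
from fractions import Fraction


def forced_keyframes(total_frames, fps, interval_seconds=2):
    """Frame indices where gte(t, n_forced*interval_seconds) is true.

    Simulation only: assumes constant frame rate and t starting at 0.
    """
    forced = []
    n_forced = 0  # number of keyframes forced so far
    for n in range(total_frames):
        t = Fraction(n, fps)  # timestamp of frame n in seconds
        if t >= n_forced * interval_seconds:
            forced.append(n)
            n_forced += 1
    return forced


# First five seconds of a 30 fps file.
print(forced_keyframes(150, 30))  # [0, 60, 120]
```

Because the rule is expressed in seconds, the same expression would yield a keyframe every 48 frames in a 24 fps file, which is the appeal of specifying the interval this way.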
Solveig Multimedia’s excellent Zond 265 shows the same two files. On the left, in the Reference pictures box, we see that the ref=1 file actually grabbed references from two frames, not one, while on the right the ref=16 file found 15 reference frames. In both cases, the reference picture percentages exceeded 100%, which I’m sure is correct, though after multiple explanatory emails with Solveig I still don’t understand why. If you look at the reference picture frame numbers, both encodes found the bulk of the reference blocks on the same two frames, 2200 and 2208. While the ref=16 encode found reference blocks in 13 other frames, the ref=1 encode found 2% more in frame 2208, almost equalling the total numbers for ref=16.
Figure 3. Zond 265 showed that the file with ref=1 actually had two reference pictures.
In the MB types box up top, we see that the ref=1 file found reference matches for 99.99% of blocks while the ref=16 file scored 100%. These numbers generally tracked pretty closely throughout the two files, with the intra/inter numbers seldom varying by more than a quarter of a percentage point, despite the huge variance in reference frames.
Perhaps that’s why the VMAF scores varied so minimally. In real-world clips, most redundant blocks are in frames very proximate to the frame being encoded. If you limit the number of reference frames, the encoder still finds about the same number of reference blocks and the amount of inter/intra frame compression is very close. So, the quality results are close.
All encoders are different; you might achieve completely different results with AWS Elemental MediaConvert or any other encoder that doesn’t use the x264 codec. Heck, you might achieve a different result with encoders that use x264. However, if you run your own encoding farm, saving encoding time probably saves you money. If you’ve jacked up your reference frame setting to chase the ultimate quality or use a preset with 5 or more reference frames, you might run some experiments to see if you achieve similar results.
If I messed something up, please let me know at firstname.lastname@example.org.
I’ll discuss these and other research-based encoding findings in my webinar on H.264 production this Thursday, Mar 25, 2021, 2:00 PM – 3:30 PM EDT. I hope to see you there.