I first encountered the line, “Anything worth doing is worth overdoing,” in the Robert Heinlein novel Time Enough for Love. I bring this up because this is my fourth recent article on GOP size and I think I’ve reached that point.
To recap, I reported on testing in an article entitled The Impact of GOP Size on Video Quality. The key finding was how little quality improved once GOP sizes exceeded 2-3 seconds for either H.264 or HEVC. You can see this in Table 1.
Comments from practitioners on one of the LinkedIn posts about the article were so useful that I collected them in an article entitled, Real-World Perspectives on Choosing the Optimal GOP Size. You can see a summary of the findings in Table 2 below.
That article spawned several comments from additional practitioners, including this one from Meta’s David Ronca, who said, “The optimal gop size is aligned to the encoders placement of intra frames with a max spacing between 5-10 seconds. That is, let the encoder decide as much as possible. Some years ago, the avg spacing was ~3.5 seconds. 5-second max key frame distance would have few redundant idrs.”
As many readers know, David led the Netflix encoding team that developed per-title encoding, VMAF, and shot-based encoding. If there were a Mount Rushmore for video encoding, David would definitely be on it, so I figured it would be worth a few tests to verify his recommendations.
The Encoding Strings
To test this, I encoded two files, the full-length versions of Meridian and Tears of Steel, both around 12 minutes long, using the x264 codec. My standard recommendation for VOD encoding is 2-pass 200% constrained VBR with a 2-second GOP. I tested with a bitrate designed to deliver a VMAF score of around 95, suitable for the top rung of a premium content encoding ladder.
Here’s the string for the Meridian file.
ffmpeg -y -i Meridian.mp4 -c:v libx264 -preset veryslow -threads 8 -pass 1 -b:v 3400k -maxrate 6800k -bufsize 6800k -g 60 -keyint_min 60 -sc_threshold 0 -f null NUL
ffmpeg -y -i Meridian.mp4 -c:v libx264 -preset veryslow -threads 8 -pass 2 -b:v 3400k -maxrate 6800k -bufsize 6800k -g 60 -keyint_min 60 -sc_threshold 0 Meridian_2sec_fixed_264.mp4
Setting both -g and -keyint_min to 60 fixed the GOP length at 60 frames, and setting -sc_threshold to 0 disabled scene-change detection, so there would be no keyframes at scene changes.
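To make the relationship between GOP duration, frame rate, and these switches explicit, here is a minimal Python sketch. The function names are hypothetical and only the switches that appear in the encoding strings in this article are used; it covers the fixed-GOP case just shown and the scene-change case that follows.

```python
def gop_frames(seconds: float, fps: float) -> int:
    """Convert a GOP duration in seconds to the frame count passed to -g."""
    return round(seconds * fps)


def x264_gop_args(seconds: float, fps: float, scene_change: bool) -> list[str]:
    """Return the GOP-related ffmpeg arguments (hypothetical helper)."""
    g = gop_frames(seconds, fps)
    if scene_change:
        # Variable GOP: keyframes at detected scene changes,
        # with -g acting as the maximum keyframe interval.
        return ["-g", str(g), "-sc_threshold", "40"]
    # Fixed GOP: minimum and maximum interval are equal and
    # scene-change detection is switched off.
    return ["-g", str(g), "-keyint_min", str(g), "-sc_threshold", "0"]


print(x264_gop_args(5, 24, scene_change=True))
print(x264_gop_args(10, 24, scene_change=True))
```

For a 24 fps file, a 5-second maximum works out to -g 120 and a 10-second maximum to -g 240.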
Then, I tested with maximum GOP sizes of five and ten seconds, with keyframes inserted at scene changes. Here’s the encoding string for the 5-second GOP for Tears of Steel.
ffmpeg -y -i TOS.mp4 -c:v libx264 -preset veryslow -threads 8 -pass 1 -b:v 3200k -maxrate 6400k -bufsize 6400k -g 120 -sc_threshold 40 -f null NUL
ffmpeg -y -i TOS.mp4 -c:v libx264 -preset veryslow -threads 8 -pass 2 -b:v 3200k -maxrate 6400k -bufsize 6400k -g 120 -sc_threshold 40 TOS_5sec_scenechange_264.mp4
These commands insert a keyframe at each detected scene change, and after 120 frames (five seconds for the 24 fps file) if there are no scene changes within that span. Table 3 shows how that impacted the number of GOPs in the two roughly 12-minute test files.
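If you want to count GOPs in your own encodes, one approach is to list the keyframe timestamps with ffprobe and summarize them. The sketch below is not how the table was produced. The ffprobe command in the comment is an assumption about a typical invocation, and keyframe_stats is a hypothetical helper that only parses text, one timestamp per line.

```python
# Assumed way to produce the input (not taken from the article):
#   ffprobe -v error -select_streams v:0 -skip_frame nokey \
#       -show_entries frame=pts_time -of csv=p=0 TOS_5sec_scenechange_264.mp4


def keyframe_stats(ffprobe_output: str) -> dict:
    """Summarize keyframe timestamps (seconds, one per line)."""
    times = sorted(
        float(line.strip().rstrip(","))  # tolerate a trailing CSV comma
        for line in ffprobe_output.splitlines()
        if line.strip()
    )
    # Distance between consecutive keyframes, i.e. the GOP durations.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "gops": len(times),
        "avg_gap": sum(gaps) / len(gaps) if gaps else 0.0,
        "max_gap": max(gaps, default=0.0),
    }


sample = "0.000000\n2.500000\n5.000000\n10.000000\n"
print(keyframe_stats(sample))
```

With a 5-second maximum, max_gap should never exceed 5.0, and avg_gap shows how often the encoder placed keyframes at scene changes before reaching that limit.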
Table 4 shows the results in both VMAF (harmonic mean) and low-frame score, which is the score of the lowest-quality frame in the file and a measure of the potential for transient quality differences.
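For readers who compute these two numbers themselves, here is a minimal sketch that assumes you already have a list of per-frame VMAF scores; how you export them depends on your tool. The function name and sample scores are hypothetical.

```python
from statistics import harmonic_mean


def summarize_vmaf(per_frame_scores):
    """Return (harmonic mean, low-frame score) for per-frame VMAF scores."""
    scores = list(per_frame_scores)
    # The harmonic mean is pulled down by low outliers more than the
    # arithmetic mean; the low-frame score is simply the minimum.
    return harmonic_mean(scores), min(scores)


# Made-up scores for illustration only, not measured data.
hmean, low_frame = summarize_vmaf([96.0, 95.0, 94.0, 60.0])
print(f"harmonic mean: {hmean:.2f}, low-frame: {low_frame:.2f}")
```

A single poor frame lowers the harmonic mean only slightly over a long file, which is why the low-frame score is worth reporting alongside it.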
While the harmonic mean VMAF delta doesn’t appear significant, the best way to assess this is to determine how much you would have to boost the bitrate of the 2-second GOP file to equal the quality of the 10-second GOP file. I tested this, and had to boost the bitrate of the 2-second GOP by about 400 kbps to reach the same average quality.
A savings of 400 kbps translates to about 175 MB/hour. At a bandwidth cost of $0.0005/GB, this equals about $0.000086 saved per hour. This is a tiny number, but if you multiply it by the 100 billion hours of video delivered by Netflix in 2023, the savings equals around $8.6 million. Granted, Netflix has its own CDN with lots of local POPs, so the actual video streamed is much lower, as is the cost/GB. At a rough estimate for Hulu at 15 billion hours delivered in 2023, the number drops to around $1.3 million, still within Hulu’s rounding error, but a number worth chasing nonetheless.
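The arithmetic is easy to rerun with your own numbers. In this sketch the price per GB is an assumed parameter, set to $0.0005/GB because that rate lands near the totals quoted above. The function name is hypothetical, and units are decimal throughout (1 GB = 10^9 bytes).

```python
def hourly_savings_usd(kbps_saved: float, usd_per_gb: float) -> float:
    """Delivery cost saved per viewing hour for a given bitrate reduction."""
    bytes_per_hour = kbps_saved * 1000 / 8 * 3600  # kilobits/s -> bytes/hour
    gb_per_hour = bytes_per_hour / 1e9             # decimal gigabytes
    return gb_per_hour * usd_per_gb


PRICE_PER_GB = 0.0005  # assumption; substitute your own CDN rate
per_hour = hourly_savings_usd(400, PRICE_PER_GB)

# Viewing-hour totals are the estimates quoted in the text.
for service, hours in (("Netflix", 100e9), ("Hulu", 15e9)):
    print(f"{service}: ${per_hour * hours / 1e6:.2f} million")
```

With decimal units, 400 kbps is 180 MB/hour, so this prints about $9 million and $1.35 million, close to the rounded figures in the text.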
Note that 400 kbps didn’t bring the low-frame score anywhere close to the 10-second GOP score, though I don’t want to overstate the significance of this number. Figure 1 is the Results Plot from the Moscow State University Video Quality Measurement Tool, showing the per-frame VMAF scores over the duration of the two files. The 2-second GOP is in red; the 5-second GOP is in green. I visually checked the frames with the two lowest scores, and the quality delta wasn’t visible. As you can see, the quality drop was very transient in these regions, so it’s highly unlikely that any viewer would see them. Still, at some point these issues will be visible to a viewer, and the higher low-frame score is definitely worth chasing.
Implementation Challenges
Obviously, to implement either the 5-second or 10-second GOP, you need a packager that can handle variable GOP sizes and a player that doesn’t hiccup when playing them. I know that Netflix uses variable segment sizes during its dynamic optimization, and Bitmovin offers shot-based encoding as a feature of its cloud encoder. So, packaging for variable GOP sizes is a problem that’s been solved multiple times.
If your encoder enables customizable GOP sizes and I-frame placements at scene changes, and your packager and player can manage variable GOP sizes, give it a try. The results are likely to be modest, but like the man said, better is always better.