The launch of FFmpeg 4.0 gave many compressionists their first chance to test the new AV1 codec, which is included in experimental form. For the first time, you had a single encoding tool that could produce all the relevant codecs: H.264 with the x264 encoder, HEVC with the x265 encoder, VP9 with Google’s libvpx-vp9 encoder, and AV1 with the libaom encoder.
Since I was updating my book, Learn to Produce Video with FFmpeg in 30 Minutes or Less, for FFmpeg version 4.0, I decided to dig into the AV1 encoding parameters. After running some competitive encodes on a single 5-second clip for the book, I proposed an expanded test for Streaming Media.
I originally thought to test with four 5-second clips. Once I learned that my editor was embarking on a vacation, I decided to increase that to four 30-second clips. Some deficiencies in my test clips and procedures led to a second round with two 30-second clips.
Given the extra time, I solicited input from Google for the AV1 and VP9 encodes, and from MulticoreWare, the developer of x265, for the x265 encodes. Their input is reflected in the encoding parameters shown below. After providing input, a representative from MulticoreWare added, “I believe it is not going to be a fair comparison between one commercial encoder and other reference encoders irrespective of any quality metric (PSNR/SSIM/VMAF). The question is why comparing between x265 vs. AV1/VP9, why not HM vs. AV1/VP9. In the latter case, it will be a real comparison between HEVC and AV1/VP9.”
HM is the HEVC reference encoder. In my experience, a reference encoder encodes with all parameters set for maximum quality, showing off the full quality available for the technology without concern for encoding speed. This didn’t accurately describe my AV1 encoding parameters.
Rather, with the AOM encoder, you set the encoding speed/quality tradeoff with the -cpu-used switch, which can range from -8 to +8, with lower values delivering higher quality (https://trac.ffmpeg.org/wiki/Encode/AV1). As you’ll see in the encoding script below, I set this to the lowest quality (+8) for the first pass and returned it to the mid-range (0) for the second pass. So, I actually used mid-range settings for the encodes, not the full-quality settings.
I asked Google to respond to MulticoreWare’s comment and haven’t yet heard back. From my perspective, it seemed reasonable to test all codecs using the same encoding tool, but I wanted to share MulticoreWare’s concerns.
TL;DR
For those with TL;DR tendencies, here’s the net/net. Though AV1’s quality was impressive, encoding times are simply too long for all but the very tippy top of the VOD pyramid. If your videos aren’t watched millions of times, AV1 encoding is unaffordable.
Playback testing was a bit discouraging, though this may be fixable by improving the AV1 decoder’s ability to utilize multiple cores. If this isn’t possible, it may mean that AV1 is impractical to deploy until hardware playback support is available, which means 2020 for most mobile devices.
Our Tests
I initially tested with four video-only, 30-second test clips: one from the movie Elektra, another a football test clip available on the Harmonic website, an excerpt from the Netflix test clip Meridian, and a concert video of country singer Josiah Weaver performing the song “Freedom.”
Encoding
I encoded all test clips with FFmpeg version ffmpeg-20180728-eb94ec3-win64-static and kept the encoding scripts basic. I sent my proposed script to MulticoreWare for review, and they recommended: “Default AQ mode in x265 is 1. We should enable AQ mode 2 while comparing with other codecs (enable --tune-ssim). There is a huge difference between aq mode 1 and 2 in terms of PSNR and SSIM.” So, I tuned for SSIM and used the veryslow preset, which in my tests has consistently produced better quality than placebo.
ffmpeg -y -i input.mp4 -c:v libx265 -preset veryslow -tune ssim -x265-params bitrate=6000:vbv-maxrate=12000:pass=1 -f mp4 NUL & \
ffmpeg -i input.mp4 -c:v libx265 -preset veryslow -tune ssim -x265-params bitrate=6000:vbv-maxrate=12000:pass=2 output_HEVC.mp4
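Per MulticoreWare’s note, -tune ssim is what moves x265 from its default AQ mode 1 to mode 2. If you would rather see that setting explicitly, you could set aq-mode inside x265-params instead. This is a sketch, not a command I tested, and tune ssim may adjust other psychovisual settings as well, so the two approaches aren’t guaranteed to be identical:
ffmpeg -y -i input.mp4 -c:v libx265 -preset veryslow -x265-params aq-mode=2:bitrate=6000:vbv-maxrate=12000:pass=1 -f mp4 NUL & \
ffmpeg -i input.mp4 -c:v libx265 -preset veryslow -x265-params aq-mode=2:bitrate=6000:vbv-maxrate=12000:pass=2 output_HEVC.mp4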
I followed the same lead for H.264, using this script:
ffmpeg -y -i input.mp4 -c:v libx264 -preset veryslow -tune ssim -b:v 6000K -maxrate 12000K -pass 1 -f mp4 NUL & \
ffmpeg -i input.mp4 -c:v libx264 -preset veryslow -tune ssim -b:v 6000K -maxrate 12000K -pass 2 output_H264.mp4
I used the following script for VP9, after running it by Google:
ffmpeg -y -i input.mp4 -c:v libvpx-vp9 -pass 1 -b:v 6000K -threads 8 -speed 4 -tile-columns 4 -auto-alt-ref 1 -lag-in-frames 25 -frame-parallel 1 -f webm NUL && \
ffmpeg -i input.mp4 -c:v libvpx-vp9 -pass 2 -b:v 6000K -minrate 6000K -maxrate 12000K -threads 8 -speed 0 -tile-columns 4 -auto-alt-ref 1 -lag-in-frames 25 -frame-parallel 1 output_VP9.webm
And I used this script for AV1, again after running it by Google:
ffmpeg -y -i input.mp4 -c:v libaom-av1 -strict -2 -b:v 6000K -maxrate 12000K -cpu-used 8 -pass 1 -f matroska NUL & \
ffmpeg -i input.mp4 -c:v libaom-av1 -strict -2 -b:v 6000K -maxrate 12000K -cpu-used 0 -pass 2 output_AV1.mkv
I encoded each clip at five data rates, from 2Mbps to 6Mbps in 1Mbps increments.
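In practice, each data rate ran as its own script, but if you want to batch a ladder like this, a simple loop works. Here’s a sketch for x264 in a Unix-style shell (the file names, output naming, and the assumption that maxrate is always double the target are mine; this is not one of the batch files I actually used):
for BR in 2000 3000 4000 5000 6000; do
  MAX=$((BR * 2))
  ffmpeg -y -i input.mp4 -c:v libx264 -preset veryslow -tune ssim -b:v ${BR}k -maxrate ${MAX}k -pass 1 -f mp4 /dev/null
  ffmpeg -i input.mp4 -c:v libx264 -preset veryslow -tune ssim -b:v ${BR}k -maxrate ${MAX}k -pass 2 output_H264_${BR}.mp4
done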
As mentioned, when I originally started the review process, I intended to use 5-second clips and had already performed those encodes. When I switched to 30-second clips, I extracted the new source clips and swapped them into the batch files. I created separate FFmpeg scripts for all 20 AV1 test files, ran them in separate Command windows, and it took twelve days and five hours to finish the encodes on my 40-core HP Z840 workstation. With all 20 encodes running, CPU utilization was around 45%, though this dropped as the lower data rate encodes completed.
While encoding and analyzing the HEVC, VP9, and H.264 clips, it struck me that while a single I-frame was reasonable for a 5-second clip, I probably should have switched to a 2-second keyframe interval for the 30-second clips. I did so for the second round, but for these encodes I stayed with the default keyframe interval of 250 frames.
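For reference, a 2-second interval on a 30 fps source works out to a keyframe every 60 frames. A minimal sketch of the flags involved, which would be added to the scripts above (the 30 fps frame rate is an assumption, and keyframe handling varies somewhat across the encoder wrappers):
-g 60 -keyint_min 60 (libx264)
-x265-params keyint=60:min-keyint=60 (libx265, merged with the other x265-params)
-g 60 (libvpx-vp9 and libaom-av1)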
Encoding Speed
To test encoding and decoding speed, I wanted to use a single-CPU computer rather than a 40-core workstation. So, I tested encode/decode performance on an HP ZBook notebook powered by a 2.8 GHz Intel Xeon E3-1505M v5 CPU with an NVIDIA Quadro M1000M graphics chipset along with the HD Graphics 530 GPU embedded in the CPU (Figure 1). For the encoding test, I encoded a five-second excerpt from Tears of Steel.
Figure 1. Here are the specs on the single CPU test machine used for encoding and decoding trials.
Table 1 presents encoding times in seconds and how each encode relates to real-time performance. For these tests, I encoded with version ffmpeg-20180716-8aa6d9a-win64-static because the later version I used for the main encodes didn’t work on the notebook.
The AV1 encode took 62 hours and 48 minutes, which is 45,216 times longer than real time. This compares to under five minutes for both HEVC and VP9, at 58x and 45x real time, respectively. Google advised that encoding times had dropped in later versions, but obviously, the encoder has a long way to go.
Table 1. Encoding times for the respective technologies.
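If you want to gauge encoding speed on your own hardware, you don’t need a stopwatch: the speed= value FFmpeg prints in its progress line is the real-time multiple (1.0x is real time), and adding the -benchmark flag reports CPU time (and, in newer builds, elapsed time) when the run finishes. Here’s a sketch using the AV1 first pass on a hypothetical 5-second clip; summing both passes gives totals comparable to Table 1:
ffmpeg -benchmark -y -i tears_5sec.mp4 -c:v libaom-av1 -strict -2 -b:v 6000K -maxrate 12000K -cpu-used 8 -pass 1 -f matroska NUL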
Not to belabor the point, but encoding times translate directly into encoding cost. For example, if you run your encoding farm in the cloud, you will spend roughly 800 times more to encode AV1 than HEVC (45,216x real time versus 58x). Obviously, you can only recoup this cost if the bitrate savings are substantial and spread over millions of views.
Objective Quality—Round 1
I measured VMAF quality with the Moscow State University Video Quality Measurement Tool. The overall average VMAF scores for the four clips are shown in Figure 2. As you can see, AV1 was the clear leader and H.264 the clear laggard, with x265 and VP9 neck and neck in the middle. No surprises there.
Figure 2. Average VMAF scores for our 4 test clips.
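I used the MSU tool, but readers who want to stay inside FFmpeg can compute VMAF with a build compiled with --enable-libvmaf. Here’s a minimal sketch with hypothetical file names, comparing a 2Mbps AV1 encode against its source; older builds may require pointing the filter at a VMAF model file, so check the libvmaf filter documentation for your version:
ffmpeg -i output_AV1_2M.mkv -i source.mp4 -lavfi libvmaf -f null -
The score is printed to the console at the end of the run.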
Looking at the results, the narrow range of scores is disappointing. Generally, a VMAF score of 93 predicts a clip without disturbing visual artifacts, and on average, all the codecs except H.264 scored close to this number at 2Mbps and easily exceeded it at all higher data rates. I would like to have seen some scores in the 70s and 80s for HEVC and VP9.
Figure 3 presents the results from the football clip, which was the most challenging of the four. Here, AV1 crossed the 93 VMAF threshold at about 2.05Mbps, with HEVC crossing at about 2.7Mbps, VP9 at about 3.05Mbps, and H.264 at about 4.05Mbps. This means that AV1 shaved roughly 24% off the data rate of HEVC while delivering the same quality, 33% off VP9, and 49% off H.264. We’ll return to these numbers in a moment.
Figure 3. Scores from the football clip.
Table 2 shows the BD-Rate and BD-Quality calculations. Briefly, BD-Rate shows the average data rate savings the AV1 codec delivers at the same quality level. Looking at the overall average, AV1 could deliver the same quality as x265 with a 34.88% data rate reduction, the same as VP9 with a 37.69% reduction, and the same as x264 with a 54.82% reduction. For the Football clip, the numbers are 24.89%, 35.5%, and 50%, which roughly track the observations made above about crossing the 93 VMAF threshold.
BD-Quality reverses the variables: for equivalent bandwidth, it finds the average quality improvement. So, on average, at the same bitrate as x265, VP9, and x264, AV1 would deliver 1.25, 1.48, and 3.15 additional VMAF points, respectively. To put this in perspective, a VMAF change of six equals a “just noticeable difference” that should be detected by more than 75% of viewers. That means that if you substituted AV1 for any of the three codecs at the same data rate, most viewers would not notice the difference. This likely says more about the limited range of the test scores than about true qualitative differences between the clips, but the numbers are the numbers.
Table 2. BD-Rate and BD-Quality scores versus AV1.
Basically, the BD-Rate numbers tell you that AV1 can save significant bandwidth when substituted for x265, VP9, and especially x264. However, at the tested data rates and encoding configurations, the perceptible quality difference between videos encoded with the respective codecs isn’t that significant.
I discussed these results with Nigel Lee from EuclidIQ, who has helped me understand how to apply and interpret BD-Rate and other objective metrics in the past. He recommended testing more challenging clips at more challenging data rates, so despite the long encoding times, I decided to take his advice.
Objective Quality—Round 2
Specifically, I added two test clips: the initial runway sequence from Zoolander, which is very challenging, and a different 30-second section of the football clip with much more motion. I also decided to test at more aggressive data rates, in six steps from 1Mbps to 3.5Mbps inclusive. Figure 4 is the rate-distortion curve, which shows much more differentiation among the codecs: not only between AV1 and HEVC, but also between HEVC and VP9, and between H.264 and all the others.
Figure 4. Updated results for two more challenging clips at more aggressive data rates.
Table 3 shows the numbers. Though the BD-Rate differential between AV1 and x265 shrank significantly, the BD-Quality value more than doubled, so at the same data rate, AV1 would produce a VMAF score averaging 2.99 points higher. That’s still not visible to most viewers, but it’s getting closer. At 5.55 points for libvpx, many viewers would notice a difference between videos encoded at the same data rate, while the 16.79 score for x264 indicates that most prudent video publishers wouldn’t attempt these data rates with x264 (or any H.264 codec).
Table 3. Round 2 BD-Rate and BD-Quality scores versus AV1.
What did we learn about testing? You should focus your tests on the data rates at which your video will most likely be deployed. At this point, H.264 and any newer codec should produce near-perfect quality at 6Mbps, making that data rate irrelevant for forward-looking testing. HEVC and VP9 take the near-perfect quality level down to between 3.5Mbps and 4Mbps, and AV1 and future codecs should bring it down into the 2Mbps to 3.5Mbps range. For this reason, it makes the most sense to test in the range covered in the second round.
What about clip type? If a significant differential only appears in challenging clips, is it relevant for the vast majority of easier-to-encode clips? I would say yes. Remember that both test clips in the second round were challenging sequences pulled from longer videos that, on average, are easier to encode than the tested segments. Even talk shows have challenging sections, whether the opening logo or quick cuts to the applauding audience. So, while the overall quality difference may be minor on generally easy-to-encode videos, AV1 should be able to cut the overall data rate while preserving quality in the hard-to-encode sequences within those videos.
Decoding
Decoding speed is shown in Figure 5. To assess this, I converted the 6Mbps Elektra file from each codec to Y4M format using a simple FFmpeg script and recorded the approximate performance FFmpeg reported during the conversion. As you can see in the figure, AV1 decoded at 0.66x real time, compared to 8x for HEVC, 10.5x for VP9, and 14x for H.264.
You can also see that with CPU utilization stuck at about 20%, AV1 was using only one of the four cores available on the system. If AOM can rework the decoder so it can address more than a single core, real-time playback of 1080p video may be possible on the ZBook, though CPU utilization will be much higher than for the other formats, which would mean diminished battery life.
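For reference, the Y4M conversion behind Figure 5 is a one-liner. Here’s a sketch along the lines of what I ran, with hypothetical file names, for the AV1 clip; the speed= value FFmpeg reports during the conversion is the decode multiple, and the same command against the HEVC, VP9, and H.264 files yields theirs:
ffmpeg -i elektra_6M_AV1.mkv -f yuv4mpegpipe elektra_AV1.y4m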
I also attempted to play all four files in FFplay while recording CPU utilization in Performance Monitor. H.264, HEVC, and VP9 all played in real time with minimal CPU impact, which suggests some form of hardware-accelerated decoding, whether on the GPU or in the graphics hardware embedded in the CPU. AV1 wouldn’t play at all, which is probably not surprising for a codec still in experimental status.
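For anyone repeating the playback test, it’s as simple as opening each file in FFplay with Performance Monitor (or your operating system’s equivalent) running; file names here are hypothetical:
ffplay output_HEVC.mp4
ffplay output_AV1.mkv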
Figure 5. AV1 decode suffers from a lack of hardware acceleration.
Absent a more efficient decoder, these decoding numbers don’t bode well for AV1 playback on devices without GPU or some other form of hardware acceleration, which likely won’t hit the market until early 2020. The E3-1505M v5 Xeon in the ZBook is a pretty robust CPU, and it doesn’t look like it could play a 1080p AV1 file in real time.
Summary
These tests revealed glimpses of very alluring quality compared to existing codecs, but at a current encoding cost that’s far beyond what the vast majority of video publishers can afford. How many devices will play AV1 without some form of hardware acceleration is also in question, though again, this may be easily fixable. While you should expect encoding and decoding performance to improve fairly quickly, it’s hard to see AV1 as relevant for most producers for at least 12 to 18 months.
The author wishes to thank Nigel Lee, Chief Science Officer at EuclidIQ, for his high-level review of the quality discussion. Dr. Lee did not review the measurements or calculations, so any errors are those of the author.