The launch of FFmpeg 4.0 gave many compressionists their first chance to test the new AV1 codec, which is included in experimental form. For the first time, you had a single encoder that could produce all relevant codecs: H.264 with the x264 codec, HEVC with the x265 codec, VP9 using the Google Libvpx-vp9 codec, and AV1 using the LibAOM codec.
Since I was updating my book, Learn to Produce Video with FFmpeg in 30 Minutes or Less, for FFmpeg version 4.0, I decided to dig into the AV1 encoding parameters. After running some competitive encodes on a single 5-second clip for the book, I proposed an expanded test for Streaming Media.
I originally thought to test with four 5-second clips. Once I learned that my editor was embarking on a vacation, I decided to increase that to four 30-second clips. Some deficiencies in my test clips and procedures led to a second round with two 30-second clips.
Given the extra time, I solicited input from Google for the AV1 and VP9 encodes and MulticoreWare, the developer of x265, for the x265 encodings. All input is included in the encoding parameters shown below. After providing input, a representative from MulticoreWare added, “I believe it is not going to be a fair comparison between one commercial encoder and other reference encoders irrespective of any quality metric (PSNR/SSIM/VMAF). The question is why comparing between x265 vs. AV1/VP9, why not HM vs. AV1/VP9. In the latter case, it will be a real comparison between HEVC and AV1/VP9.”
HM is the HEVC reference encoder. In my experience, a reference encoder encodes with all parameters set for maximum quality, showing off the full quality available for the technology without concern for encoding speed. This didn’t accurately describe my AV1 encoding parameters.
Rather, with the AOM encoder, you set the encoding speed/quality tradeoff using the -cpu-used switch, which can range from -8 to +8, with lower values delivering higher quality (https://trac.ffmpeg.org/wiki/Encode/AV1). As you’ll see in our encoding script below, for the first pass, we set this at the lowest quality (+8) and returned it to the mid-range (0) for the second pass. So, we actually used mid-range settings for our encodes, not the full quality settings.
I asked Google to respond to MulticoreWare’s comment and haven’t yet heard back. From my perspective, it seemed reasonable to test all codecs using the same encoder, but I wanted to share MulticoreWare’s concerns.
For those with TL;DR tendencies, here’s the net/net. Though AV1’s quality was impressive, encoding times are simply too long for all but the very tippy top of the VOD pyramid. If your videos aren’t watched millions of times, AV1 encoding is unaffordable.
Playback testing was a bit discouraging, though this may be fixable by improving the AV1 decoder’s ability to utilize multiple cores. If this isn’t possible, it may mean that AV1 is impractical to deploy until hardware playback support is available, which means 2020 for most mobile devices.
I initially tested with four video-only 30-second test clips: one from the movie Elektra, another from a football test clip available on the Harmonic website, an excerpt from the Netflix test clip Meridian, and a concert video of country singer Josiah Weaver singing the song “Freedom.”
I encoded all test clips with FFmpeg version ffmpeg-20180728-eb94ec3-win64-static. I kept the encoding scripts basic. I sent my proposed script to MulticoreWare for their review, and they recommended “Default AQ mode in x265 is 1. We should enable AQ mode 2 while comparing with other codecs (enable –tune-ssim). There is a huge difference between aq mode 1 and 2 in terms of PSNR and SSIM.” So, I tuned for SSIM and used the Veryslow preset, which in my tests has consistently produced better quality than Placebo.
ffmpeg -y -i input.mp4 -c:v libx265 -preset veryslow -tune ssim -x265-params bitrate=6000:vbv-maxrate=12000:pass=1 -f mp4 NUL && ^
ffmpeg -i input.mp4 -c:v libx265 -preset veryslow -tune ssim -x265-params bitrate=6000:vbv-maxrate=12000:pass=2 output_HEVC.mp4
I followed the same lead for H.264, using this script:
ffmpeg -y -i input.mp4 -c:v libx264 -preset veryslow -tune ssim -b:v 6000K -maxrate 12000K -pass 1 -f mp4 NUL && ^
ffmpeg -i input.mp4 -c:v libx264 -preset veryslow -tune ssim -b:v 6000K -maxrate 12000K -pass 2 output_H264.mp4
I used the following script for VP9 after running it by Google.
ffmpeg -y -i input.mp4 -c:v libvpx-vp9 -pass 1 -b:v 6000K -threads 8 -speed 4 -tile-columns 4 -auto-alt-ref 1 -lag-in-frames 25 -frame-parallel 1 -f webm NUL && ^
ffmpeg -i input.mp4 -c:v libvpx-vp9 -pass 2 -b:v 6000K -minrate 6000K -maxrate 12000K -threads 8 -speed 0 -tile-columns 4 -auto-alt-ref 1 -lag-in-frames 25 -frame-parallel 1 output_VP9.webm
And used this script for AV1, again after running it by Google.
ffmpeg -y -i input.mp4 -c:v libaom-av1 -strict -2 -b:v 6000K -maxrate 12000K -cpu-used 8 -pass 1 -f matroska NUL && ^
ffmpeg -i input.mp4 -c:v libaom-av1 -strict -2 -b:v 6000K -maxrate 12000K -cpu-used 0 -pass 2 output_AV1.mkv
I encoded each clip at 5 data rates, 2Mbps to 6Mbps inclusive.
As mentioned, when I originally started the review process, I intended to use five-second clips and had already performed the encodes. When I switched to 30-second clips, I extracted the new source clips and swapped them in the batch files. I created separate FFmpeg scripts for all 20 AV1 test files that I ran in separate Command windows, and it took twelve days and five hours to finish the encodes on my 40-core HP Z840 workstation. With all 20 encodes running, CPU utilization was around 45%, though this dropped as the lower data rate encodes completed.
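A batch of this size is easier to generate than to type. The following Python sketch builds the two-pass AV1 command lines for every clip and data rate, using the parameters from the AV1 script above. Several details are assumptions rather than facts from the tests: the clip file names are hypothetical, the data rates are assumed to step by 1Mbps, -maxrate is assumed to stay at twice the target (only the 6Mbps command is shown in the article), and -passlogfile is added here so that encodes running side by side in one folder would not share a stats file.

```python
from itertools import product

# Hypothetical file names; the article names the clips but not the files.
CLIPS = ["elektra.mp4", "football.mp4", "meridian.mp4", "freedom.mp4"]
# Five target data rates in kbps, 2Mbps to 6Mbps inclusive (1Mbps steps assumed).
BITRATES_KBPS = [2000, 3000, 4000, 5000, 6000]


def av1_two_pass(clip, kbps):
    """Return the [first pass, second pass] command lines for one AV1 encode."""
    stem = clip.rsplit(".", 1)[0]
    # Assumption: maxrate is held at 2x the target, as in the 6Mbps script.
    common = (
        f"ffmpeg -y -i {clip} -c:v libaom-av1 -strict -2 "
        f"-b:v {kbps}K -maxrate {2 * kbps}K"
    )
    # Addition, not in the original script: a per-job stats file prefix.
    passlog = f"-passlogfile {stem}_{kbps}"
    first = f"{common} -cpu-used 8 -pass 1 {passlog} -f matroska NUL"
    second = f"{common} -cpu-used 0 -pass 2 {passlog} {stem}_{kbps}_AV1.mkv"
    return [first, second]


# One job per clip and data rate: 4 clips x 5 rates = 20 encodes.
jobs = [av1_two_pass(clip, kbps) for clip, kbps in product(CLIPS, BITRATES_KBPS)]

for first, second in jobs:
    # Each printed line can be pasted into its own Windows Command window.
    print(f"{first} && {second}")
```

Each printed line chains the two passes with &&, so the second pass starts only if the first one succeeds.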
While encoding and analyzing the HEVC, VP9, and H.264 clips, it struck me that while using only one I-frame for a 5-second clip was reasonable, I probably should have switched to a 2-second I-frame interval for the 30-second clips. I did so for the second round but stayed with the default keyframe interval of 250 frames for these encodes.
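To put the two settings in the same units: FFmpeg's -g option takes the keyframe interval in frames, so a 2-second interval depends on the frame rate of the clip. The sketch below does the conversion in both directions. The frame rates used are examples only, since the article does not state the frame rates of the test clips.

```python
def keyframe_interval_frames(fps, seconds=2.0):
    """Frames between keyframes for a given interval in seconds."""
    return round(fps * seconds)


def interval_seconds(frames, fps):
    """Seconds between keyframes for a given interval in frames."""
    return frames / fps


def gop_flag(fps, seconds=2.0):
    """Build the FFmpeg -g argument for the requested keyframe interval."""
    return f"-g {keyframe_interval_frames(fps, seconds)}"


# Example frame rates only; the clips' actual frame rates are not given.
for fps in (24, 25, 30):
    default_gap = interval_seconds(250, fps)
    print(f"{fps} fps: default 250 frames = {default_gap:.1f} s, "
          f"2-second interval = {gop_flag(fps)}")
```

At 24 fps, for instance, the 250-frame default works out to a keyframe a little more than every 10 seconds, which is why a 30-second clip ends up with only a few I-frames.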
To test encoding and decoding speed, I wanted to use a single-CPU computer rather than a 40-core workstation. So, I tested encode/decode performance on an HP ZBook notebook powered by a 2.8 GHz Intel Xeon E3-1505M v5 CPU with an NVIDIA Quadro M1000M graphics chipset along with the HD Graphics 530 GPU embedded in the CPU (Figure 1). For the encoding test, I encoded a five-second excerpt from Tears of Steel.
Figure 1. Here are the specs on the single CPU test machine used for encoding and decoding trials.
Table 1 presents encoding times in seconds, and how the encode relates to real-time performance. For these tests, I encoded with version ffmpeg-20180716-8aa6d9a-win64-static because the later version I used for the encodes didn’t work on the notebook.
The AV1 encode took 62 hours and 48 minutes, which was 45,216 times longer than real time. By comparison, HEVC and VP9 each took under five minutes, or 58x and 45x real time, respectively. Google advised that encoding times had dropped in later versions, but obviously, AV1 has a long way to go.
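The real-time multiple is simply encoding time divided by clip duration. As a quick check of the figures above, here is the arithmetic in Python, using only the numbers quoted in this article (the function names are illustrative):

```python
CLIP_SECONDS = 5  # duration of the Tears of Steel excerpt


def realtime_multiple(encode_seconds, clip_seconds=CLIP_SECONDS):
    """How many times longer than real time an encode took."""
    return encode_seconds / clip_seconds


def encode_seconds_from_multiple(multiple, clip_seconds=CLIP_SECONDS):
    """Recover the encoding time from a real-time multiple."""
    return multiple * clip_seconds


# AV1: 62 hours and 48 minutes, expressed in seconds.
av1_seconds = 62 * 3600 + 48 * 60
av1_multiple = realtime_multiple(av1_seconds)

# HEVC and VP9 are quoted as 58x and 45x real time.
hevc_seconds = encode_seconds_from_multiple(58)
vp9_seconds = encode_seconds_from_multiple(45)

print(f"AV1: {av1_seconds} s, {av1_multiple:,.0f}x real time")
print(f"HEVC: {hevc_seconds} s, VP9: {vp9_seconds} s")
```

The HEVC and VP9 multiples correspond to 290 and 225 seconds, both consistent with "under five minutes."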
Not to belabor the point, but encoding times translate to encoding cost. For example, if you run your encoding farm in the cloud, you will spend roughly 800 times more to encode AV1 than HEVC. Obviously, you can only recoup this cost if the bitrate savings are substantial and spread over millions of views.
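The "roughly 800 times" figure follows from the ratio of the two real-time multiples. The sketch below shows that ratio and a simple cost projection. It assumes that cost scales linearly with encoding time on the same class of machine, and the price per machine-hour is a made-up placeholder, not a quote from any cloud provider.

```python
# Real-time multiples reported for the 5-second test encode.
AV1_MULTIPLE = 45216
HEVC_MULTIPLE = 58

# Hypothetical placeholder price, in dollars per machine-hour.
HYPOTHETICAL_RATE = 1.00


def encode_cost(content_hours, realtime_multiple, rate_per_hour):
    """Cost to encode content, assuming cost scales linearly with encode time."""
    machine_hours = content_hours * realtime_multiple
    return machine_hours * rate_per_hour


cost_ratio = AV1_MULTIPLE / HEVC_MULTIPLE

hevc_cost = encode_cost(1, HEVC_MULTIPLE, HYPOTHETICAL_RATE)
av1_cost = encode_cost(1, AV1_MULTIPLE, HYPOTHETICAL_RATE)

print(f"AV1 costs about {cost_ratio:.0f}x as much as HEVC to encode")
print(f"One hour of content: HEVC ${hevc_cost:,.2f}, AV1 ${av1_cost:,.2f}")
```

Whatever the actual hourly price, it cancels out of the ratio, so the multiplier of about 780 holds as long as both codecs are encoded on the same kind of machine.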
Objective Quality—Round 1
I measured VMAF quality with the Moscow University Video Quality Measurement Tool. The overall average VMAF scores for the four clips are shown in Figure 2. As you can see, AV1 was the clear leader, H.264 the clear laggard, and x265 and VP9 neck and neck in the middle. No surprises there.
Figure 2. Average VMAF scores for our 4 test clips.
Looking at the results, the range of scores is disappointing. Generally, a VMAF score of 93 predicts a clip without disturbing visual artifacts, and all technologies except for H.264 averaged close to this number at 2Mbps and easily exceeded it at all higher data rates. I would have liked to have seen some scores in the 70s and 80s for HEVC and VP9.
Figure 3 presents the results from the football clip, which was the most challenging of the four. Here, AV1 crossed the 93 VMAF threshold at about 2.05Mbps, with HEVC crossing at about 2.7Mbps, VP9 at about 3.05Mbps, and H.264 at about 4.05Mbps. This means that AV1 shaved roughly 24% off the data rate of HEVC while delivering the same quality, 33% off VP9, and 49% off H.264. We’ll return to these numbers in a second.
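The savings percentages come from comparing the data rates at which each codec reaches the same quality. Here is that calculation as a short Python sketch, using the approximate crossing points read from Figure 3 (the names are illustrative):

```python
# Approximate data rate (Mbps) at which each codec reached a VMAF score of 93
# on the football clip.
CROSSING_MBPS = {"AV1": 2.05, "HEVC": 2.7, "VP9": 3.05, "H.264": 4.05}


def savings_percent(candidate_mbps, reference_mbps):
    """Data rate saved by the candidate codec relative to the reference."""
    return (1 - candidate_mbps / reference_mbps) * 100


av1_savings = {
    codec: savings_percent(CROSSING_MBPS["AV1"], mbps)
    for codec, mbps in CROSSING_MBPS.items()
    if codec != "AV1"
}

for codec, percent in av1_savings.items():
    print(f"AV1 vs. {codec}: {percent:.0f}% lower data rate at VMAF 93")
```

Because the crossing points are read off a chart, the results are only as precise as those readings.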
Figure 3. Scores from the football clip.