While 2018 was the year AV1 became known, 2020 will be the year that AV1 becomes interesting, primarily because of three developments. First, in early 2020, AV1-enabled smart TVs hit the market, right on the 2-year schedule announced back in 2018 by the Alliance for Open Media (AOMedia). Second, over the past two years, encoding times for the AOMedia AV1 codec have dropped from about 2500x real-time to about 2x slower than HEVC. Finally, the emergence of third-party AV1 codecs has increased both the quality and encoding speed of the AV1 codec.
In short, in 24 months, hardware support appeared, encoding became affordable, and AV1 became a much more realistic competitor to HEVC. Let the vast migration to AV1 encoding begin.
Meet the AV1 Codecs
My focus for the review was developer-level products for integration into an existing encoding workflow as opposed to commercial cloud encoding or standalone encoders. I tested four separate AV1 encoders:
- AOMedia’s standalone aomenc, v 2.0.0
- Intel/Netflix’s SVT-AV1 codec, v 0.8.4 confirmed with version 0.804-22
- Visionular’s Aurora1 codec, v 2.0.1
- libaom, the AOMedia codec in FFmpeg, v 2.0.0
aomenc is AOMedia’s standalone encoding executable and is available here. During the testing process, I sent questions and encoding strings to Google, which was kind enough to lend assistance.
SVT-AV1 is an AV1 codec implementation created by Intel and Netflix and is available on GitHub. When I contacted Intel during testing, I received the following response: “Using SVT-AV1 at this time would be potentially a misrepresentation of the project as you’d be evaluating codec capabilities that are not currently implemented or are a [work in progress].” The contact cited three reasons, which I’ll paraphrase:
- Development work on streaming and broadcast-level applications is yet to start.
- Intel hasn’t yet developed a tuning-for-VMAF feature.
- Two-pass rate control is not yet completed.
From my perspective, the second reason wasn’t an issue because I wasn’t tuning for VMAF with any of the codecs (more on that later). I decided to keep SVT-AV1 in the analysis because all codecs are works in progress, I had already invested substantial time learning to use the codec and producing the test clips, and the codec was freely available on GitHub. Once I shared that with Intel, it reviewed my command strings and provided other guidance.
Visionular describes itself as a “next generation video encoding and image processing technology software company.” It supplied its Aurora1 AV1 encoder to me as an executable file with instructions.
Finally, I tested FFmpeg 4.3.1 (build number ffmpeg-20200727-16c2ed4-win64-static) as downloaded from Zeranoe, which included the libaom, x264, and x265 codecs.
When I last tested AV1 codecs, I used four 5-second test clips because encoding times were so lengthy. This time, I started with 10-second clips and learned that AV1 encoding times had dropped significantly. So I expanded the test list to 16 clips in five categories, as shown in Table 1.
To produce the data necessary for the rate-distortion curves and BD-Rate statistics shown later, I needed four encodes for each clip. To choose the data rates, I targeted a VMAF range of between 85 and 95, which is the target for the highest-quality streams produced in most encoding ladders. The only exception was the gaming clips, which were all 60 fps and so dynamic as to require 10Mbps-plus to achieve the same quality levels. So I encoded at 5,750Kbps at the top rung and scaled down from there.
Before laying out the command strings, let’s discuss presets and tuning.
AV1 Encoding Presets
Virtually all codecs include presets because streaming producers have different goals and business models. Some may want to minimize encoding time and costs by accepting slightly lower quality. Some may want the utmost quality regardless of encoding time or cost. Presets allow each streaming producer to optimize for its particular requirements.
AV1 has nine presets selected via the cpu-used switch. My goal was to choose the most commercially reasonable preset that most producers would use, not the highest-quality AV1 preset. To identify this preset, I encoded 10 of the test files with FFmpeg/libaom from five genres to all presets and recorded encoding time, average quality, and low frame quality, the last of which is often useful to spot transient quality issues. I plotted these values as a percent in Figure 1, with red signifying encoding time; blue, average quality; and yellow, low frame quality.
As an example, at cpu-used 3, the encoder averaged 99.53% of average quality, 99.39% of low frame quality, and 6.4% the encoding time of the highest-quality preset, cpu-used 0. Encoding with cpu-used 2 would triple encoding time with minimal additional quality, while cpu-used 4 would increase encoding speed by about 33% with minimal quality loss.
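As a rough illustration of how percentages like these are derived, each preset's encoding time and quality scores are divided by the values for the highest-quality preset. The Python sketch below assumes that normalization; the measurements in it are made-up placeholders, not the test data behind Figure 1, and the function name is hypothetical.

```python
# Hypothetical measurements per cpu-used preset:
# (encode seconds, average quality score, low-frame quality score).
# Placeholder numbers for illustration only, not results from these tests.
measurements = {
    0: (1000.0, 95.0, 90.0),
    3: (64.0, 94.5, 89.4),
    8: (10.0, 92.0, 85.0),
}


def relative_to_baseline(data, baseline=0):
    """Express each preset's time and quality as a percentage of the baseline preset."""
    base_time, base_avg, base_low = data[baseline]
    result = {}
    for preset, (seconds, avg_q, low_q) in data.items():
        result[preset] = {
            "time_pct": 100.0 * seconds / base_time,
            "avg_quality_pct": 100.0 * avg_q / base_avg,
            "low_quality_pct": 100.0 * low_q / base_low,
        }
    return result


relative = relative_to_baseline(measurements)
for preset in sorted(relative):
    row = relative[preset]
    print(
        f"cpu-used {preset}: time {row['time_pct']:.1f}%, "
        f"avg {row['avg_quality_pct']:.2f}%, low {row['low_quality_pct']:.2f}%"
    )
```

Plotting the three percentages per preset side by side makes it easy to see where encoding time falls off much faster than quality.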
Given that the quality was quite high even for the lowest-quality preset, you could argue that cpu-used 8 was commercially reasonable. However, Visionular recommended using its “slower” preset, which corresponded with cpu-used 3. After confirming similar performance for the Aurora1 and aomenc codecs, I tested these three codecs using cpu-used 3.
Interestingly, SVT-AV1 showed a different quality/encoding time tradeoff, with an unexpected quality bump at cpu-used 7, then a drop in quality through cpu-used 2 (see Figure 2). Although the jump from cpu-used 3 to cpu-used 2 tripled encoding time, it also boosted quality from 95.07% to 99.72%, an increase most producers would likely find reasonable. So I encoded all SVT-AV1 clips using preset 2, although I also tested preset 7 in the performance tests discussed later.
x264 and x265 use presets with names like ultrafast, superfast, medium, slow, and placebo. Typically, I recommend the slow preset for x265, which offers a significant boost in low frame quality over the default medium preset and roughly doubles encoding time. However, in the x265 files produced for these comparisons, I encoded using the veryslow preset to optimize quality, although that boosted encoding time significantly. To provide a sense of this, I tested both the slow and veryslow presets in the performance tests discussed later.
Since x264 is orders of magnitude faster to encode than either x265 or any of the AV1 codecs, I used the veryslow preset, which typically offers slightly more quality than placebo in a fraction of the encoding time.
Before we get to the performance tests, let’s cover tuning.
AV1 Encoding Tuning Strategy
Tuning refers to the practice of applying tuning parameters to the command string to remove encoding techniques like adaptive quantization that are known to reduce metric scores. Tuning for metrics is generally well-established for x264 and x265, although not universally used. For example, when benchmarking AV1 against other codecs in recent analyses, neither Facebook nor Netflix tuned, preferring to use their actual production parameters. Nonetheless, because features like adaptive quantization are enabled by default for x264 and x265, I tuned when producing files to be rated via objective metrics.
The AV1 codec also has tuning mechanisms, with tuning for peak signal-to-noise ratio (PSNR) and SSIM available for a few versions, and tuning for VMAF relatively new. However, none of the AV1 codecs I tested enabled adaptive quantization by default, so what these tuning mechanisms were actually doing was unclear. Regarding tuning for VMAF, a Google engineer described this as “adaptive prefiltering (sharpening) of frames prior to standard encode. That boosts VMAF scores up to 30% in BD-rate terms, but of course, PSNR will not be very good.”
Unlike tuning for PSNR and SSIM with x264 and x265, which disables features that improve subjective quality but degrade metric scores, tuning for VMAF enables features that improve metric scores but may degrade subjective quality, which feels like cheating. So I didn’t tune for VMAF in any of the encodes. I will discuss tuning for PSNR/SSIM with the AV1 codecs following the x264/x265 discussion.
TUNING X264 AND X265
x264 offers two tuning mechanisms: PSNR and SSIM. To minimize administrative complexity, I wanted to use one mechanism for all files produced for metric testing. Although this analysis involves only VMAF and SSIMPLUS, I may incorporate PSNR, SSIM, and MS-SSIM in future testing. To choose the optimal tuning strategy for each codec, I encoded four files from four different genres with tuning for SSIM and PSNR enabled along with a file with no tuning. Then I averaged the result, which is shown for x264 in Table 2. The green scores were the highest average score.
As you can see in the table, tuning for PSNR delivered substantially higher VMAF and SSIMPLUS scores, was about even for PSNR, and produced very slight drops in SSIM and MS-SSIM. Accordingly, for x264, I tuned for PSNR. In a similar analysis for x265, tuning for SSIM produced the highest-quality score for all metrics, even PSNR. Accordingly, for x265, I tuned for SSIM.
TUNING FOR AV1 ENCODING
Three of the four AV1 codecs, libaom, Aurora1, and aomenc, offered tuning for SSIM and PSNR; SVT-AV1 didn’t. I performed the same analysis on all of the AV1 codecs that offered tuning. Since the difference between either tuning method and not tuning was modest in almost all cases, I decided to simplify matters and not tune for these three codecs. SVT-AV1 doesn’t offer tuning, simplifying the decision for that codec.
Let’s move on to other encoding parameters and the strings themselves.
General Encoding Parameters and Encoding Strings
The general encoding parameters used are shown in Table 3 and are fairly generic.
With this as background, here are the encoding strings used for the different codecs and a bit about how I created them.
AOMEDIA’S LIBAOM (FFMPEG)
ffmpeg -y -i football_10.mp4 -c:v libaom-av1 -strict -2 -b:v 1890K -g 60 -keyint_min 60 -sc_threshold 0 -row-mt 1 -tile-columns 1 -tile-rows 0 -threads 16 -cpu-used 8 -pass 1 -f matroska NUL &
ffmpeg -y -i football_10.mp4 -c:v libaom-av1 -strict -2 -b:v 1890K -maxrate 3780K -bufsize 3780k -g 60 -keyint_min 60 -sc_threshold 0 -row-mt 1 -tile-columns 1 -an -tile-rows 0 -threads 16 -cpu-used 3 -pass 2 football_libaom_2.mkv
This script was developed with input from Visionular and reviewed by Google. It was also published for comment on several Streaming Learning Center articles:
aomenc.exe football_10.y4m --width=1920 --height=1080 --fps=30000/1000 --passes=2 --lag-in-frames=25 --end-usage=vbr --target-bitrate=1890 --threads=16 --cpu-used=3 --kf-min-dist=60 --buf-sz=2000 --maxsection-pct=200 --kf-max-dist=60 -o football_
VISIONULAR’S AURORA1 CODEC
This script was supplied by Visionular:
aurora_av1enc.exe Football_10.y4m --fps=30000/1001 --passes=2 --tile-columns=1 --tile-rows=0 --end-usage=vbr --target-bitrate=1890 --threads=16 --kf-min-dist=60 --kf-max-dist=60 --buf-sz=2000 --maxsection-pct=200 --preset=slower -o Football_VS_2.webm
This script was developed from the Intel Scalable Video Technology for AV1 (SVT-AV1) Encoder User Guide and reviewed by Intel:
SvtAv1EncApp -i Football_10.y4m -w 1920 -h 1080 --fps-num 30000 --fps-denom 1001 --keyint 64 --lookahead 64 --irefresh-type 2 --rc 1 --tbr 1890 --vbv-bufsize 3780 --preset 8 -b Football_SVT_2.webm --output-stat-file stat_file.stat
SvtAv1EncApp -i Football_10.y4m -w 1920 -h 1080 --fps-num 30000 --fps-denom 1001 --keyint 64 --lookahead 64 --irefresh-type 2 --rc 1 --tbr 1890 --vbv-bufsize 3780 --preset 2 -b Football_SVT_2.webm --input-stat-file stat_file.stat
This is an FFmpeg script for x264, developed by me and used in multiple Streaming Media articles and tests documented on the Streaming Learning Center:
ffmpeg -y -i Football_10.mp4 -c:v libx264 -threads 16 -b:v 1890K -preset veryslow -g 60 -keyint_min 60 -sc_threshold 0 -tune ssim -pass 1 -f mp4 NUL &
ffmpeg -i Football_10.mp4 -c:v libx264 -threads 16 -b:v 1890K -maxrate 3780K -bufsize 3780k -preset veryslow -g 60 -keyint_min 60 -sc_threshold 0 -tune ssim -pass 2 Football_x264_ssim_4.mp4
This is another FFmpeg script, this one for x265, developed by me and used in multiple Streaming Media articles and tests documented on the Streaming Learning Center:
ffmpeg -y -i Football_10.mp4 -c:v libx265 -threads 16 -preset veryslow -tune ssim -x265-params bitrate=1890:keyint=60:min-keyint=60:scenecut=0:open-gop=0:pass=1 -an -f mp4 NUL &
ffmpeg -y -i Football_10.mp4 -c:v libx265 -threads 16 -preset veryslow -tune ssim -x265-params bitrate=1890:vbv-maxrate=3780:vbv-bufsize=3780:keyint=60:min-keyint=60:scenecut=0:open-gop=0:pass=2 -an Football_x265_ssim_4.mp4
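Since each clip needs four encodes at different data rates, the two-pass FFmpeg strings lend themselves to scripting. The Python sketch below shows one way to generate the libaom pair for each rung. It uses only the flags that appear in the libaom strings above and keeps their pattern of setting maxrate and bufsize to twice the target. The function name, the output file names, and every ladder value other than 1890 are hypothetical.

```python
def libaom_two_pass(source, output, kbps, gop=60, threads=16, cpu_used=3):
    """Return (pass1, pass2) FFmpeg argument lists for a two-pass libaom encode.

    Flags mirror the libaom strings shown earlier. NUL is the Windows null
    device, matching the original commands.
    """
    common = [
        "ffmpeg", "-y", "-i", source, "-c:v", "libaom-av1", "-strict", "-2",
        "-b:v", f"{kbps}K", "-g", str(gop), "-keyint_min", str(gop),
        "-sc_threshold", "0", "-row-mt", "1", "-tile-columns", "1",
        "-tile-rows", "0", "-threads", str(threads),
    ]
    # The first pass only gathers statistics, so it runs at the fastest preset.
    pass1 = common + ["-cpu-used", "8", "-pass", "1", "-f", "matroska", "NUL"]
    # The second pass adds the 200% rate cap and the chosen quality preset.
    pass2 = common + [
        "-maxrate", f"{2 * kbps}K", "-bufsize", f"{2 * kbps}k", "-an",
        "-cpu-used", str(cpu_used), "-pass", "2", output,
    ]
    return pass1, pass2


# Hypothetical four-rung ladder; only 1890 comes from the strings above.
ladder_kbps = [1890, 1500, 1200, 900]
for rung, kbps in enumerate(ladder_kbps, start=1):
    first, second = libaom_two_pass(
        "football_10.mp4", f"football_libaom_rung{rung}.mkv", kbps
    )
    print(" ".join(first))
    print(" ".join(second))
```

Returning argument lists rather than a single string means the same output can be printed for review or handed to a process launcher without re-quoting.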
So that’s how I chose the encoding settings. How did performance compare? As promised, you see this in Table 4, which shows the average time, bitrate, and quality results from two separate 10-second encodes. It has three major sections. The top section, AV1 Codecs as Tested, shows the performance achieved as tested in the quality comparisons via the encoding strings shown previously.
Here, we see that Visionular is about 30% faster than FFmpeg at about the same quality level, which should translate to a concomitant reduction in encoding cost if you’re running your own encoding farm. However, SVT-AV1, at cpu-used 2, is about 33% slower than FFmpeg and 125% slower than Visionular.
The second section, Other Codecs as Tested, shows x264 and x265 as configured for the quality comparisons. Here we see that Visionular and x265 take approximately the same time to encode, although Visionular produces higher quality. For reference, it takes about 6 VMAF points to achieve a “just noticeable difference,” or JND, that 75% of viewers will see. So the 2.6 VMAF point differential between x265 and the AV1 codecs probably won’t be noticeable to most viewers. However, while x264 is by far the fastest codec, VMAF quality trails by 13 VMAF points, or more than 2 JNDs, which certainly is meaningful.
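To make the JND arithmetic explicit, the short Python sketch below converts a VMAF differential into JND units using the 6-point rule of thumb quoted above. The function names are hypothetical, and treating one full JND as the cutoff for "noticeable" is a simplifying assumption.

```python
JND_VMAF_POINTS = 6.0  # rule of thumb quoted in the text


def vmaf_gap_in_jnds(vmaf_gap, jnd_points=JND_VMAF_POINTS):
    """Convert an absolute VMAF differential into just-noticeable-difference units."""
    return abs(vmaf_gap) / jnd_points


def likely_noticeable(vmaf_gap, jnd_points=JND_VMAF_POINTS):
    """Assumption: a gap of at least one JND is one most viewers would see."""
    return vmaf_gap_in_jnds(vmaf_gap, jnd_points) >= 1.0


# The two differentials discussed above.
for label, gap in [("x265 vs. AV1 codecs", 2.6), ("x264 vs. AV1 codecs", 13.0)]:
    print(f"{label}: {vmaf_gap_in_jnds(gap):.2f} JND, noticeable: {likely_noticeable(gap)}")
```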
The third section, For Reference, shows two other datapoints. First is x265-slow, which is eight times faster than veryslow, with a drop of only 2 VMAF points. So although producers vary, if you’re comparing the most practical preset for AV1 and x265, you should probably compare x265 using the slow preset versus AV1 using cpu-used 3.
Finally, we see that cpu-used 7 does substantially accelerate SVT-AV1 encoding time, although the predicted 2% quality drop actually proved to be 4.5%, which is still under the JND threshold, but getting closer. Note that all of the other AV1 codecs offered faster encoding modes, so just because I showed a fast encoding time for SVT-AV1 doesn’t mean that it’s the performance king. The reason I ran this test was the funky quality pattern in Figure 2 that appeared to show that cpu-used 7 was a unicorn with both high speed and superior quality. Alas, this was not so in these limited tests.
Anyone who’s run structured encoding comparisons will tell you that it’s never a straight line; you don’t know what you don’t know until you finish your analysis, and any major revelation along the way can invalidate days, if not weeks, of previous work. So it was with SVT-AV1, seemingly getting Intel’s revenge for including the codec in the analysis against its wishes. Here’s what happened.
By way of background, metrics like VMAF and SSIMPLUS output a single score for the entire file. When the numbers vary significantly, it’s useful to compare the scores over the duration of the video, as you can see in Figure 3 from the Moscow State University Video Quality Measurement Tool, which compares the highest-quality Sintel file output by Visionular (in red) and SVT (in green).
Not to date myself, and I never was a Lost in Space fan, but whenever you see a quality deficit like this at the start of the file, you have to wonder whether there are startup problems that only impact the first GOP or two, then resolve. These problems are irrelevant even in the context of a 2-minute test file, but very significant in a 10-second clip.
I checked the CSV VMAF output files and found that frames 46–48 rated 88.28, 88.40, and 86.68. Frame 49 was the start of the new GOP, and frames 49–51 rated 93.58, 93.23, and 93.05. Obviously, there was a problem.
To resolve it, I added a 4-second stub to the start of these files in FFmpeg, re-encoded, and then trimmed off those 4 seconds before re-measuring VMAF and SSIMPLUS. Figure 4 shows the same Sintel comparison after the fix. While Visionular was still higher, the differential was much less, and the scores were more realistic. If you test SVT-AV1 yourself, or read other reviews, be sure to check for these startup issues, which can invalidate scoring for shorter clips.
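One way to check for this kind of startup problem is to average the per-frame scores GOP by GOP and compare the first GOP with the rest. The Python sketch below assumes the per-frame VMAF scores have already been read from the CSV into a list. The function names, the 3-point threshold, and the synthetic scores are all assumptions for illustration, loosely modeled on the frame 46–51 values quoted above.

```python
def gop_means(frame_scores, gop_length):
    """Average per-frame scores over consecutive GOPs of gop_length frames."""
    means = []
    for start in range(0, len(frame_scores), gop_length):
        gop = frame_scores[start:start + gop_length]
        means.append(sum(gop) / len(gop))
    return means


def has_startup_deficit(frame_scores, gop_length, threshold=3.0):
    """Return True if the first GOP averages at least `threshold` points below the rest.

    The threshold is an arbitrary illustrative value, not a tested cutoff.
    """
    means = gop_means(frame_scores, gop_length)
    if len(means) < 2:
        return False  # nothing to compare against
    later = means[1:]
    return (sum(later) / len(later)) - means[0] >= threshold


# Synthetic example: a 48-frame first GOP near 88, then a second GOP near 93.
scores = [88.0] * 48 + [93.0] * 48
print(gop_means(scores, 48))
print(has_startup_deficit(scores, 48))
```

Running the same check on the re-encoded file, after the stub has been removed, is a quick way to confirm that the deficit has gone away.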
Please Just Tell Me the Dang Scores
OK, enough journey, let’s arrive. To recount, I tested with both VMAF and SSIMWAVE’s SSIMPLUS metric. In the VMAF comparison (see Figure 5), Visionular was the overall leader, with all AV1 codecs besting x265. Note that I left x264 out of the rate-distortion curves to better present the clumped datapoints at the top of the chart.
Table 5 shows the BD-Rate computations for all codecs, including x264, which I computed using the well-known Excel macro (go2sm.com/slcbdrate). The easiest way to read this is to pick a codec/line and then read column by column. Negative numbers mean the line/codec is more efficient than the column/codec; positive numbers, the reverse.
So choosing the Visionular line, the Aurora1 codec can produce the same quality as FFmpeg at a 6.41% lower data rate, the same quality as x264 at a 49.76% lower data rate, and so on. Negative numbers are good, and having all negative numbers on a single line identifies the highest-quality technology.
In contrast, when scoring with VMAF, you’d have to boost the data rate of your x265 encodes by 23.15% to produce the same quality as libaom, although x265 proved 35.20% more efficient than x264.
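For readers who want to see what a BD-Rate figure represents, the Python sketch below follows the commonly described Bjøntegaard approach: fit the logarithm of bitrate as a cubic function of quality for each codec, average the difference between the two curves over the quality range they share, and convert back to a percentage. It is a minimal sketch under those assumptions, not a reproduction of the Excel macro, and the function names and rate/quality points are hypothetical.

```python
import math


def _log_rate_at(quality, points):
    """Lagrange interpolation of ln(bitrate) at a given quality score.

    With four (bitrate, quality) points this is the cubic through all of them.
    """
    total = 0.0
    for i, (rate_i, q_i) in enumerate(points):
        term = math.log(rate_i)
        for j, (_, q_j) in enumerate(points):
            if i != j:
                term *= (quality - q_j) / (q_i - q_j)
        total += term
    return total


def _mean_log_rate(points, q_low, q_high, steps=200):
    """Average ln(bitrate) over [q_low, q_high] using Simpson's rule (steps must be even)."""
    h = (q_high - q_low) / steps
    area = _log_rate_at(q_low, points) + _log_rate_at(q_high, points)
    for k in range(1, steps):
        weight = 4 if k % 2 else 2
        area += weight * _log_rate_at(q_low + k * h, points)
    return (area * h / 3.0) / (q_high - q_low)


def bd_rate(reference, test):
    """Percent bitrate change of `test` versus `reference` at equal quality.

    Each argument is a list of (bitrate_kbps, quality) pairs with distinct
    quality values. Negative results mean `test` needs less bitrate.
    """
    q_low = max(min(q for _, q in reference), min(q for _, q in test))
    q_high = min(max(q for _, q in reference), max(q for _, q in test))
    if q_high <= q_low:
        raise ValueError("the two curves share no quality range")
    diff = _mean_log_rate(test, q_low, q_high) - _mean_log_rate(reference, q_low, q_high)
    return (math.exp(diff) - 1.0) * 100.0


# Hypothetical points: the test codec reaches each quality level at half the bitrate.
reference_codec = [(900, 85.0), (1200, 89.0), (1500, 92.0), (1890, 95.0)]
test_codec = [(rate / 2, quality) for rate, quality in reference_codec]
print(f"BD-Rate: {bd_rate(reference_codec, test_codec):.2f}%")
```

Because the comparison is restricted to the overlapping quality range, the four encodes per codec need to land in roughly the same quality band, which is one reason for targeting a fixed VMAF range when choosing data rates.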
Figure 6 shows the rate-distortion curves for the SSIMPLUS metric. Although Visionular is still the overall leader, with SSIMPLUS, x265 crept in front of SVT-AV1.
The BD-Rate figures in Table 6 provide valuable additional detail. Again, having all negatives means that Visionular delivered the best quality, although the gap with x265 narrowed considerably. In both metrics, the quality difference between aomenc and libaom was minor, with the advantage going to aomenc.
Weaving an Imperfect Tapestry
Whenever I finish a data-intensive story like this, I think of the Greek myth of Arachne, whom Athena turned into a spider for weaving a perfect tapestry. As I learned back in fourth grade, that’s why all Greek weavers include at least one error in each weave.
This story has more than 2,000 data points among the various test files, codecs, data rates, encoding times, quality metrics, preset trials, tuning mechanisms, and the like. So although I did my best, I’m pretty sure I don’t have to deliberately insert an error to avoid Arachne’s fate. If you spot anything that looks wrong, I apologize in advance, and please let me know.
[Editor’s Note: The author has continued his testing and updated it since he submitted the article for publication in the September issue of Streaming Media magazine. As a result, some of the data points in this article are different from what appears in the magazine. The data in this version is the most recent and accurate.]