In August 2020, the Alliance for Open Media created a software working group to “use the Scalable Video Technology for AV1 (SVT-AV1) encoder developed by Intel…to create AV1 encoder implementations that deliver excellent video compression across applications in ways that remove computational complexity trade-offs for an ever-growing video delivery marketplace.” Testing published around that time indicated that SVT-AV1 had quite a hill to climb to stand out among other AV1 codecs.
For example, in a comparison published a month later, I found SVT-AV1 last among the four AV1 codecs tested (libaom, Visionular Aurora1, aomenc, and SVT-AV1), though only about 3% less efficient than FFmpeg/libaom-AV1. At the time, SVT-AV1 had several critical deficits, including incomplete two-pass rate control. In its December 2020 report, Moscow State University found SVT-AV1 four percentage points behind the Alliance for Open Media’s (AOMedia) libaom and 25 percentage points behind Visionular’s Aurora1 codec.
With the recent launch of version 1.0, SVT-AV1 appears to have caught up with libaom in quality while offering clear performance advantages. Two-pass rate control is tested and proven. If you’re creating an AV1 encoding workflow today that emphasizes encoding speed and quality, SVT-AV1 definitely belongs on your short list.
About Scalable Video Technology
Let’s start with a brief introduction to what Scalable Video Technology (SVT) is and how it works. According to this Intel white paper, “The SVT architecture is designed to maximize the performance of an SVT encoder on Intel Xeon Scalable processors. It is based on three-dimensional parallelism.” Most important of the three is segment-based parallelism, which “involves the splitting of each picture into segments and processing multiple segments of a picture in parallel to achieve better utilization of the computational resources with no loss in video quality.”
This technique runs counter to the view that encoding each frame in its entirety delivers the best quality. For example, here’s a quote from the Avidemux encoding guide about slices: “H.264 allows the encoder to segment each frame into several parts. These parts are called ‘slices.’ The advantage of using multiple slices (per frame) is that the slices can be processed independently and in parallel. This allows easy multi-threading implementations in H.264 encoders and decoders. Unfortunately, using multiple slices hurts compression efficiency! The more slices are used, the worse!”
So, a big part of what SVT attempts to do is split the picture to gain processing efficiency while retaining quality. Early efforts were not encouraging. As shown in Figure 1, from the aforementioned Moscow State report, SVT-HEVC was 51 percentage points behind x265 and SVT-VP9 was an astonishing 129 percentage points behind VP9, which made the 4-percentage-point delta between SVT-AV1 and libaom seem like a breakthrough.
Now that you’re familiar with SVT-AV1, let’s explore the encoding parameters that I used and the quality comparisons.
For the record, I tested version 1.0.0 of SVT-AV1 as provided by a member of the Intel SVT-AV1 development team. I tested FFmpeg version 2022-06-09-git5d5a014199, downloaded from www.gyan.dev. I performed all encoding tests on a 40-core HP Z840 workstation running Windows 7, with two Intel Xeon E5-2687W v3 CPUs running at 3.1 GHz and 32 GB of RAM.
Choosing a Preset
Codec developers create presets to configure groups of encoding parameters that control the encoding time/encoding quality tradeoff. This allows codec users to choose the level of cost and quality appropriate for their particular application. Whenever you start working with a new codec or encoder, you should benchmark the codec with your own source footage to explore these tradeoffs and make the best decision for you.
To do this, select several representative test clips, encode them using all the presets with otherwise identical settings, time each encode, and measure the quality. With FFmpeg, you control the libaom-AV1 preset using the -cpu-used switch, with settings ranging from 0 to 8 and a default of 1.
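As a minimal sketch of that benchmarking loop, the script below prints one FFmpeg/libaom-AV1 command per preset; the clip name, bitrate, and other settings are placeholders to adapt to your own footage. Pipe the output to sh (ideally under a timer) to run the encodes.

```shell
# Dry-run preset sweep: emit one libaom-AV1 command per -cpu-used preset
# (0 = slowest/highest quality, 8 = fastest). Clip name, bitrate, and
# output names are placeholders -- substitute your own test footage.
clip=Football_10.mp4
cmds=""
for p in 0 1 2 3 4 5 6 7 8; do
  cmd="ffmpeg -y -i $clip -c:v libaom-av1 -b:v 1500K -g 60 -cpu-used $p -row-mt 1 -threads 8 preset_${p}.mkv"
  cmds="$cmds$cmd
"
done
printf '%s' "$cmds"    # pipe this output to sh to execute the nine encodes
```

Timing each command (for example with `time sh -c "..."`) and running VMAF on each output gives you the raw data for a table like Table 1.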
Table 1 shows the average results for two ten-second test clips when encoding with FFmpeg and libaom-AV1. To explain, with preset 0, the highest-quality preset, it took an average of 3:24:33 (hours:min:sec) to encode a ten-second test clip (which is why it’s challenging to test with longer clips). With the fastest/lowest-quality preset, it took 1:06 (min:sec). This tells you that on this test bed, FFmpeg/libaom-AV1 isn’t capable of encoding a live stream; in fact, its best performance is still roughly 7x slower than real time.
For a measure of overall quality, I use VMAF computed via the harmonic mean method (see here for an explanation of harmonic mean). To assess transient quality, I use low-frame VMAF, which is the lowest VMAF score for any frame in the test file.
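Harmonic mean pooling penalizes low-scoring frames more heavily than a simple average, which is why it works well as an overall quality measure. A minimal sketch, assuming one per-frame VMAF score per line (and no zero scores, which would divide by zero):

```shell
# Harmonic mean of per-frame scores: n / sum(1/x_i).
# A few bad frames drag this down more than an arithmetic mean would.
harmonic_mean () {
  awk '{ s += 1 / $1; n++ } END { printf "%.2f\n", n / s }'
}
# Illustrative scores with one transient dip to 60:
hm=$(printf '95\n90\n60\n' | harmonic_mean)
echo "$hm"    # prints 78.32; the arithmetic mean of the same scores is 81.67
```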
In the Delta row at the bottom, the time delta divides the slowest time by the fastest and shows that the slowest preset took 187.37 times longer than the fastest. You also see that the overall VMAF difference between the fastest and slowest presets is 3.77. For perspective, Netflix has stated that a difference of 6 VMAF points represents a just noticeable difference (JND), though other researchers have found that 3 VMAF points constitutes a JND. Either way, the difference between the highest- and lowest-quality presets isn’t significant, particularly, as you will see, compared to SVT-AV1.
To visualize the encoding time/quality tradeoff, I plot the three factors – time, VMAF, and low-frame VMAF – for each preset on a scale from 0 (fastest preset/lowest quality) to 100 (slowest preset/highest quality). You see this in Figure 2.
Every application is different, and every producer dances to their own particular tune. With my fictional VOD content-producer hat on, I see preset 4 as the logical starting point, with a substantial jump in both VMAF and low-frame VMAF. Would I increase encoding costs by roughly 50% to achieve a 0.4 VMAF improvement with preset 3? Probably not. Unless I’m shipping extremely high stream volumes, I wouldn’t consider presets 2, 1, or 0.
As an aside, though we only look at encoding time and quality in this analysis, a third factor, bandwidth, is also in play. That is, all publishers should start this analysis with a target quality level they will achieve by choosing a preset and bitrate. With preset 2, the bitrate necessary to achieve that target quality level will be less than for preset 3, so bandwidth savings will increasingly offset the encoding time costs as viewing volume increases.
At relatively low volumes, choosing a faster preset and saving on encoding time is probably the best strategy. If your streams will be viewed hundreds of thousands of times or more, it might make more sense to pay more for encoding and save bandwidth. I explore these issues in an article entitled “Choosing an x265 Preset – An ROI Analysis.” For most producers, I would assume that presets 4 and 3 are the most relevant choices for FFmpeg/libaom-AV1.
Choosing a Preset – SVT-AV1
Now let’s look at SVT-AV1. Table 2 shows the same data points for SVT-AV1 presets 0–12 (the full range actually runs from -2 to 13, with a default of 10). The results reveal several obvious points.
First, the ranges of encoding time, VMAF, and low-frame VMAF are much, much greater than libaom-AV1’s. In particular, three presets (10, 11, and 12) are capable of real-time encoding, with preset 9 very close, though the quality disparity is significant, extending to two JNDs by Netflix’s numbers and close to three JNDs for low-frame VMAF.
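The same kind of preset sweep works with the standalone encoder. The dry-run sketch below uses SvtAv1EncApp command syntax with placeholder input and bitrate; adapt it to your own clips before running.

```shell
# Dry-run sweep over the SVT-AV1 presets tested here (0-12; the full
# range runs from -2 to 13). Input name and bitrate are placeholders.
clip=input.y4m
cmds=""
for p in 0 1 2 3 4 5 6 7 8 9 10 11 12; do
  cmd="SvtAv1EncApp -i $clip --rc 1 --tbr 1500 --keyint 2s --preset $p --lp 8 -b preset_${p}.ivf"
  cmds="$cmds$cmd
"
done
printf '%s' "$cmds"    # pipe this output to sh to execute the encodes
```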
Figure 3 charts the encoding time/quality tradeoff. From a VOD perspective, it appears that preset 6 is the starting point, with most producers choosing somewhere between 2 and 4. As detailed above, as the anticipated view counts for your videos increase, you should gravitate towards a higher-quality preset.
In terms of the bigger picture, the range of performance and quality makes SVT-AV1 much more usable than libaom-AV1, enabling, as mentioned, even live AV1 applications. I don’t know what configuration options are available within libaom-AV1, but it would be useful if its developers explored ways to broaden the spread of encoding time and quality to make that codec as flexible as SVT-AV1.
Choosing the Thread Count
Now that we’ve chosen a preset, let’s cover threads. This analysis will help you understand which thread count to include in your command string and help you choose the optimal cloud instance or encoding strategy on a multiple-core computer.
With FFmpeg/libaom-AV1, you control the number of CPU threads applied to the encode with the -threads switch. Table 3 shows the analysis that I go through when attempting to identify the optimal setting for any configuration option. The baseline column shows the result when no setting is in the command string, which invokes the default. Each subsequent column shows the results from configuring the otherwise identical command string to use 1, 2, 4, 8, 16, and 32 threads on the 40-core HP workstation. The Delta column shows the difference between the highest and lowest scores.
You see the results in encoding speed, bitrate, and three quality variables, harmonic mean VMAF, low-frame VMAF, and standard deviation, the last a measure of quality variability in the stream. The green background identifies the best score, the yellow background the worst.
In terms of performance, not surprisingly, 1 thread is the slowest option by far. We also see that while 16 threads is the fastest setting, the performance difference between 16 and either 8 or 32 is negligible. From this, I’d guess that the maximum number of threads libaom-AV1 can effectively utilize is 8.
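Given that finding, a defensive scripting default is to cap the thread count at 8, or at the machine’s core count if that is lower. A small sketch (assuming the common `nproc` utility; falls back to 8 where it’s unavailable):

```shell
# Cap -threads at 8 -- the point past which libaom gained nothing in
# these tests -- or at the machine's core count if lower.
cores=$(nproc 2>/dev/null || echo 8)
threads=$(( cores < 8 ? cores : 8 ))
echo "using -threads $threads"
```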
Surprisingly, the single-threaded encode was the lowest quality in all three measures, though the Delta column shows that the differences are irrelevant. The quality results for almost all other alternatives are identical, so production efficiency should be the focus. Clearly any setting over 8 threads makes no sense, and if you’re provisioning cloud instances, 8 should be the maximum as well. But is 8 the optimal thread count? Table 4 tells the tale.
Using the average encoding times shown in Table 3, Table 4 computes the number of hours it would take to encode an hour of AV1 video using each thread count. Then it adds the hourly cost of AWS compute instances from here, and computes the cost per hour for the four thread counts shown.
Interestingly, you achieve the cheapest cost per hour using a single-threaded machine. Why would this be? Because, as shown in Figure 4, the encoding cost increases linearly while the additional threads deliver diminishing speed gains. Going from one thread to two doubles the cost but only increases encoding speed by 1.8x. Going from one thread to eight increases costs by 8x but only increases throughput by 2.99x.
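The tradeoff reduces to simple arithmetic: the relative cost per encoded hour is the cost multiple divided by the speedup. A quick sketch using the figures just cited:

```shell
# Relative cost per encoded hour when adding threads = cost multiple / speedup.
rel_cost () { awk -v c="$1" -v s="$2" 'BEGIN { printf "%.2f\n", c / s }'; }
rel_cost 2 1.8    # 1 -> 2 threads: 2x cost, 1.8x speed  -> 1.11
rel_cost 8 2.99   # 1 -> 8 threads: 8x cost, 2.99x speed -> 2.68
```

Anything above 1.00 means the added threads cost more than the speed they buy, which is exactly the pattern Table 4 shows.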
Of course, this analysis assumes that the work involved in provisioning and managing many more encoding stations doesn’t outweigh the cost savings. Either way, provisioning encoding stations with more than 8 cores likely doesn’t make economic sense, and lower thread counts might be more cost efficient.
Working Efficiently on Multiple-Core Encoding Stations
The same logic should apply to spreading production encodes over a multiple-core workstation. On a sixteen-core workstation, for example, you might achieve faster throughput with four encodes using 4 threads each as opposed to two encodes using 8 threads.
Of course, running multiple encodes adds some overhead that slows overall operation. For example, on my 40-core workstation, a single encode of the 10-second Football test clip took 4:23 (min:sec). When I encoded eight files simultaneously, the average time increased to 5:49, about 32% higher. Still, if you have the ability to deploy multiple instances on a single workstation, some experiments with different thread values will provide useful direction.
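The simultaneous-encode pattern itself is simple shell job control; below is a sketch with the actual encode command left as a placeholder (a hypothetical wrapper script name is shown in the usage comment).

```shell
# Launch one background encode per input file, then wait for all of them.
# The first argument is a single-word command or wrapper script; each
# remaining argument is an input file.
parallel_jobs () {
  cmd="$1"; shift
  for f in "$@"; do
    "$cmd" "$f" &       # each encode runs as its own background job
  done
  wait                  # block until every background encode finishes
}
# Usage sketch (hypothetical wrapper that encodes one file with 4 threads):
#   parallel_jobs ./encode_4threads.sh clip1.mp4 clip2.mp4 clip3.mp4 clip4.mp4
parallel_jobs echo clip1.mp4 clip2.mp4   # harmless stand-in demonstration
```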
Choosing the Optimal Thread Count with SVT-AV1
Given the explanation of SVT shared above, you’d expect better performance at higher thread counts, and SVT delivers. Still, as you’ll see, the same analysis does less to sell multiple-core Xeon processors than you might think.
Table 5 shows the encoding speed/quality tradeoff associated with SVT-AV1’s -lp switch, which controls the number of logical processors assigned to any encoding task. Baseline is fastest because it appears to assign all logical processors to the task, though baseline is only slightly faster than 32 threads.
From a quality perspective, a single thread again delivers the best quality, but the delta is irrelevant. This makes encoding throughput and cost the most important factors in choosing the thread count (and -lp value). In this regard, the surprisingly diminishing speed returns from the additional threads dictate the results shown in Table 6.
As you can see, the jump from 1 thread to 8 threads delivers a throughput gain slightly greater than the corresponding increase in Amazon’s CPU charges, making 8 threads the cheapest encoding option by a hair. From there, however, the decreasing speed gains mean an ever-increasing cost per hour at higher thread counts. These findings suggest that encoding configurations that exceed eight threads might not be cost-effective.
These results come with all the usual caveats; your findings may certainly vary. I performed these tests on 1080p 8-bit content; the results for 4K and 8K HDR footage might be completely different. I’m also predicting cloud throughput from results produced on an older desktop machine; results on newer instances may differ. Intel vs. AMD is another potential differentiator.
The high-level point is that with both libaom-AV1 and SVT-AV1, you shouldn’t assume that more cores deliver the most cost-effective throughput. If you’re getting ready to scale up your AV1 encoding and you need to figure out which workstations to buy or which cloud instances to provision, a day or two of this kind of testing with your sample footage and target output should provide very clear direction.
Which takes us to our quality bakeoff.
Here’s the encoding string that I used for FFmpeg/libaom-AV1. Several of the options shown are the defaults, meaning you’d get the same result if you removed them. I like to leave them in because it simplifies comparing the string to those used in other comparisons.
Note that I tested with -cpu-used 8 in the first pass and -cpu-used 4 in the second. That’s because the preset used in the first pass doesn’t impact overall quality. I set threads to 8 for maximum single-encoding-instance throughput on my workstation.
ffmpeg -y -i Football_10.mp4 -c:v libaom-av1 -b:v 1500K -g 60 -keyint_min 60 -cpu-used 8 -auto-alt-ref 1 -threads 8 -tile-columns 1 -tile-rows 0 -row-mt 1 -lag-in-frames 25 -pass 1 -f matroska NUL & \
ffmpeg -y -i Football_10.mp4 -c:v libaom-av1 -b:v 1500K -maxrate 3000K -g 60 -keyint_min 60 -cpu-used 4 -auto-alt-ref 1 -threads 8 -tile-columns 1 -tile-rows 0 -row-mt 1 -lag-in-frames 25 -pass 2 Football_1.mkv
Here’s the command string used for SVT-AV1. For these tests, I wanted to get as close to the same encoding time for both codecs as possible. With -cpu-used 4 in the second pass, FFmpeg delivered the files in 4:24 (see Table 1). I used preset 3 for SVT-AV1 as it delivered the files in a slightly faster 3:48 (Table 3). Note that I used three-pass encoding to encode all SVT-AV1 output produced for this article, though the first and second passes are very, very fast.
SvtAv1EncApp -i input.y4m --rc 1 --tbr 1500 --mbr 3000 --keyint 2s --preset 3 --passes 3 --lp 8 --tile-columns 0 --tile-rows 0 --enable-tf 1 -b output.ivf
I also used -lp 8 for throughput and to match the libaom setting.
Overall, I tested 17 files ranging in duration from one to four minutes with four encodes each to produce output to present in a rate-distortion graph and use to compute BD-rate results. I’m told that adding the results to present a composite graph is mathematically incorrect, but I find it useful as a general gauge of the overall result. So please don’t show Figure 5 to your mathematically inclined colleagues.
As you can see, SVT-AV1 wins at lower bitrates while libaom prevails at higher bitrates. Overall, according to the BD-rate composite computation, SVT-AV1 produced the same quality as libaom-AV1 with a bitrate savings of 1.36%.
Feeling a bit let down? Read all the way to the end only to find that SVT-AV1 delivered only a minuscule bandwidth savings? When I last reviewed SVT-AV1, the codec needed to increase bandwidth by 4% to match libaom-AV1 quality and was actually slower as tested.
Now, SVT-AV1 slightly exceeds libaom-AV1 quality while enabling software-based live AV1 encoding. Not bad for version 1.0. While this may not trigger a mass exodus from libaom-AV1 to SVT-AV1, it does enable a completely different set of potential AV1 applications which can only accelerate AV1 adoption.
During my tests, I had to convert the source MP4 files to Y4M format to encode with the SVT-AV1 standalone encoder. Obviously, operation within FFmpeg would eliminate this and simplify integrating SVT-AV1 encoding into existing FFmpeg-based workflows.
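One common workaround (a sketch I did not use for this article’s tests) pipes FFmpeg’s Y4M output straight into the standalone encoder, avoiding the intermediate file; SvtAv1EncApp accepts Y4M on stdin via -i stdin. Filenames and settings below are placeholders.

```shell
# Pipe Y4M from FFmpeg directly into SvtAv1EncApp -- no intermediate file.
# Printed here as a dry run; execute it with:  eval "$pipe_cmd"
pipe_cmd='ffmpeg -i input.mp4 -nostdin -f yuv4mpegpipe - | SvtAv1EncApp -i stdin --preset 10 -b output.ivf'
echo "$pipe_cmd"
```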
While you can access SVT-AV1 within some FFmpeg builds, it’s single-pass only. I asked a colleague if and when FFmpeg might get full support for SVT-AV1, and he noted that SVT-AV1 supports two- and three-pass encoding, while FFmpeg supports only two-pass. Apparently, adding the three-pass capability to FFmpeg is a lot of work and probably won’t happen until the end of 2022 at the earliest.
Another open question is the continued vitality of the libaom-AV1 codec in FFmpeg given that the Alliance for Open Media has focused its software working group on SVT-AV1. I sent a question to a contact at the Alliance for Open Media about its plans to keep updating libaom-AV1, and AOMedia’s own standalone encoder (aomenc), but hadn’t heard back by press time. Check the website for any updates.
Looking at prominent AV1 publishers, YouTube has been producing AV1 with FFmpeg/libaom-AV1 for years. Since switching YouTube’s FFmpeg-based encoding farm over to SVT-AV1 would require significant resources for modest VOD gains, it seems likely that AOMedia will continue to support libaom-AV1 (and its largest user) at least until full SVT-AV1 support, including three-pass encoding, is available within FFmpeg, and probably a whole lot longer.