The predominant use for ASIC-based transcoders like NETINT’s Quadra Video Processing Unit (VPU) has been live transcoding for 4K60P HDR streaming, cloud gaming, interactive video applications, and other high-volume applications. In this post, we explore using Quadra VPUs for large-scale VOD transcoding for applications like user-generated content sites and social media.
As you probably know, both Meta and Google have developed their own encoding ASICs, with Google’s Argos ASIC reportedly replacing over 10 million CPUs dedicated to transcoding. While few, if any, services shoulder the same encoding load as YouTube, our tests reveal that a fully loaded Quadra video server ($21,000) should be able to replace up to 46 servers transcoding via CPUs.
About NETINT, ASICs, and VPUs
Just a brief bit about NETINT for those unfamiliar with the company, technology, and products. NETINT was founded in 2015 and launched its first ASIC-based transcoding product in 2018.
ASIC stands for Application-Specific Integrated Circuit, which in this case means a chip purpose-built for transcoding. Compared to general-purpose CPUs and GPUs, which perform many more functions and devote less real estate to transcoding, NETINT’s transcoding ASICs are less expensive, more efficient, deliver much greater throughput, and consume much less power.
NETINT’s current generation of ASIC-based transcoders is the Quadra, which we call a video processing unit, or VPU, because in addition to transcoding, it also performs scaling and overlay and has 15 TOPS of AI processing. Quadra is available as a standalone product with three form factors and also in the Quadra Video Server, an integrated Ubuntu-based server with ten Quadra VPUs.
All NETINT products are designed as drop-in replacements or for expansion of existing hardware or CPU-based transcoding systems. As such, you can control transcoding operations via FFmpeg, GStreamer, or an SDK with load management provided in the basic software stack.
Back to Our Analysis
Getting back to our analysis, let’s start with a look at quality. Early hardware transcoders of all designs – CPUs, FPGAs, and ASICs – were rightfully criticized for subpar quality. While it’s true that software-based transcoding can produce higher quality than most hardware (using multiple CPUs and/or many times slower than real-time), NETINT’s ASICs are used by several premium OTT brands worldwide and for many other applications. As you’ll see in this analysis, Quadra’s output quality is quite competitive vs. the x264 codec.
Clearly, though, most UGC sites use lower quality standards than premium content publishers. According to this article, premium services targeted around 95 VMAF points for their top rung, while YouTube targeted ~92 VMAF points, with Meta at ~85 VMAF points.
This got us thinking. Though most Quadra customers use the VPU for live transcoding, how would Quadra’s quality and throughput compare to x264 when targeting UGC quality levels?
Transcoding UGC: Test Procedures
To assess this, we downloaded 27 random test clips from YouTube’s UGC Dataset, described as “A large scale dataset containing YouTube User Generated Content intended for video compression and quality assessment research.” This allowed us to evaluate suitability for UGC with actual UGC video files, each 20 seconds long.
Figure 2. We tested UGC clips from this YouTube database.
Many (if not most) large-scale services use some form of content-adaptive encoding to choose the target bitrates for their encodes. Given the range of content we were evaluating, a single target bitrate for all 1080p encodes made no sense. To find a target appropriate for the UGC quality levels, we encoded all files with x264 using a CRF value of 27, which delivered a VMAF score of around 90-91. This provided a data rate target.
For x264, I created a single-pass command string that targeted the same bitrate but used 200% constrained VBR. A sample command string looked like this:
ffmpeg -y -i Anim_1.mp4 -b:v 3200k -maxrate 6400k -bufsize 6400k -g 60 -rc-lookahead 40 -an -c:v libx264 Anim_1_x264_3200.mp4
I’m aware that 40 is the default value for lookahead with x264 but wanted to make sure that the command string matched that used by Quadra as closely as possible. I assumed that most services producing UGC use the x264 medium preset, which is the default, to achieve high throughput and low cost per stream. Since I didn’t specify otherwise in the command string, that’s the preset used by FFmpeg.
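To make the relationship between the target bitrate and the 200% cap explicit, here is a minimal Python sketch that rebuilds the x264 command string above from a target bitrate. The function name `x264_cvbr_args` and its parameters are hypothetical helpers for illustration; the FFmpeg flags are only those that appear in the sample command, and setting the buffer size equal to the maximum rate simply mirrors that sample.

```python
def x264_cvbr_args(src, dst, target_kbps, cap_ratio=2.0, gop=60, lookahead=40):
    """Build an FFmpeg argument list for single-pass constrained VBR with libx264.

    cap_ratio=2.0 corresponds to the 200% constrained VBR described above.
    """
    maxrate = int(target_kbps * cap_ratio)
    # Assumption: bufsize equals maxrate, as in the sample command
    # (6400k for a 3200k target, i.e. two seconds of video at the target rate).
    bufsize = maxrate
    return [
        "ffmpeg", "-y", "-i", src,
        "-b:v", f"{target_kbps}k",
        "-maxrate", f"{maxrate}k",
        "-bufsize", f"{bufsize}k",
        "-g", str(gop),
        "-rc-lookahead", str(lookahead),
        "-an", "-c:v", "libx264",
        dst,
    ]


cmd = x264_cvbr_args("Anim_1.mp4", "Anim_1_x264_3200.mp4", 3200)
print(" ".join(cmd))
```

Returning a list rather than a single string makes it straightforward to pass the result to `subprocess.run` without shell quoting issues.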
For the Quadra, I created a single-pass command string that used a 2-second VBV buffer and a 40-frame lookahead and enabled Rate-Distortion Optimization, a technique that improves quality slightly (for H.264) but also decreases throughput slightly. The Quadra command string looked like this:
ffmpeg -y -c:v h264_ni_quadra_dec -xcoder-params "out=hw" -i Anim_1.mp4 -y -c:v h264_ni_quadra_enc -xcoder-params "gopPresetIdx=5:RcEnable=1:intraPeriod=60:lookaheadDepth=40:EnableRdoQuant=1:rdoLevel=1:vbvBufferSize=2000:bitrate=3200000:zeroCopyMode=0" Anim_1_Qua_H264_3200.mp4
I used single pass because I assumed that most services want to minimize costs and would use a single-pass technique like capped CRF for encoding. Though both the Quadra and x264 support capped CRF, matching the bitrates was essential to a fair comparison, and this would be nearly impossible using capped CRF for both. As evaluated, the bitrates of each pair of encodes differed by less than 4%.
Quality Results
During our testing and VMAF comparisons, which I performed in the Moscow State University Video Quality Measurement Tool, I noticed that the first two seconds or so of many of the x264 encoded videos suffered from low quality. You can see this in Figure 3, the VQMT Results Plot that shows the VMAF score for each frame over both 20-second files; x264 in red and Quadra in green.
As you can see, the red x264 scores suffer a drop at the start and then recover at around 60 frames. Since this occurred in many clips, I excluded the first 60 frames from the VMAF calculation.
Table 1 shows the overall results in all test categories, each of which included one to four clips. The Quadra bitrate column indicates that the data is within 98-102% of the bitrate of the x264 clip. I computed the VMAF score using the Harmonic mean method, which incorporates quality variations (see here). As you can see, rather than the typical premium content target of 94-95 VMAF points, we successfully mimicked the YouTube targets for H.264 encoded video.
Category | x264 Bitrate (kbps) | Quadra Bitrate (kbps) | x264 VMAF | Quadra VMAF |
Animation | 3,225 | 3,209 | 88.08 | 89.36 |
Cover Band | 3,245 | 3,237 | 89.55 | 91.92 |
How-to | 3,549 | 3,565 | 91.18 | 91.52 |
Lecture | 893 | 893 | 94.56 | 95.31 |
News Clips | 5,759 | 5,813 | 90.77 | 90.98 |
Sports | 3,126 | 3,120 | 91.06 | 91.99 |
TV | 6,849 | 6,842 | 91.39 | 91.65 |
Average | 3,807 | 3,811 | 90.94 | 91.82 |
Table 1. Comparative bitrates and VMAF scores using the harmonic mean.
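As a rough illustration of the pooling described above, the following Python sketch drops the first 60 frames and then computes the harmonic mean of the remaining per-frame VMAF scores. The function name `pooled_vmaf` and the synthetic scores are hypothetical; the measurement tool may apply its own conventions (for example, an offset to guard against zero scores), so treat this as a sketch of the idea rather than the tool's exact method.

```python
from statistics import harmonic_mean, mean


def pooled_vmaf(frame_scores, skip_frames=60):
    """Return (harmonic mean, arithmetic mean) of per-frame scores after skipping a warm-up."""
    kept = list(frame_scores)[skip_frames:]
    if not kept:
        raise ValueError("no frames left after skipping the warm-up period")
    return harmonic_mean(kept), mean(kept)


# Synthetic example: a weak first two seconds, then a clip whose quality varies.
scores = [40.0] * 60 + [90.0] * 100 + [60.0] * 100
hm, am = pooled_vmaf(scores)
print(f"harmonic mean: {hm:.2f}, arithmetic mean: {am:.2f}")
```

With these synthetic scores, the harmonic mean comes out lower than the arithmetic mean, which is how it penalizes clips with quality swings.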
Figure 4 shows the same data in graphic form. Not a huge difference, but Quadra clearly holds its own as compared to the x264 medium preset when encoding UGC content to UGC quality levels, which was the object of the exercise.
With the mission accomplished from a quality perspective, let’s look at throughput.
Comparative Throughput
I compared throughput on a Dell Server equipped with a 2.2 GHz AMD Ryzen 5 5600x 6-core/12-thread CPU running Ubuntu 20.04.3 LTS with 16 GB of RAM. The Quadra VPU is a T1U device. I performed all tests using FFmpeg, using version 5.0 to drive the Quadra transcoder, and version 6.0 for x264. I used a 12-minute 1080p30 file as the source for all tests.
To compare throughput, I first needed to determine the number of simultaneous jobs that consumed about 100% of available resources. Any less than 100% and I’d be wasting resources; adding jobs after 100% might actually decrease throughput because the OS would have to juggle more tasks.
Quadra has a utility that displays system load on Quadra’s four main hardware components: decoder, encoder, scaler, and AI cores. You see this in Figure 5, with the encoder load at 99%, achieved when transcoding four simultaneous files (INST is short for instance). In this configuration, the system transcoded 640 frames per second. When I evaluated five simultaneous files, the throughput dropped to 632 frames per second. Accordingly, four simultaneous jobs were the most efficient configuration, and the Quadra encoded the four 12-minute source files in 2:15 (min:sec).
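The 640 frames per second figure can be cross-checked against the source duration and the elapsed time. This is a small sketch with a hypothetical helper name, `aggregate_fps`, using only the numbers quoted above (four 12-minute 1080p30 files in 135 seconds).

```python
def aggregate_fps(files, minutes_per_file, source_fps, elapsed_seconds):
    """Total frames transcoded across all simultaneous jobs, divided by wall-clock time."""
    total_frames = files * minutes_per_file * 60 * source_fps
    return total_frames / elapsed_seconds


quadra_fps = aggregate_fps(files=4, minutes_per_file=12, source_fps=30, elapsed_seconds=135)
print(f"Quadra aggregate throughput: {quadra_fps:.0f} fps")
```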
Figure 6 shows CPU utilization during this processing. The workstation has six cores and twelve threads, so the theoretical capacity is 1200%. You see that each Quadra instance of FFmpeg required about 5.5%, totaling just over 22% of CPU usage, or under 2% of the available 1200%. That’s because Quadra decodes the incoming H.264 file and transcodes the output file with on-board hardware, using minimal system resources. This low CPU usage will become relevant in a few moments.
Contrast this with the CPU utilized by FFmpeg when encoding using the x264 codec, as shown in Figure 7. Here, CPU utilization totaled 1195%, or 99.6% of the available CPU resources. While the most efficient configuration for x264 was three simultaneous files, just one less than Quadra, the encoding time jumped from 2:15 with Quadra to 7:45 for FFmpeg and x264.
Figure 7. With three simultaneous transcodes with x264, FFmpeg consumed 99.6% of the available resources.
This data feeds the calculations shown in Table 2. Quadra produced 48 minutes of video, or 80% of an hour, in 2:15, or 135 seconds. This means that Quadra can produce an hour of video in just under 169 seconds. As there are 86,400 seconds each day, this translates to 512 hours of encoded video per day.
System | Video Minutes | Files | % of hour | Encoding Time (seconds) | Seconds/hour of encoded video | Seconds/day | Encoded Hours per day |
Quadra | 12 | 4 | 80.00% | 135 | 168.75 | 86,400 | 512 |
CPU Only | 12 | 3 | 60.00% | 465 | 775 | 86,400 | 111 |
Table 2: Throughput in hours per day with Quadra and CPU-only with x264.
In contrast, encoding with the CPU-only and x264, FFmpeg produced 36 minutes of video, 60% of an hour, in 7:45, or 465 seconds. This translates to an hour of encoded video every 775 seconds, or 111 hours of encoded video per day. This means that Quadra produces about 4.6x the throughput of CPU-only transcoding with a single Quadra in the server.
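The arithmetic behind Table 2 can be reproduced in a few lines of Python. The function name `encoded_hours_per_day` is a hypothetical helper; the inputs are the file counts, durations, and encoding times reported above, and the sketch assumes the hardware runs at that rate around the clock.

```python
SECONDS_PER_DAY = 86_400


def encoded_hours_per_day(files, minutes_per_file, encode_seconds):
    """Hours of finished video produced per day if the measured rate is sustained 24/7."""
    hours_produced = files * minutes_per_file / 60
    return SECONDS_PER_DAY * hours_produced / encode_seconds


quadra_hours = encoded_hours_per_day(files=4, minutes_per_file=12, encode_seconds=135)
cpu_hours = encoded_hours_per_day(files=3, minutes_per_file=12, encode_seconds=465)

print(f"Quadra:   {quadra_hours:.0f} hours/day")
print(f"CPU only: {cpu_hours:.0f} hours/day")
print(f"Ratio:    {quadra_hours / cpu_hours:.1f}x")
```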
Table 3 translates these figures into a three-year financial comparison. Here are the assumptions:
- Server cost – $5,000
- Quadra cost – $1,500
- Annual power cost (500-watt draw at $0.25 per kWh) – $1,200
- Monthly co-location cost for 1RU of rack space – $75
The top line shows that the Quadra-equipped server costs $12,800 over three years but produces 560,640 hours of video for a cost per encoded hour of $0.0228. The second line shows that the CPU-only system costs $1,500 less but outputs only 122,075 hours of video, for a cost per hour of $0.0926.
The third line shows the hardware cost for the CPU-only systems to match the Quadra output, assuming that you could buy 4.6 systems, which obviously you can’t. This drives the total spend to $51,896, though the cost per hour obviously stays the same.
So, to produce 560,640 hours of encoded video over three years, you’d spend $12,800 for the Quadra-based system or $51,896 to produce via CPU-only. At these transcoding levels, Quadra delivers a 75% savings.
Three year | Capex | 3 year- OPEX | 3 Year Colo | Total | Hours Encoded | Cost/Hour |
Quadra | $6,500 | $3,600 | $2,700 | $12,800 | 560,640 | $0.0228 |
CPU Only | $5,000 | $3,600 | $2,700 | $11,300 | 122,075 | $0.0926 |
Match Quadra | $22,963 | $16,533 | $12,400 | $51,896 | 560,640 | $0.0926 |
Table 3. Three-year cost comparison, Quadra vs. CPU-only.
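For readers who want to plug in their own prices, here is a minimal sketch of the three-year model behind Table 3. The `System` class and its field names are hypothetical; the default values are the assumptions listed above, and the daily throughput figures come from Table 2.

```python
from dataclasses import dataclass

YEARS = 3


@dataclass
class System:
    capex: float               # up-front hardware cost in dollars
    hours_per_day: float       # encoded hours of video per day
    annual_power: float = 1200.0
    monthly_colo: float = 75.0

    def total_cost(self):
        """Capex plus three years of power and co-location."""
        return self.capex + YEARS * self.annual_power + YEARS * 12 * self.monthly_colo

    def total_hours(self):
        return self.hours_per_day * 365 * YEARS

    def cost_per_hour(self):
        return self.total_cost() / self.total_hours()


quadra = System(capex=5000 + 1500, hours_per_day=512)
cpu_only = System(capex=5000, hours_per_day=86_400 / 775)

# Fractional number of CPU-only systems needed to match the Quadra output.
cpu_systems = quadra.total_hours() / cpu_only.total_hours()
match_cost = cpu_systems * cpu_only.total_cost()
savings = 1 - quadra.total_cost() / match_cost

print(f"Quadra:   ${quadra.total_cost():,.0f}, ${quadra.cost_per_hour():.4f}/hour")
print(f"CPU only: ${cpu_only.total_cost():,.0f}, ${cpu_only.cost_per_hour():.4f}/hour")
print(f"Match Quadra: {cpu_systems:.2f} systems, ${match_cost:,.0f}, savings {savings:.0%}")
```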
Boosting Capacity with the Quadra Video Server
Table 4 shows a higher-end use case where the publisher buys a Quadra server with ten T1Us installed. As we discussed around Figure 6, since the CPU required for each Quadra FFmpeg instance is so low, a server can easily support ten T1Us or even more without reducing the throughput of each T1U. This means a system with ten T1Us can produce 10x the throughput in the same 1RU footprint and with the same three-year OPEX and Colo cost. The Quadra Video Server costs $21,000 and has a three-year cost total of $27,300, with a cost per hour of $0.0049.
To match this with a CPU-only system, you’d have to purchase 45.93 systems, which we’ll assume you can do to keep the numbers clean. Your cost per hour is the same as Table 3, but your three-year spending jumps to $518,963, or $491,663 more than the Quadra system. At these load levels, the Quadra Video Server delivers about 95% savings.
Three year | Capex | 3 year- OPEX | 3 Year Colo | Total | Hours Encoded | Cost/Hour |
Quadra Server | $21,000 | $3,600 | $2,700 | $27,300 | 5,606,400 | $0.0049 |
CPU Only | $229,630 | $165,333 | $124,000 | $518,963 | 5,606,400 | $0.0926 |
Table 4. Comparing the Quadra server with CPU-only transcoding.
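The Table 4 comparison keeps a fractional server count to keep the numbers clean. The sketch below reproduces that figure and also shows what happens if you round up to whole servers, which is an extra step not in the table. The function name `compare_server` and the constants are hypothetical names; the values come from the assumptions and measurements above.

```python
import math

VPU_HOURS_PER_DAY = 512                   # one T1U, from Table 2
CPU_HOURS_PER_DAY = 86_400 / 775          # one CPU-only server, from Table 2
OPEX_AND_COLO_3YR = 3 * 1200 + 36 * 75    # three years of power plus co-location per 1RU server


def compare_server(vpus, quadra_capex, cpu_server_capex=5000):
    """Three-year cost of a Quadra server vs. the CPU-only servers needed to match it."""
    cpu_servers = vpus * VPU_HOURS_PER_DAY / CPU_HOURS_PER_DAY
    quadra_total = quadra_capex + OPEX_AND_COLO_3YR
    per_cpu_server = cpu_server_capex + OPEX_AND_COLO_3YR
    return {
        "cpu_servers": cpu_servers,
        "quadra_total": quadra_total,
        "cpu_total": cpu_servers * per_cpu_server,
        # Assumption: in practice you can only buy whole servers.
        "cpu_total_whole": math.ceil(cpu_servers) * per_cpu_server,
    }


result = compare_server(vpus=10, quadra_capex=21_000)
savings = 1 - result["quadra_total"] / result["cpu_total"]

print(f"CPU-only servers needed: {result['cpu_servers']:.2f}")
print(f"Quadra server total:     ${result['quadra_total']:,.0f}")
print(f"CPU-only total:          ${result['cpu_total']:,.0f}")
print(f"With whole servers:      ${result['cpu_total_whole']:,.0f}")
print(f"Savings:                 {savings:.0%}")
```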
In short, Quadra delivers slightly higher quality than x264 medium in the tested configuration while saving as much as 95% of CAPEX and OPEX in high-volume use cases. You can see why YouTube and Meta produced their own ASICs.
That said, neither company sells its ASIC-based transcoders on the open market. If you’re a high-volume UGC site, you can achieve similar benefits by deploying the NETINT Quadra VPU, either as a standalone device(s) or integrated into the Quadra Video Server.