This article details a methodology for comparing hardware transcoders on cost per stream, watts per stream, and output quality.
If you’ve ever benchmarked software codecs, you know the quality/throughput tradeoff; simply stated, the higher the quality, the lower the throughput. In contrast, for many first-generation hardware encoders, throughput was prioritized, but the quality was fixed; you got what you got.
Finding the Key Quality Controls
Most next-gen hardware encoders offer presets or other switches to optimize quality at a cost to throughput that can be even more striking than with software. In comparing specifications for encoders, remember the quality/throughput tradeoff. And when you see quality stats, think, “hmm, at what throughput?” Or, if you see throughput stats, ask, “at what quality?”
Whenever you test a hardware encoder, you should start by identifying the configuration options that most impact quality and throughput and then test across a range of configurations to get a sense of the performance/quality tradeoff. If you plug in pricing and power consumption figures, you can also easily compute the cost per stream and watts per stream. This is the CAPEX and OPEX side of the equation.
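The per-stream arithmetic is simple division, sketched below. The price, power draw, and stream counts are placeholders I made up for illustration, not vendor figures; substitute your own quotes and measurements.

```python
def cost_per_stream(unit_price: float, streams: int) -> float:
    """CAPEX share of each stream: purchase price divided by stream count."""
    if streams <= 0:
        raise ValueError("streams must be positive")
    return unit_price / streams


def watts_per_stream(unit_watts: float, streams: int) -> float:
    """OPEX proxy: power draw under load divided by stream count."""
    if streams <= 0:
        raise ValueError("streams must be positive")
    return unit_watts / streams


# Hypothetical inputs, NOT measured or quoted values.
UNIT_PRICE = 1500.0  # assumed price of one transcoder
UNIT_WATTS = 70.0    # assumed power draw of one transcoder under load

# Assumed stream counts at two different quality settings.
for label, streams in [("higher quality", 10), ("higher throughput", 35)]:
    print(
        f"{label}: {cost_per_stream(UNIT_PRICE, streams):.2f} per stream, "
        f"{watts_per_stream(UNIT_WATTS, streams):.2f} W per stream"
    )
```

Because both figures divide by the stream count, any setting that lowers throughput raises cost per stream and watts per stream by the same proportion.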
Then you can choose the “operating point” that delivers the optimum blend of quality and throughput for your applications. When comparing multiple encoders, you should perform the same analysis for each to enable a complete apples-to-apples comparison.
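One way to make the choice concrete is to set a quality floor and take the highest-throughput configuration that clears it. The sketch below assumes that rule; the `Config` type, the labels, and every number in `candidates` are hypothetical and do not come from the benchmark.

```python
from typing import NamedTuple, Optional


class Config(NamedTuple):
    label: str
    vmaf: float   # quality score measured for this configuration
    streams: int  # simultaneous streams at this configuration


def pick_operating_point(configs: list[Config], min_vmaf: float) -> Optional[Config]:
    """Return the highest-throughput config whose quality meets the floor.

    Ties on throughput are broken in favor of higher quality.
    Returns None when no configuration is good enough.
    """
    eligible = [c for c in configs if c.vmaf >= min_vmaf]
    if not eligible:
        return None
    return max(eligible, key=lambda c: (c.streams, c.vmaf))


# Made-up test results for three settings combinations.
candidates = [
    Config("A", vmaf=95.0, streams=16),
    Config("B", vmaf=92.0, streams=24),
    Config("C", vmaf=88.0, streams=36),
]

print(pick_operating_point(candidates, min_vmaf=90.0))
```

A different rule, such as the lowest cost per stream above the floor, fits the same structure by changing the `key` function.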
Recently, I benchmarked the performance of NETINT’s Quadra Video Processing Unit (VPU) against the NVIDIA T4. In this post, I’ll review the testing and the Quadra results to give you a feel for the hardware evaluation process. In a future post, I’ll review the NVIDIA findings and compare the two.
Briefly, Quadra is NETINT’s newest ASIC-based transcoder, called a VPU, because it has onboard decoding, scaling, encoding, and overlay, plus an 18 TOPS AI engine. The VPU can create encoded bitstreams in H.264, HEVC, and AV1.
Quadra has two major configuration options that impact quality: the lookahead buffer and rate-distortion optimization.
Briefly, the lookahead buffer allows the encoder to look at frames ahead of the frame being encoded, so it knows what’s coming and can make more intelligent decisions. This improves encoding quality, particularly at and around scene changes, and it can improve bitrate efficiency. However, lookahead adds latency equal to the lookahead duration, and it can decrease throughput.
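Since the added latency equals the lookahead duration, it follows directly from the lookahead depth and the frame rate. A minimal sketch, assuming a 30 fps source (the function name is mine):

```python
def lookahead_latency_seconds(lookahead_frames: int, fps: float) -> float:
    """Latency added by buffering `lookahead_frames` frames before encoding."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    if lookahead_frames < 0:
        raise ValueError("lookahead_frames must not be negative")
    return lookahead_frames / fps


# Assumes 30 fps input; adjust for your own source frame rate.
for frames in (40, 20, 0):
    print(f"lookahead {frames:>2} frames: {lookahead_latency_seconds(frames, 30.0):.2f} s")
```

At 30 fps, a 40-frame lookahead works out to about 1.33 seconds and a 20-frame lookahead to about 0.67 seconds.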
Table 1 shows the impact of a 40-frame lookahead buffer when encoding to the H.264 format. Without lookahead, the top-line harmonic mean VMAF score is 2.3 points lower, which is borderline significant, and the low-frame differential of almost 16 points could predict transient problems that might be apparent to some viewers. On the other hand, in addition to injecting 1.3 seconds of latency into the process, the lookahead cuts throughput by 33%, from 36 1080p streams to 24.
Rate-distortion optimization (RDO) functions like most presets and adjusts several parameters that impact both quality and throughput, with higher values increasing quality and reducing throughput. With H.264 output, Quadra offers one level of RDO, while with HEVC, there are three levels: 1, 2, and 3.
H.264 Performance, Cost, and Power Consumption
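For readers unfamiliar with the two quality metrics used here, this sketch computes both from a list of per-frame VMAF scores. It uses a plain harmonic mean from Python's standard library, and scoring tools may compute theirs slightly differently. The sample scores are invented.

```python
from statistics import harmonic_mean


def summarize_vmaf(frame_scores: list[float]) -> dict[str, float]:
    """Summarize per-frame VMAF scores.

    harmonic_mean: overall score; it is pulled down by poor frames more
                   than an arithmetic mean would be.
    low_frame:     the single worst frame, a hint of transient problems.
    """
    if not frame_scores:
        raise ValueError("frame_scores must not be empty")
    return {
        "harmonic_mean": harmonic_mean(frame_scores),
        "low_frame": min(frame_scores),
    }


# Invented per-frame scores with one poor frame, e.g. at a scene change.
scores = [95.0, 96.0, 94.0, 60.0, 95.0]
summary = summarize_vmaf(scores)
print(f"harmonic mean: {summary['harmonic_mean']:.2f}, low frame: {summary['low_frame']:.2f}")
```

Comparing both numbers across two configurations shows why a small difference in the overall score can coexist with a large difference in the worst frame.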
Table 2 shows the range of H.264 options tested during the recent benchmarking. LA is lookahead, and I tested three values: 40, 20, and 0. I also tested with RDO on and off. To provide some perspective on quality, the x264 Quality Equivalent column shows the quality of x264 encodes made with the same parameters at the presets shown.
At the highest quality setting, Quadra’s output quality slightly exceeded that of x264 using the slow preset, and the unit produced 16 1080p streams. You see that dropping the lookahead from 40 to 20 with RDO disabled had little impact on quality or throughput but cut latency by 0.66 sec, making that choice easy for latency-sensitive events.
At the lowest possible quality setting, Quadra’s quality dropped to slightly better than veryfast quality, which is often the x264 preset used for live applications to ensure at least nominal throughput with CPU-only transcoding.
At this quality level, the VPU outputs 36 1080p streams. By adding the cost per stream and watts per stream data, you will get a true feel for the comparative CAPEX and OPEX costs produced by all settings combinations.
HEVC Performance, Cost, and Power Consumption
Table 3 shows the same data for HEVC transcoding using the same lookahead options and RDO at 1, 2, 3, and 0 (disabled). At the highest quality levels, Quadra’s output quality nearly matched x265 using the slow preset, but the unit produced only four streams. At the other end of the spectrum, output quality nearly matched x265 using the veryfast preset, and Quadra produced 40 1080p30 streams, four more than with the H.264 format.
There are several new hardware encoders coming, and their launches will be accompanied by aggressive claims about quality and throughput. My recommendation is not to assume that the same settings were used for both. In short, you’d better do your own testing. Trust but verify comes to mind.
When you perform your own testing, remember the methodology explained above:
- Identify the most critical quality-related options for your specific application.
- Test across a range of configurations from high quality/low throughput to low quality/high throughput.
- Compute quality, cost per stream, and watts per stream at the operating point to compare against other technologies. Remember to factor in the CAPEX of the additional servers required to run a software encoding service.
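To factor in the extra servers, size the fleet each technology needs for the same workload before comparing CAPEX. The sketch below does this with a ceiling division. The stream counts and prices are hypothetical placeholders, not benchmark results.

```python
import math


def units_needed(target_streams: int, streams_per_unit: int) -> int:
    """Whole number of units (cards or servers) needed to carry the workload."""
    if streams_per_unit <= 0:
        raise ValueError("streams_per_unit must be positive")
    return math.ceil(target_streams / streams_per_unit)


def fleet_capex(target_streams: int, streams_per_unit: int, unit_price: float) -> float:
    """Total purchase cost for enough units to carry the workload."""
    return units_needed(target_streams, streams_per_unit) * unit_price


# Hypothetical workload and per-unit figures for two technologies.
TARGET_STREAMS = 100
options = {
    "hardware transcoder": {"streams_per_unit": 30, "unit_price": 1500.0},
    "software-only server": {"streams_per_unit": 8, "unit_price": 5000.0},
}

for name, spec in options.items():
    count = units_needed(TARGET_STREAMS, spec["streams_per_unit"])
    total = fleet_capex(TARGET_STREAMS, spec["streams_per_unit"], spec["unit_price"])
    print(f"{name}: {count} units, total {total:.2f}")
```

Rounding up matters: a partly used unit still has to be bought, so per-stream cost at your real workload can be higher than the per-unit figure suggests.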
In the next post, we’ll share quality results from the NVIDIA T4 GPU and compare them to Quadra.