This case study compares CAPEX, power consumption, and co-location costs for GPU, CPU, and ASIC-based transcoding.
This case study was derived from Ilya Mikhaelis’ recent talk at NETINT’s symposium on Building Your Live Streaming Cloud. Images have been consolidated and modified slightly. You can view Ilya’s talk and download his slides here: bit.ly/vovs.
Ilya Mikhaelis is the streaming backend tech lead for Mayflower, which builds and hosts streaming infrastructures for multiple publishers. Mayflower’s infrastructure handles over 10,000 incoming streams and over one million outgoing streams at a latency that averages one to two seconds.
Ilya’s challenge was to find the most cost-effective technology to transcode the incoming streams. His journey took him from CPU-based transcoding to GPU and then to two generations of ASIC-based transcoding. These transitions slashed total production transcoding costs from $10 million to just under $1.5 million while reducing power consumption by nearly 90%, from 325,000 watts to 33,820 watts.
Ilya’s rigorous, textbook-worthy testing methodology and findings are invaluable to any video engineer seeking the highest quality transcoding technology at the lowest capital cost and most efficient power usage. But let’s start at the beginning.
The Mayflower Internal CDN
As Ilya describes it, “Mayflower is a big company, under which different projects stand. And most of these projects are about high-load, live media streaming. Moreover some of Mayflower resources were included in the top 50 of the most visited sites worldwide. And all these streaming resources are handled by one internal CDN, which was completely designed and implemented by my team.”
Describing the requirements, Ilya added, “The typical load of this CDN is about 10,000 incoming simultaneous streams and more than one million outgoing simultaneous streams worldwide. In most cases, we target a latency of one to two seconds. We try to achieve a real-time experience for our content consumers, which is why we need a fast and effective transcoding solution.”
To build the CDN, Mayflower chose bare metal servers to maximize network and resource utilization, and runs them on a high-performance profile to keep stream processing stable and encoder and decoder queues near zero. As shown in Figure 1, the CDN ingests streams via WebRTC and RTMP and delivers them with a mix of WebRTC, HLS, and low latency HLS. It uses customized WebRTC inside the CDN to minimize latency between servers.
Figure 1. Mayflower’s Low Latency CDN.
Ilya’s team minimizes resource wastage by implementing all high-level network protocols, like WebRTC, HLS, and low latency HLS, themselves. They use the libav libraries from FFmpeg as the framework for transcoding inside their transcoder servers.
The Transcoding Pipeline
In Mayflower’s transcoding pipeline (Figure 2), the system inputs a single WebRTC stream, which it converts to a five-rung encoding ladder. Mayflower uses a mixture of proprietary and libav filters to achieve a stable frame rate and stable load. The stable frame rate is essential for outgoing streams because some protocols, like HLS and low latency HLS, can’t handle variable frame rates, especially on Apple devices.
Figure 2. Mayflower’s transcoding pipeline.
CPU-Only Transcoding – Too Expensive, Too Much Power
After creating the architecture, Ilya had to find a transcoding technology as quickly as possible. Mayflower initially transcoded on a Dell R940, which currently costs around $20,000 as configured for Mayflower. When Ilya’s team first implemented software transcoding, most content creators input at 720p. After a few months, as they became more familiar with the production operation, most switched to 1080p, dramatically increasing the transcoding load.
Figure 3 shows the numbers. Each server could produce only 20 streams, which at a server cost of $20,000 meant a per-stream cost of $1,000. At this capacity, scaling up to handle the 10,000 incoming streams would require 500 servers at a total cost of $10,000,000.
Total power consumption would equal 500 servers x 650 watts each, or 325,000 watts. The Dell R940 is a 3U server; at an estimated colocation cost of $125 per server per month, housing 500 servers would add $750,000 per year.
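The CPU-only arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, not Mayflower's tooling: the function name `fleet_estimate` and its parameter names are invented for illustration, and the inputs are the figures quoted in the article.

```python
import math


def fleet_estimate(total_streams, streams_per_server, server_cost_usd,
                   watts_per_server, colo_usd_per_server_month=125):
    """Size a transcoding fleet and estimate CAPEX, power, and colocation.

    Hypothetical helper; all inputs come from the article's figures.
    """
    # A partially used server still has to be bought, so round up.
    servers = math.ceil(total_streams / streams_per_server)
    return {
        "servers": servers,
        "cost_per_stream_usd": server_cost_usd / streams_per_server,
        "capex_usd": servers * server_cost_usd,
        "total_watts": servers * watts_per_server,
        "colo_usd_per_year": servers * colo_usd_per_server_month * 12,
    }


# CPU-only Dell R940: 20 streams, $20,000, and 650 watts per server.
cpu_only = fleet_estimate(10_000, 20, 20_000, 650)
for key, value in cpu_only.items():
    print(f"{key}: {value:,}")
```

The same function can be reused for any other option by swapping in its per-server capacity, cost, and power draw.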
Figure 3. CPU-only transcoding was very costly and consumed excessive power.
These numbers caused Ilya to pause and reassess. “After all these calculations, we understood that if we wanted to play big, we would need to find a cheaper transcoding solution than CPU-only with higher density per server, while maintaining low latency. So, we started to research and found some articles on companies like Wowza, Xilinx, Google, Twitch, YouTube, and so on. And the first hint was GPU. And when you think GPU, you think NVIDIA, a company all streaming engineers are aware of.”
GPUs – Better, But Still Too Expensive
Ilya initially considered three NVIDIA products: the Tesla V100, Tesla P100, and Tesla T4. The first two, he concluded, were best for machine learning, leaving the T4 as the most relevant option. Mayflower could install six T4s into each existing Dell server. At a current cost of around $2,000 for each T4, the six cards added $12,000 to the $20,000 server, for a total cost of $32,000 per server.
Under capacity testing, the T4-enabled system produced 96 streams, dropping the per-stream cost to $333. This also reduced the required number of servers to 105 and the total CAPEX cost to $3,360,000.
With the T4s installed, per-server power consumption increased to 1,070 watts, for a fleet total of 112,350 watts. At $125 per month per server, the 105 servers would cost $157,500 annually to house in a colocation facility.
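The T4 figures follow from the per-server build-up and one rounding step, sketched below. The constant names are chosen here for illustration; the values are those stated in the article.

```python
import math

BASE_SERVER_USD = 20_000        # Dell R940 as configured for Mayflower
T4_CARD_USD = 2_000             # approximate current price per T4
CARDS_PER_SERVER = 6
STREAMS_PER_SERVER = 96         # capacity measured with six T4s installed
WATTS_PER_SERVER = 1_070
TOTAL_STREAMS = 10_000
COLO_USD_PER_SERVER_MONTH = 125

server_cost = BASE_SERVER_USD + CARDS_PER_SERVER * T4_CARD_USD
# 10,000 / 96 is about 104.17, so 104 servers would fall short; round up.
servers = math.ceil(TOTAL_STREAMS / STREAMS_PER_SERVER)
cost_per_stream = server_cost / STREAMS_PER_SERVER
capex = servers * server_cost
total_watts = servers * WATTS_PER_SERVER
colo_per_year = servers * COLO_USD_PER_SERVER_MONTH * 12

print(f"server cost:     ${server_cost:,}")
print(f"servers needed:  {servers}")
print(f"cost per stream: ${cost_per_stream:,.0f}")
print(f"CAPEX:           ${capex:,}")
print(f"total power:     {total_watts:,} W")
print(f"colocation:      ${colo_per_year:,} per year")
```

Note that per-server power rises from 650 to 1,070 watts, yet fleet power still falls by roughly two thirds because the server count drops from 500 to 105.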
Figure 4. Capacity and costs for an NVIDIA T4-based solution.
Round 1 ASICs: The NETINT T432
The NVIDIA numbers were better, but as Ilya commented, “It looked like we found a possible candidate, but we had a strong sense that we needed to further our research. We decided to continue our journey and found some articles about a company named NETINT and their ASIC-based solutions.”
Mayflower first ordered and tested the T432 video transcoder, which contains four NETINT G4 ASICs in a single PCIe card. As detailed by Ilya, “We received the T432 cards, and the results were quite exciting because we produced about 25 streams per card. Power consumption was much lower than NVIDIA, only 27 watts per card, and the cards were cheaper. The whole server produced 150 streams in full HD quality, with a power consumption of 812 watts. For the whole production, we’ll pay about 2 million, which is much cheaper than NVIDIA solution.”
Figure 5 shows this data. The total number of T432-powered servers drops to 67, which reduces total power to 54,404 watts and annual colocation to $100,500.
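As a rough check on the T432 numbers, the sketch below rebuilds the server count, fleet power, and colocation cost from the per-card figures Ilya quoted. It assumes six cards per server, which matches 150 streams at about 25 streams per card; the constant names are illustrative only.

```python
import math

TOTAL_STREAMS = 10_000
CARDS_PER_SERVER = 6            # assumed from 150 streams at ~25 per card
STREAMS_PER_CARD = 25
WATTS_PER_CARD = 27
WATTS_PER_SERVER = 812          # whole server under load, per the article
COLO_USD_PER_SERVER_MONTH = 125

streams_per_server = CARDS_PER_SERVER * STREAMS_PER_CARD
servers = math.ceil(TOTAL_STREAMS / streams_per_server)
total_watts = servers * WATTS_PER_SERVER
colo_per_year = servers * COLO_USD_PER_SERVER_MONTH * 12

# How much of each server's power draw comes from the transcoder cards.
card_watts = CARDS_PER_SERVER * WATTS_PER_CARD
card_share = card_watts / WATTS_PER_SERVER

print(f"streams per server: {streams_per_server}")
print(f"servers needed:     {servers}")
print(f"total power:        {total_watts:,} W")
print(f"colocation:         ${colo_per_year:,} per year")
print(f"cards draw {card_watts} W, about {card_share:.0%} of each server")
```

On these figures the six cards account for only about a fifth of each server's power draw; the rest goes to the host system.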
Figure 5. Capacity and costs for the NETINT T432 solution.
While costs and power consumption kept improving, Ilya noticed that the CDN’s internal queue started increasing when processing with T432-equipped systems. Initially, Ilya thought the problem was the lack of onboard scaling on the T432, but then he noticed that “even when producing all these ABR ladders, our CPU load was about only 40% during high load hours. The bottleneck was the card’s decoding and encoding capacity, not onboard scaling.”
Finally, he traced the growing internal queue to the T432’s decoder, which couldn’t sustain 4K60 decoding of H.264 input. This was unacceptable because it increased stream latency. Ilya went searching one last time; fortunately, the solution was close at hand.
Round 2 ASICs: The NETINT Quadra T2 – The Transcoding Monster
Ilya next started testing with the NETINT Quadra T2 video processing unit, or VPU, which contains two NETINT G5 chips in a PCIe card. As with the other cards, Ilya could install six in each Dell server.
Ilya’s team liked what they found. “All those disadvantages were eliminated in the new NETINT card – Quadra. It has a hardware scaler inside with an optimized pipeline: decoder – scaler – encoder in the same VPU. And H264 4K60 decoding is not a problem for it. We have already tested this card and have added servers with Quadra to our production. It really seems to be a transcoding monster.”
Figure 6 shows the performance and cost numbers. Equipped with the six T2 VPUs, each server could output 270 streams, reducing the number of required servers from 500 for CPU-only to a mere 38. This dropped the per-stream cost to $141, less than half that of the NVIDIA T4-equipped system, and cut the total CAPEX to $1,444,000. Total power consumption dropped to 33,820 watts, and annual colocation costs for the 38 3U servers were $57,000.
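The article gives fleet totals for the Quadra T2 but not the per-server price or wattage, so the sketch below works backwards from the totals. The implied per-server values are derived here, not quoted from Ilya's slides, and the reductions are measured against the CPU-only baseline.

```python
import math

TOTAL_STREAMS = 10_000
STREAMS_PER_SERVER = 270
TOTAL_CAPEX_USD = 1_444_000
TOTAL_WATTS = 33_820
COLO_USD_PER_SERVER_MONTH = 125

# CPU-only baseline from earlier in the article.
CPU_CAPEX_USD = 10_000_000
CPU_TOTAL_WATTS = 325_000

servers = math.ceil(TOTAL_STREAMS / STREAMS_PER_SERVER)  # 37.04 rounds up
server_cost = TOTAL_CAPEX_USD / servers        # implied, not quoted
watts_per_server = TOTAL_WATTS / servers       # implied, not quoted
cost_per_stream = server_cost / STREAMS_PER_SERVER
colo_per_year = servers * COLO_USD_PER_SERVER_MONTH * 12

capex_reduction = 1 - TOTAL_CAPEX_USD / CPU_CAPEX_USD
power_reduction = 1 - TOTAL_WATTS / CPU_TOTAL_WATTS

print(f"servers needed:      {servers}")
print(f"implied server cost: ${server_cost:,.0f}")
print(f"implied server draw: {watts_per_server:,.0f} W")
print(f"cost per stream:     ${cost_per_stream:,.0f}")
print(f"colocation:          ${colo_per_year:,} per year")
print(f"CAPEX reduction:     {capex_reduction:.1%}")
print(f"power reduction:     {power_reduction:.1%}")
```

Against the CPU-only baseline, this works out to roughly an 86% cut in CAPEX and just under a 90% cut in power.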
Figure 6. Capacity and costs for the NETINT Quadra T2 solution.
Cost and Power Summary
Figure 7 presents a summary of costs and power consumption, and the numbers speak for themselves. In Ilya’s words, “It is obvious that Quadra T2 dominates by all characteristics, and according to our team experience, it is the best transcoding solution on the market today.”
Figure 7. Summary of costs and power consumption.
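One way to compare the four options on a single axis is power per stream, which the article does not state directly. The sketch below derives it from the per-server capacity and power figures quoted above; the Quadra T2 per-server wattage is inferred from the fleet total of 33,820 watts across 38 servers, and the dictionary layout is purely illustrative.

```python
# Per-server capacity (streams) and power draw (watts) for each option.
options = {
    "CPU-only": {"streams": 20, "watts": 650},
    "NVIDIA T4": {"streams": 96, "watts": 1_070},
    "NETINT T432": {"streams": 150, "watts": 812},
    # Inferred: 33,820 W fleet total divided by 38 servers.
    "NETINT Quadra T2": {"streams": 270, "watts": 33_820 / 38},
}

watts_per_stream = {
    name: spec["watts"] / spec["streams"] for name, spec in options.items()
}
best = min(watts_per_stream, key=watts_per_stream.get)

for name, watts in sorted(watts_per_stream.items(), key=lambda kv: kv[1]):
    print(f"{name:<18}{watts:6.1f} W per stream")
print(f"lowest power per stream: {best}")
```

These are watts per stream of provisioned capacity at full load, so real-world figures will vary with utilization.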
Ilya also commented on the suitability of the Dell R940 system. “I want to emphasize that the DELL R940 isn’t the best server for VPU and GPU transcoders. It has a small density of PCIe slots and, as a result, a small density of VPU/GPU. Moreover, in the case of Quadra and even T432, you don’t need such powerful CPUs.”
In terms of other servers to consider, Ilya stated, “Nowadays, you may find platforms on the market with even 16 PCIe slots. In such systems, especially if you use Quadra, you don’t need powerful CPUs inside because everything is done inside the VPU. But for us, it was a legacy with which we needed to live.”
Video engineers seeking the optimal transcoding solution can take a lot from Ilya’s transcoding journey: a willingness to test a range of potential solutions, a rigorous focus on cost and power consumption per stream, and extreme attention to detail. The NETINT folks are confident that this approach will lead you to precisely the same conclusion as Ilya, that the Quadra T2 is “the best transcoding solution on the market today.”