Deep Thoughts on AI Codecs and Encoders

This post focuses on AI in preprocessing and encoding products. I’ll examine two aspects: how to weigh AI when evaluating encoding performance, and how to think about AI in the user interface and operation. I’ll conclude by discussing where I see AI going in codec development over the next few years.

For perspective, I recently spoke with more than twenty companies for a presentation I gave at NAB 2024, Beyond the Hype: A Critical Look at AI in Video Streaming. You can watch the presentation and download the handout here. That post also includes a list of all the companies I spoke with and links to the video interviews.

AI in Pre-Processing and Encoding – Performance

Products in this class use AI to deliver better compression efficiency. For example, these pre-processors and encoder controllers take very different approaches:

  • VisualOn’s Optimizer integrates directly with your encoder and uses AI-enhanced Content-Adaptive Encoding (CAE) to dynamically adjust encoder settings, frame by frame, for optimal bitrate and quality. You can also optionally pre-process video to improve contrast and clarity. VisualOn claims that the Optimizer can reduce bitrate “by an average of 40% and up to 70%.”
  • IMAX StreamSmart doesn’t preprocess video in any way. Rather, it integrates with your encoder to dynamically adjust the bitrate while maintaining quality based on the AI-driven quality metric IMAX ViewerScore. IMAX claims that StreamSmart can “reduce bitrate by 15% or more.”
  • Digital Harmonics KeyFrame is a pure preprocessing solution that transforms the bitstream before sending it to the encoder, with no KeyFrame/encoder integration. KeyFrame has two modes: Core and Boost. Core applies AI to preprocess video and reduce its entropy, simplifying the complexity of the video data to make it easier to encode. The optional Boost mode uses AI to enhance video appearance by de-noising, de-blocking, anti-aliasing, and upscaling. Digital Harmonics claims to deliver a “20% – 80% bandwidth reduction.”

These codecs and encoders also apply AI in different ways:

  • Boasting machine learning integration as far back as 2016, Harmonic’s EyeQ encoder uses machine learning algorithms to optimize compression based on the human visual system, a CAE system focusing “bits where and when they matter most for the viewer and reducing bits where they matter less. In this way, EyeQ uses only the bits needed to hit quality targets, translating into more consistent VQ and significant bandwidth savings.” Harmonic claims to deliver bitrate savings of “up to 50%.”
  • Visionular CEO Zoe Liu has been publishing AI-related papers since 2018, if not earlier. Not surprisingly, Visionular’s codecs use AI in multiple ways, including AI-driven CAE, Intelligent Scene Classification, and Region-of-Interest CAE, which maintains high quality in important regions within the frame while reducing the overall bitrate. Visionular claims to “slash storage and CDN costs by > 50%.”
  • Codec Market, a cloud encoder, leverages the AI in Netflix’s VMAF metric to apply adaptive preprocessing and adjust the bitrate, delivering bandwidth savings while maintaining the quality target (a rough sketch of this kind of metric-guided bitrate search appears after this list). With adaptive preprocessing, Codec Market claims to deliver savings “as high as 50, 60%.”
  • Deep Render bills itself as “the world’s only AI-based compression company.” Whereas all the other preprocessing and encoding solutions use AI to produce a more efficient but fully standards-compliant bitstream that plays on existing hardware decoders in smart TVs, mobile devices, and computers, Deep Render is building a completely new codec that will require generic neural processing units (NPUs) for decode. Deep Render claims to already be “45% more efficient than VVC.”
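
To make that metric-guided pattern concrete, here is a minimal sketch of the kind of quality-targeted bitrate search these products automate, written in Python around FFmpeg with libvmaf. It is an illustration only, not any vendor’s actual method; the target score, step size, encoder settings, and file names are placeholders I chose for the example.

```python
import re
import subprocess

SOURCE = "source.mp4"    # placeholder input file
TARGET_VMAF = 93.0       # placeholder quality target

def encode(bitrate_kbps: int, out_path: str) -> None:
    """Encode the source at a fixed bitrate with x264 (a stand-in for any encoder)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", SOURCE, "-c:v", "libx264",
         "-b:v", f"{bitrate_kbps}k", "-an", out_path],
        check=True, capture_output=True)

def vmaf(distorted: str) -> float:
    """Score an encode against the source; assumes an ffmpeg build with libvmaf."""
    result = subprocess.run(
        ["ffmpeg", "-i", distorted, "-i", SOURCE,
         "-lavfi", "libvmaf", "-f", "null", "-"],
        capture_output=True, text=True)
    match = re.search(r"VMAF score: ([\d.]+)", result.stderr)
    return float(match.group(1)) if match else 0.0

# Walk the bitrate down until quality drops below the target -- a crude,
# brute-force version of what CAE products do per scene or per frame.
bitrate, lowest_passing = 6000, None
while bitrate >= 1000:
    encode(bitrate, "trial.mp4")
    if vmaf("trial.mp4") < TARGET_VMAF:
        break
    lowest_passing = bitrate   # still meets the target, so try going lower
    bitrate -= 500

print(f"Lowest tested bitrate that held ~{TARGET_VMAF} VMAF: {lowest_passing} kbps")
```

The commercial products earn their keep by making these decisions per scene or per frame in a single pass, rather than brute-forcing full re-encodes as this toy loop does.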

Several points become clear after compiling this data.

  • In most products or services, the AI component is challenging, if not impossible, to measure. There are few AI switches you can enable or disable.
  • From a buying perspective, whole product performance is more important than the contribution delivered by AI. If all of IMAX’s 15% bandwidth improvement was delivered via AI (it wasn’t), but only 5% of EyeQ’s 50% improvement was AI-driven (it wasn’t), is IMAX a better product?

Obviously not. This is why during my presentation, I advised attendees to ignore AI when evaluating these products. Instead, evaluate whole product performance as you always have.

Figure 1. When it comes to encoding performance, ignore AI-related claims.

AI in Pre-Processing and Encoding – Interface Design

After mulling over this advice and wandering around the NAB show floor, I realized that the same “ignore-it” recommendation didn’t apply to interface design and usability. Rather, it became clear that many of the obvious and useful AI-related advancements would relate to interface design, where the AI component is more distinct. Here, it’s important not to ignore AI; you should consider the implications very carefully.

Figure 2. Generative AI in Telestream Vantage. Click to see the image at full resolution in another browser window.

What prompted this conclusion was a demo where a beta version of Telestream Vantage built encoding workflows from plain English prompts. You see this in Figure 2, where the text on the upper right created the workflow shown in the middle of the screen. Click the figure to view it at full resolution, and click here to watch the three-minute video demo.

I started thinking about the different user interface generations and the skills they usurped. Generation 1 was command-line tools like FFmpeg, driven by scripts. Generation 2 was GUI-based programs like Sorenson Squeeze, Handbrake, and dozens of others. Because of the GUI, you no longer needed to be a programmer to encode files.

Next came the drag-and-drop UI, which supplemented the GUI for workflow programs like Vantage and, later, id3as Norsk Studio. Drag-and-drop operation reduced the learning curve, allowing even those new to the program to build complex workflows.

Vantage’s generative AI addition reduces the learning curve even further. This is good if you’re a newbie in the space; it’s bad if you’re the gray hair who knows every nook and cranny of the Vantage interface. However, from a compression perspective, this appears mechanical as opposed to contextual. I’m guessing that Vantage doesn’t know which ladder to use for animation vs. sports, which codecs to use for mobile vs. smart TVs, and which DRMs to apply for delivery to the browser as compared to mobile.

Table 1. Generations of encoder interface design.

But that’s coming, and once it’s here, it will begin to usurp true compression-related knowledge. Does this sound far-fetched? I hope so, but remember how many producers used the encoding ladder in TN2224 verbatim just because Apple suggested it? By NAB 2025, some encoder from some company (Ateme? Harmonic? AWS Elemental? Dolby?) will present a questionnaire covering:

  • Source material
  • Target audience
  • Expected number of views
  • Requirements for captions and DRM
  • Preferences for bandwidth savings or minimized transcode costs

and several others. Then, the encoder will recommend a contextually accurate preset complete with ladder definition, codec and ABR format selection, HDR standards, captions, and DRM. A simple implementation of this feature wouldn’t even require AI, just an “if this, then that” collection of logic (for more on this, see Is the AI Powering Your AI-Powered Gear Really AI?). However, AI would add much more flexibility and accuracy.
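
Here’s a hedged sketch of what that non-AI version might look like: a page of “if this, then that” rules mapping questionnaire answers to a preset. Every category, threshold, and preset name below is invented for illustration; no shipping encoder that I know of works exactly this way.

```python
from dataclasses import dataclass

@dataclass
class Answers:
    """Hypothetical questionnaire responses."""
    content_type: str     # "animation", "sports", "talking_head", ...
    primary_device: str   # "mobile", "smart_tv", or "browser"
    expected_views: int
    needs_drm: bool
    priority: str         # "bandwidth" or "transcode_cost"

def recommend(a: Answers) -> dict:
    """Toy rule-based preset picker -- plain if/then logic, no AI required."""
    preset = {"codec": "H.264", "ladder": "standard_720p_ladder", "drm": None}

    # High view counts justify the extra transcode cost of a more efficient codec.
    if a.expected_views > 100_000 and a.priority == "bandwidth":
        preset["codec"] = "HEVC" if a.primary_device == "smart_tv" else "AV1"

    # Low-motion content gets a sparser, lower-bitrate ladder; sports gets 60 fps rungs.
    if a.content_type in ("animation", "talking_head"):
        preset["ladder"] = "low_motion_ladder"
    elif a.content_type == "sports":
        preset["ladder"] = "high_motion_1080p60_ladder"

    # DRM choice keyed to the playback platform.
    if a.needs_drm:
        preset["drm"] = "FairPlay + Widevine" if a.primary_device == "mobile" else "Widevine"

    return preset

print(recommend(Answers("sports", "smart_tv", 500_000, True, "bandwidth")))
```

Rules like these are brittle, which is exactly where an AI layer could add the flexibility and accuracy mentioned above.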

Speaking for myself with my glass half empty, this type of functionality would almost certainly reduce the quantity of configuration-related consulting projects that I enjoy, and perhaps the demand for my next book, tentatively entitled “The Last Streaming Compression Book Ever Needed.” At the very least, it makes me think twice before publishing research on the web, where all the AI engines will suck it in and use it against me.

Look me up at NAB 2025, and we’ll see if I was right.

AI Codec Development Going Forward

Let’s briefly touch on where I see codecs going over the next few years. One of the companies I spoke with, InterDigital, offers an AI development tool called CompressAI, “a PyTorch library and evaluation platform for end-to-end compression research.” During our discussion, InterDigital’s Fabien Racapé, whom I interviewed, mentioned three ways CompressAI was being used.

  1. To improve existing codec performance with a bitstream compatible with existing hardware and software decoders (like Visionular, Harmonic, and Codec Market).
  2. To improve existing codec performance with an incompatible bitstream that requires new decoding hardware or software.
  3. To build a completely new codec (like Deep Render).

Note that Fabien didn’t indicate that any of the companies mentioned actually use CompressAI – I just added them as examples.
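
For readers who want to experiment with end-to-end learned compression themselves, here is a minimal sketch using one of CompressAI’s pretrained image models (pip install compressai). The specific model, quality level, and file name are simply choices I made for the example; check the current CompressAI documentation for exact signatures, since the API can change between versions.

```python
import torch
from PIL import Image
from torchvision import transforms
from compressai.zoo import bmshj2018_factorized  # one of CompressAI's pretrained learned codecs

# Load a pretrained end-to-end image codec; quality levels typically run 1 (low) to 8 (high).
net = bmshj2018_factorized(quality=4, pretrained=True).eval()

# Read a frame and convert it to a normalized tensor with a batch dimension.
# (Dimensions may need padding to a multiple of 64 for some models.)
img = Image.open("frame.png").convert("RGB")
x = transforms.ToTensor()(img).unsqueeze(0)

with torch.no_grad():
    out = net.compress(x)                                # entropy-coded byte strings
    rec = net.decompress(out["strings"], out["shape"])   # reconstructed frame

compressed_bytes = sum(len(s[0]) for s in out["strings"])
print(f"Compressed size: {compressed_bytes} bytes")
print(f"Reconstruction tensor shape: {tuple(rec['x_hat'].shape)}")
```

Libraries like this are research scaffolding, not shippable codecs, but they make it easy to see why use cases 2 and 3 hinge on new decode hardware.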

Within the context of broadcast streaming, the first use of AI delivers benefits most quickly; in fact, based on the claims made by the companies listed above, it already does.

The second use makes little sense for broadcast since the decoder chip would have to penetrate smart TVs and mobile devices, a 6-8 year process. However, it might make sense in closed applications like autonomous cars or factory automation, where reusing a portion of a legacy codec could accelerate chip development compared to starting from scratch or using NPUs.

The Case for the AI Codec

Let’s look at use case 3 using two lenses: video for machines and broadcast. Video for machines is a closed market, which means the manufacturer chooses both the encoder and decoder hardware. This eliminates the 6-8 year integration process necessary for broadcast or general-purpose streaming. Fabien did mention that several vendors were already developing codecs for video for machines.

As you may recall, before Deep Render, there was WaveOne, an AI-based codec company bought by Apple in March 2023. In December 2020, TechCrunch addressed AI codecs’ suitability for autonomous cars and the use of NPUs for decoding. On the first point, the author, Devin Coldewey, opined,

A self-driving car, sending video between components or to a central server, could save time and improve video quality by focusing on what the autonomous system designates important — vehicles, pedestrians, animals — and not wasting time and bits on a featureless sky, trees in the distance, and so on.

This is not something that a traditional codec could readily do. For a white paper on WaveOne’s potential, see here. On using NPUs for decoding, Coldewey continued:

Just one problem: when you get a new codec, you need new hardware. But consider this: many new phones ship with a chip designed for running machine learning models, which like codecs can be accelerated, but unlike them the hardware is not bespoke for the model. So why aren’t we using this ML-optimized chip for video? Well, that’s exactly what WaveOne intends to do.

Interestingly, just two days ago, on May 7, 2024, Apple announced its M4 chip, whose Neural Engine reportedly delivers 38 trillion operations per second. For perspective, Apple has been shipping NPUs in its smartphones since 2017. While I couldn’t verify the installed base that Deep Render CTO Arsalan Zafar mentioned in this video, it doesn’t matter in a closed market like autonomous cars (see here). Not to be snarky, but for all the hullabaloo about Apple supporting AV1 in hardware, Deep Render’s as-yet-unfinished codec has a six-year head start and enjoys much more hardware support than AV1.

Figure 3. Apple’s M4 chip includes a very powerful NPU.

What about AI codecs for broadcast? We’re at an interesting point in the evolution of codec adoption. Some might call it stalled. Consider:

  • Launched in 2018, AV1 is the most hyped codec of all time. Six years later, by the end of Q1 2024, it enjoyed only 8.5% penetration on mobile, buoyed by AV1 hardware decode recently arriving in two Apple phones. If you believe Reddit and Twitter, Google’s use of software decoding for AV1 streams from YouTube shortened mobile battery life and was discontinued. Though AV1 penetration is likely much greater in the living room, that’s almost exclusively due to YouTube’s policy of only distributing videos larger than 1080p in AV1 or VP9 format. Sorry, but from the perspective of other publishers considering AV1, YouTube support does not a revolution make.
  • Launched in 2020, VVC is only installed in a sprinkling of TV sets and no mobile phones (as of 12/2023). More than four years after VVC launched, the largest IP owner, Qualcomm, still hasn’t announced chip support (though it may be imminent).
  • All previous codecs opened new markets and were very quickly adopted. Neither VVC nor AV1 opened any new markets.
    • MPEG-2 – Digital TV
    • H.264 – Adobe Flash/mobile (particularly Apple)
    • HEVC – 4K/HDR
  • Regarding the competitive advantages or USPs of newer codecs:
    • AV1 – its browser advantage over HEVC is now closed; AV1 does retain an efficiency advantage, though that gap may be narrowing.
    • VVC – 8K makes little sense in most living rooms, and 8K TVs are effectively banned in Europe.
    • VVC – VR/AR hasn’t meaningfully taken off; Apple Vision Pro uses MV-HEVC even though Apple owns significant VVC IP.
    • VVC is gaining some traction for software decode on mobile, but AFAIK, its public uses are mostly by VVC patent owners like Tencent, ByteDance, and Kwai. This won’t be convincing to independent publishers, particularly with the content royalties on the horizon.
  • Royalties are stacking for hardware vendors – H.264 (soon off patent)/HEVC and now VVC, with the potential for VP9 and AV1. This makes it less likely that CE and mobile vendors will incorporate new codecs that don’t deliver new markets.
  • Content royalties are a reality for the first time (see Avanci Video – HEVC/VP9/AV1/VVC – and Broadcom vs. Netflix), throwing true FUD into the codec adoption decision for publishers.

This was the vibe I took back from NAB, though I’ve been nose-deep in the legal side and could be overestimating the impact content royalties will have. On the other hand, when considering a codec that uses an NPU for decode:

  • One NPU should support multiple codecs and multiple generations of codecs.
  • One NPU should support multiple other ML apps, which will only become more important over the next few years.
  • NPUs don’t appear to be royalty-bearing, though the decoders that run on them likely will be.

That said, Deep Render appears to be in the lead for an AI-based codec but won’t ship one until mid-2025 or so. It’s tough to bet on a codec that’s a year or more away from being available. Though Deep Render is the most visible, you have to guess that multiple other companies, some small, some large, are developing AI-based codec technology that leverages NPUs.

Regarding NPUs, at some point in the near term, you’d expect all new edge devices, whether smart TV, OTT device, mobile, or computer, to ship with one, not for video decode but for general-purpose ML functionality. Perhaps at the same time, perhaps later, the same manufacturer will have to choose between VVC or H.267 or a codec that plays using the NPU that’s already in the bill of materials. It could be that VVC is the last codec that requires a dedicated decoder (or decoder gates in a chip) and that all future codecs will decode on NPUs.

Interestingly, the one thing we know for sure is that all smart TV and OTT device vendors will support whatever codec YouTube supports, as YouTube is the single most important must-have channel. Beyond that, as for AI-based codecs and NPUs versus VVC/H.267/AV1/AV2, it’s too early to tell.

About Jan Ozer

I help companies train new technical hires in streaming media-related positions; I also help companies optimize their codec selections and encoding stacks and evaluate new encoders and codecs. I am a contributing editor to Streaming Media Magazine, writing about codecs and encoding tools. I have written multiple authoritative books on video encoding, including Video Encoding by the Numbers: Eliminate the Guesswork from your Streaming Video (https://amzn.to/3kV6R1j) and Learn to Produce Video with FFmpeg: In Thirty Minutes or Less (https://amzn.to/3ZJih7e). I have multiple courses relating to streaming media production, all available at https://bit.ly/slc_courses. I currently work at www.netint.com as a Senior Director of Marketing.
